clarin-eric / standards

work space for the Standards and Interoperability Committee
https://www.clarin.eu/content/standards

Remove programming languages? #265

Open TomazErjavec opened 2 months ago

TomazErjavec commented 2 months ago

Much as I love Perl, I find it strange that it is included in the formats:

Just my 2 late night cents.

bansp commented 1 month ago

Thank you, Tomaž. It does seem to be outside the pattern. Gonna handle that when I'm back from vacation.

Maybe code should be handled by 'plain text in a (new) specific domain' rather than merely plain text in tool support (which is where, I guess, both these non-formats should be placed within the current system).

The domain system should be modified anyway to handle experimental-linguistic formats. Maybe a good topic for the upcoming SIC meeting.

bansp commented 1 month ago

We have found out that the programming language reference is only virtual: Lisp and Perl are referenced by the unmaintained recommendations by EKUT and CLARIN.SI. "Unmaintained" means that they must have been transferred by hand into the SIS from the "Spreadsheet Era" ;-)

The current state is indeed erroneous also from the point of view of the domain used: image -- these entries are artefacts from hand-conversion that took place in the early days of the new SIS.

I am going to fix all three (remove the first two, modify the VRT).

TomazErjavec commented 1 month ago

> Lisp and Perl are referenced by the unmaintained recommendations by EKUT and CLARIN.SI.

This is/was a bit of a vicious circle, at least for CLARIN.SI: because I saw that the two programming languages were an option, and we have nothing against having programs in these two languages in the repo, I ticked them. In other words, don't feel deterred (as it seems you won't) from deleting them just because CLARIN.SI allows them.

bansp commented 1 month ago

Hi Tomaž, Ah, it took a bit of a sentimental journey to the Spreadsheet Era for me to understand what you meant by "ticking the options" -- indeed, back then, Perl, Lisp and R were present in the list of formats.

I'd rather not trace at this point how they became part of the recommendations and why they were placed in a rather inappropriate domain -- that was probably due to some quick decision making when encoding the content of the CSC spreadsheet in the SIS. You probably recall that we (Eliza, Hanna, and myself) did a very, very quick job of up-converting the content of the spreadsheet, in the hope that the result would soon afterwards be improved by each centre in turn. (Naive youngsters... :-))

In my commit registered above, I missed the part where I should have deleted the two references. Now I've done that, in line with your initial comment, whose sentiment I share.

Still, that does leave us with a task of how to recommend to centres what to do when they want to say that they are OK with Perl and Python and even Lisp, but not OK with Microsoft Basic or something such. One place is the general info; another, which feels like a hack, is something like "plain text" in the domain of "Tool Support" -- totally unintuitive, I'm afraid. We could also have a fake format file that is called "Programming Language", has all the characteristics of (Unicode) plain text, and, again, uses the comment section for the name of the language. That feels relatively bad as well.

TomazErjavec commented 1 month ago

> Still, that does leave us with a task of how to recommend to centres what to do when they want to say that they are OK with Perl and Python and even Lisp, but not OK with Microsoft Basic or something such.

Personally, I'd just ignore programming languages and concentrate exclusively on language resources in the SIS. The two are really different, and if a centre wants to say Lisp yes, Basic no (or, more likely, source yes, compiled code no), then they should say it somewhere on their pages, and we could leave it up to them.

Also, the current "Tool Support" has, in my mind, nothing to do with tools. DTD, TAR, etc. do not really seem like tool support to me.

bansp commented 3 weeks ago

Good point about ZIP, TAR, and friends. I'm not at all sure that there is a domain where they fit, other than... "Other".

bansp commented 1 week ago

I've only recently seen a centre page that specifically stated that they want tools as well (gosh, I've seen too many in a short time, can't recall which centre it was), and by "tools" they did mean source code. So maybe a generic fSrcCode (text/plain) or something like that...?

bansp commented 1 week ago

Ah, it's stated in the DSpace-derived FAQ, e.g. here: https://clarin-pl.eu/dspace/page/faq#what-submissions-do-you-accept I probably meant that.

> but also trained language models, parsers, taggers, MT systems, linguistic web services

No specific mention of source code, but it seems it's inferrable for some of the above.

TomazErjavec commented 1 week ago

> So maybe a generic fSrcCode (text/plain)

Maybe, although I still think that removing the whole "programming languages" dimension from the SIS would be better.

bansp commented 1 week ago

I've found the quote I probably had in mind, above. It comes from BAS:

> The BAS also under certain circumstances accepts software as a linguistic resource, if the software's aim is the analysis, processing and/or administration of scientific phonetic data. (source)

... so there's "popular demand" of a sort, it seems.

Having a single description, like fSrcCode, would maybe be an appropriate level of compromise: not encouraging a proliferation of descriptions of particular languages, and at the same time highlighting the source aspect (as opposed to compiled). The <comment> field would then be a place for stating that, e.g. Perl is discouraged and Python loved, etc.
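
To make the idea a bit more concrete, here is a rough sketch of what a single generic entry could carry. It is only an illustration: the class, the field names and the level labels are hypothetical and are not taken from the actual SIS data model; only the fSrcCode identifier, the text/plain type and the use of a comment come from the discussion above.

```python
from dataclasses import dataclass

# Hypothetical recommendation levels; illustrative labels, not the SIS vocabulary.
LEVELS = ("recommended", "acceptable", "discouraged")


@dataclass(frozen=True)
class FormatRecommendation:
    """One centre's stance on one format (hypothetical structure)."""
    format_id: str      # e.g. the proposed generic "fSrcCode"
    media_type: str     # source code treated as plain text
    level: str          # one of LEVELS
    comment: str = ""   # free text: which languages are welcome, etc.

    def __post_init__(self):
        # Reject levels outside the small illustrative vocabulary.
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")


def source_code_recommendation(level: str, comment: str = "") -> FormatRecommendation:
    """Build the single generic source-code entry instead of one entry per language."""
    return FormatRecommendation("fSrcCode", "text/plain", level, comment)


# A centre states its language preferences in the comment, not as separate formats.
rec = source_code_recommendation(
    "acceptable",
    "Python loved, Perl discouraged; source only, no compiled code.",
)
print(rec)
```

The point of the sketch is that per-language preferences live in the comment, so no Perl, Lisp or R entries are needed in the list of formats.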

TomazErjavec commented 1 week ago

> Having a single description, like fSrcCode, would maybe be an appropriate level of compromise

Sure, if you don't want to ignore them completely, this would be the way to go, I think.