clarin-eric / standards

work space for the Standards and Interoperability Committee
https://www.clarin.eu/content/standards

Decisions made when converting recommendations from centres to SIS #49

Open bansp opened 2 years ago

bansp commented 2 years ago

This is to serve as a place to gather notes on the conversion process from the "freestyle" recommendations by individual centres into the matrix defined by the SIS. While a lot of feedback has already been produced while mass-converting data, we may be able to notice some trends or patterns that will influence the design of the SIS.

The ordering of centres is random. Everyone is welcome to pick a centre or two and recode their annotations into the SIS (please please...).


DANS (Piotr) Extremely well described page, with general remarks but also comments on nearly every individual format, and always on (formal) categories of formats.

Level of recommendation: "preferred" and "non-preferred", where the latter is mostly SIS's "acceptable", but in some cases, like for plain text that is not UCS, the mapping would be into "deprecated".

Domains: not indicated explicitly. Formats are grouped into categories, but these are formal and not straightforwardly matching any kind of functional division. For example, the category Markup language maps onto documentation, potentially some classes of metadata, text source, annotated text, etc. (Also, image source, for SVG, which is however not mentioned in the description of that category). Result: assignment to domains within the SIS should be examined by a centre representative.

Action: I attempted to minimize the interpretation, so that I wouldn't introduce too many unwanted choices. Added links to DANS pages, though I usually aimed for pages describing a category of formats rather than individual formats themselves (to save time).
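A minimal sketch of the level mapping described above, in case it helps whoever recodes the next centre. Only "non-preferred" to "acceptable" (with "deprecated" as the exception) comes from the notes; mapping "preferred" to "recommended" is my assumption, and the override ID `fPlainTextNonUCS` is purely illustrative.

```python
# Hypothetical mapping from DANS levels to SIS levels.
# "preferred" -> "recommended" is an assumption; the other values follow the notes.
DEFAULT_LEVEL = {
    "preferred": "recommended",
    "non-preferred": "acceptable",
}

# Per-format exceptions decided by hand (illustrative ID, not taken from the spreadsheet).
OVERRIDES = {
    "fPlainTextNonUCS": "deprecated",
}


def sis_level(format_id: str, dans_level: str) -> str:
    """Return the SIS level for a DANS recommendation, honouring manual overrides."""
    if format_id in OVERRIDES:
        return OVERRIDES[format_id]
    return DEFAULT_LEVEL[dans_level]
```

Keeping the exceptions in a separate table makes the hand-made decisions easy to review by a centre representative.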


Your centre


hannahedeland commented 2 years ago

For now I'll simply be encoding additional specific information, e.g. parameters for audiovisual data or metadata profiles, in the element. Maybe we want to sort that out at some point, maybe not.

hannahedeland commented 2 years ago

There are some incomplete names of formats, e.g. "Partitur" instead of "BAS Partitur Format" and no "EAF" mentioned for "ELAN", but it seems they also don't exist as formats yet, so I could maybe just add them, using more complete names?

bansp commented 2 years ago

> There are some incomplete names of formats, e.g. "Partitur" instead of "BAS Partitur Format" and no "EAF" mentioned for "ELAN", but it seems they also don't exist as formats yet, so I could maybe just add them, using more complete names?

Oh please do! If you look up the KPI spreadsheet, I put format IDs there, and the green ones belong to still non-existing descriptions. The black ones have corresponding files under data/formats

hannahedeland commented 2 years ago

I'm using the spreadsheet IDs now, though I'm not sure about them (or the "abbr"): some, for example, have the tool name in the ID instead of the format name. But I guess it really doesn't matter for now, so of course I'm not messing with them :) Should I be changing the colour in the spreadsheet, or will you be doing that after you've merged my stuff?

bansp commented 2 years ago

Oh, please don't hesitate to set the ID right if you see something off. I went through the list in quite a hurry when I learned that Eliza was willing to devote some time to that and I wanted to prepare the ground for her. Just please make a note of which ID you're changing, and I'll hunt for it and modify the old one, if it's been used. And yes, please, if you create a stub, then take the green away (it's a makeshift mark anyway, and there will be mistakes in the end, but then we can use them to test a sanity script that we are going to have at some point).

Aha, I've noticed that I wrongly assigned the same (green) ID to two different ELAN formats -- just please hack at them at will, the same goes for Exmaralda, if you feel that my decisions were wrong. And thanks :-)
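For illustration only, here is a rough sketch of what such a sanity check could look like. It assumes the spreadsheet can be exported as (ID, format name) pairs and that we know which IDs already have a description file; the function name, the sample rows, and the ID `fELAN` are all hypothetical.

```python
from collections import defaultdict


def check_ids(rows, described_ids):
    """Check (format_id, format_name) pairs against the set of described IDs.

    Returns the IDs lacking a description and the IDs attached to
    more than one distinct format name.
    """
    names_by_id = defaultdict(set)
    for format_id, name in rows:
        names_by_id[format_id].add(name)
    missing = sorted(i for i in names_by_id if i not in described_ids)
    ambiguous = {i: sorted(n) for i, n in names_by_id.items() if len(n) > 1}
    return missing, ambiguous


# Illustrative rows: the same ID given to two different formats.
rows = [
    ("fSPSS", "SPSS"),
    ("fELAN", "ELAN format A"),
    ("fELAN", "ELAN format B"),
]
missing, ambiguous = check_ids(rows, described_ids={"fSPSS"})
```

Here `missing` lists the "green" IDs that still need a stub, and `ambiguous` catches the kind of double assignment mentioned above.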

bansp commented 2 years ago

Question (and it also concerns something that Hanna asked over e-mail, namely broadly understood granularity of description): can we just have "ECMAScript" and list ".js" as a possible extension there, so that we avoid digging up the whole history of JavaScript, JScript, ECMA, and the implementation details? ES as a format would then nicely link to the corresponding ECMA standard. OTOH, if we just have one format description file for the two (actually, for the many JSs out there), shouldn't we then have an alias mechanism, so that we can alias ECMAScript as JavaScript? I treat this as the sort of practical question that I expected to come up in the process of translating into SIS...
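To make the alias idea concrete, a minimal sketch follows. It assumes one description per format with a list of alternative names; the ID `fECMAScript`, the field names, and the helper functions are hypothetical and not part of the current SIS data model.

```python
# Hypothetical registry: one description per format, plus alternative names.
FORMATS = {
    "fECMAScript": {
        "name": "ECMAScript",
        "extensions": [".js"],
        "aliases": ["JavaScript", "JScript"],
    },
}


def build_alias_index(formats):
    """Map every name and alias (case-insensitively) to its format ID."""
    index = {}
    for format_id, entry in formats.items():
        for label in [entry["name"], *entry["aliases"]]:
            key = label.casefold()
            if key in index and index[key] != format_id:
                raise ValueError(f"{label!r} is claimed by {index[key]} and {format_id}")
            index[key] = format_id
    return index


ALIAS_INDEX = build_alias_index(FORMATS)


def resolve(label):
    """Return the format ID for a name or alias, or None if unknown."""
    return ALIAS_INDEX.get(label.casefold())
```

With this, a centre that writes "JavaScript" in its recommendation would still end up at the single ECMAScript description, and a clash between two formats claiming the same alias is reported early.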

bansp commented 2 years ago

Regarding granularity, this is my decision concerning SPSS by DANS. The original says SPSS is recommended as one version ("flavour") but not as an SPSS-internal interchange format or SPSS native format. I gave it just one element:

<format>
    <name id="fSPSS">SPSS</name>
    <domain>Statistical Data</domain>
    <level>recommended</level>
    <comment>For general info, see <a
            href="https://dans.knaw.nl/en/about/services/easy/information-about-depositing-data/before-depositing/file-formats/statistical-data"
            >the DANS page for statistical data</a>. SPSS is recommended as <a
            href="http://dans.knaw.nl/en/about/services/easy/information-about-depositing-data/before-depositing/file-formats/data-and-setup"
            >"data and setup" (.dat/.sps)</a> rather than the <a
            href="http://dans.knaw.nl/en/about/services/easy/information-about-depositing-data/before-depositing/file-formats/spss-portable"
            >Portable</a> or <a
            href="https://dans.knaw.nl/en/about/services/easy/information-about-depositing-data/before-depositing/file-formats/statistical-data/spss-sav"
            >native</a> formats.</comment>
</format>

... and I'm not saying that this decision was correct -- merely recording it. The alternative would be to subdivide SPSS into a family of three formats.

bansp commented 2 years ago

For the record, I now think that the approach I took for SPSS is not optimal. This project is all about formats, and SPSS .dat/.sps is obviously a different format from SPSS .port and SPSS .sav, so hiding their existence the way I did above actually runs counter to the aims of this project. The simplest way I can think of is something that has already been used, by me when encoding TEI-based formats and by Hanna when tackling MPEG-4 (MP4) formats: to have something like the "head of the family" and then the children (in tree terms), i.e. the more specific variants. In essence, instead of the single "fSPSS" above, I should have kept the identifier for the "generalized" SPSS and created at least three stubs for the particular variants. Making a note of that now; not sure if I can get to that before the delayed release (but I'll try, because it feels like a core issue).
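A small sketch of the "head of the family plus children" layout, just to fix the idea. Only `fSPSS` comes from the snippet above; the child IDs, the `FormatStub` class, and the `parent` field are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormatStub:
    format_id: str
    name: str
    parent: Optional[str] = None  # ID of the more general format, if any


# Hypothetical child IDs; only "fSPSS" appears in the existing data.
STUBS = [
    FormatStub("fSPSS", "SPSS"),
    FormatStub("fSPSS-setup", "SPSS data and setup (.dat/.sps)", parent="fSPSS"),
    FormatStub("fSPSS-portable", "SPSS Portable", parent="fSPSS"),
    FormatStub("fSPSS-sav", "SPSS native (.sav)", parent="fSPSS"),
]


def children_of(parent_id, stubs):
    """Return the IDs of the variants directly below the given format."""
    return [s.format_id for s in stubs if s.parent == parent_id]
```

A recommendation could then point at the specific variant (for DANS, the "data and setup" one) while the generalized entry stays available for centres that do not distinguish.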

hannahedeland commented 2 years ago

I'm using DC-XML for DC Metadata recommendations since we're doing formats now (but maybe recommendations are not always a list of formats but a list of a more generic type of SIS items with attributes for ID and type, or separate lists for formats and other standards, independent of their serialization? But let's not make things more complicated).

bansp commented 2 years ago

I am not sure I fully understand the previous comment, @hannahedeland. However, what you made me think of is that "SPSS" probably doesn't make sense as a format family, but rather as a keyword (just to link the formats somehow, in the absence of a standard that could link them indirectly). So, the principle I will now try to follow is to use a keyword for various formats that, e.g., can be exported by a tool (that could of course take one too far, if a tool exports, say, .txt... but let me try to be "commonsensical" about that). If, on the other hand, those formats differ in that, say, one is XML-based and the other plain text, they will be hooked under the respective format families (XML and plainText), because that's their formal categorization.
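Roughly, the keyword-versus-family principle could be sketched like this. The tool name "ToolX", the format IDs, and the field names are invented for the example; only the family labels "XML" and "plainText" come from the comment above.

```python
# Hypothetical entries: the family records the formal categorization,
# the keywords loosely link formats that belong together (e.g. same tool).
TAGGED_FORMATS = {
    "fToolXProject": {"family": "XML", "keywords": {"ToolX"}},
    "fToolXExport": {"family": "plainText", "keywords": {"ToolX"}},
    "fOtherXML": {"family": "XML", "keywords": set()},
}


def by_keyword(keyword):
    """Return the IDs of all formats tagged with the given keyword."""
    return sorted(fid for fid, e in TAGGED_FORMATS.items() if keyword in e["keywords"])


def by_family(family):
    """Return the IDs of all formats in the given formal family."""
    return sorted(fid for fid, e in TAGGED_FORMATS.items() if e["family"] == family)
```

The two lookups are independent: the keyword gathers everything associated with one tool, while the family groups formats by what they formally are.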