clarin-eric / standards

work space for the Standards and Interoperability Committee
https://www.clarin.eu/content/standards

Check where data-deposition centres point for format recommendations #14

Open bansp opened 3 years ago

bansp commented 3 years ago

Date stamp: 23 July 2024 $${\color{red}\text{This description is going to be frozen on the morning of Friday the 26th}}$$

0. Introduction

0.1. About this very note

This lead comment of ticket #14 in the repository of the Standards Information System (SIS) is part of the release packages of CLARIN recommendations for data-deposition formats, and as such it gets updated at least once per release. The minimal amount of information expected to change cyclically is an update of the date string, after having confirmed that the information is current. Ideally, we are hoping for the centres listed in sections 1-4 below to gravitate towards section 5, which lists centres that maintain information in the SIS.

The information contained herein is meant to assist various bodies in the broadly conceived CLARIN governance (notably the Standards and Interoperability Committee, the BoD and the NCF, the Assessment Committee, and especially the Technical Centres Committee) in their review- and decision-making processes.

This page is located at https://github.com/clarin-eric/standards/issues/14 . Please post remarks, updates and/or corrections in the comments section at the bottom.

0.2. General introduction (a.k.a. "Why bother")

CLARIN centres often offer deposition services. B-centres that offer such services are obligated (this is a (re-)assessment precondition, formulated as part of the CoreTrustSeal requirements) to publish explicit information about data formats that they recommend for depositions. For non-B-centres, this is not a requirement, but it is not uncommon, depending on the centre's profile and infrastructure. That obligation/practice has been encoded in one of CLARIN's Key Performance Indicators, using the following measurement: "percentage of centres offering repository services that have published an overview of formats that can be processed in their repository". (Thus the KPI measurement encompasses centres with deposition services, whereas the CTS requirement pertains to B-centres with deposition services, i.e., a subset of the KPI target group; for more details, quotes and references, see section 4 of chapter "Standards in CLARIN", by Piotr Bański and Hanna Hedeland (2022), in the CLARIN Book.)
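To make the KPI measurement quoted above concrete, here is a minimal sketch of the percentage calculation. The `Centre` record, its field names and the sample data are hypothetical and serve only as illustration; they do not reflect how the SIS or the registry actually store this information.

```python
from dataclasses import dataclass


@dataclass
class Centre:
    name: str
    offers_deposition: bool          # is the centre in the KPI target group?
    publishes_format_overview: bool  # has it published an overview of formats?


def kpi_percentage(centres):
    """Percentage of centres with deposition services that publish a format overview."""
    target = [c for c in centres if c.offers_deposition]
    if not target:
        # Assumption: an empty target group is reported as 0.0 rather than raising.
        return 0.0
    published = sum(1 for c in target if c.publishes_format_overview)
    return 100.0 * published / len(target)


# Hypothetical centres, for illustration only.
sample = [
    Centre("A", True, True),
    Centre("B", True, False),
    Centre("C", False, False),  # no deposition services: outside the target group
    Centre("D", True, True),
]
print(round(kpi_percentage(sample), 1))  # prints 66.7
```

Note that centre "C" does not lower the score: only centres that offer deposition services count towards the denominator.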

Before the SIS took wing, the requirement / good practice of publishing explicit information on recommended formats had been addressed in the following ways:

  1. publishing the information somewhere at the centre (or consortium) homepage;
  2. not publishing that information and instead directing users to by now obsolete sets of recommendations (called "external guidelines" in what follows) that are far too general to represent the given centre's research profile;
  3. using a mixture of the above approaches.

There was also a fourth group, consisting of centres with deposition services that wouldn't publish such information at all, not even as a link. No need for finger-pointing here -- that group is only mentioned for the sake of completeness.

This very ticket is devoted mainly to collecting information on centres with depositing services that point to external sources of information on format recommendations. In other words, we are looking mainly at groups 2 and 3.

The reason for listing this information is:

Eventually, the format recommendations are expected to be collected in the Standards Information System. It is possible for centres to store that information in the SIS, and to present it to users with a dedicated link, such as:

https://standards.clarin.eu/sis/views/view-centre.xq?id=IDS

It is also possible to retrieve the information from the SIS already pre-structured, as XML, to be styled according to the given centre's guidelines and published on that centre's pages, thus avoiding the chore of maintaining two separate sets of data (for more on that, see the API section of the SIS).
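As an illustration of the dedicated-link pattern shown above, the following sketch builds such a link from a centre identifier. Only the URL pattern with `id=IDS` comes from this note; the function name is made up here, and whether other centre handles can be used verbatim as the `id` value is an assumption to be checked against the SIS.

```python
from urllib.parse import urlencode

# Base of the centre view, taken from the example link in this note.
SIS_CENTRE_VIEW = "https://standards.clarin.eu/sis/views/view-centre.xq"


def sis_centre_url(centre_id: str) -> str:
    """Build a dedicated link to a centre's recommendations in the SIS."""
    # urlencode takes care of characters that are not safe in a query string.
    return f"{SIS_CENTRE_VIEW}?{urlencode({'id': centre_id})}"


print(sis_centre_url("IDS"))
# prints https://standards.clarin.eu/sis/views/view-centre.xq?id=IDS
```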

0.3. Methodology

The primary resource assumed for this task is the CLARIN centre registry at https://centres.clarin.eu/ .

Two secondary resources are:

The secondary resources appear to depend on the CLARIN centre registry, and a degree of hand-crafting (and therefore a potential update lag) can probably be assumed for the depositing-services page.

A tertiary resource is the list provided by the SIS, at https://standards.clarin.eu/sis/views/list-centres.xq . While one might be tempted to assume that that list should be at least semi-automatically derived from the centre registry, it actually provides a small potential layer of indirection, at least in two aspects: firstly, we allow centres to override the shorthand handles that are listed in the registry (and thus, for example, at the centre's request, "CLARINSI" is listed as "CLARIN.SI") and, secondly, we are prepared for a degree of "ontological" or organisational variability in the case of centres that act as nodes in more than a single research-infrastructure network. In short: centres can influence their listings in the SIS in various ways, independent of the CLARIN registry.

The B-centre status is conditioned upon a successful round of certification, managed internally by the Assessment Committee, and externally by a certification authority, currently the CTS. (Note, incidentally, that a full-fledged methodology would probably ideally start from the CTS database as the primary source, but do forgive us for not trying to shoot gnats with rockets -- the amount of time allocated to this already extensive exercise should be reasonable.) The CLARIN registry has various status strings for centres that wish to achieve the B-status, whether for the first time or having lost it and preparing for another certification round -- these are, as of June 2024, "Aiming for B", "Aiming for B.", "aiming for B" (kudos for consistency) but also "none" or "Certification expired, renewal planned". In the present note, all such centres, together with "regular" C-centres, are going to be treated as "Non-B centres". Note that, especially for centres tagged as "none" in the registry, some degree of network-internal knowledge is going to be necessary for stating which of the centres are temporarily not B only because they are getting, or preparing to get, re-certified. There's no guarantee that that knowledge is perfect, so this is a weak point in the methodology.
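The grouping described above can be sketched as a small normalisation step. The status strings are the ones quoted in this note; that a certified centre carries the bare status "B" in the registry is an assumption, and both function names are made up for this illustration.

```python
def normalise_status(raw: str) -> str:
    """Fold case and drop surrounding whitespace and a trailing full stop."""
    return raw.strip().rstrip(".").casefold()


def centre_group(raw_status: str) -> str:
    """Map a registry status string onto the two groups used in this note."""
    # Assumption: certified B-centres carry the bare status "B" in the registry.
    return "B" if normalise_status(raw_status) == "b" else "Non-B"


for status in ["Aiming for B", "Aiming for B.", "aiming for B", "none",
               "Certification expired, renewal planned", "B"]:
    print(f"{status!r} -> {centre_group(status)}")
```

The sketch deliberately does not try to guess which "none" centres are only temporarily not B; as stated above, that requires network-internal knowledge.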

The tables below are constructed by scrolling along the CLARIN registry and the SIS list in parallel, taking into account (a) B-centres and (b) non-B centres with deposition services that are known as such (note: this is a weak spot, some may escape, and the secondary CLARIN list is not trusted fully). In the process, the SIS list is updated wrt the CLARIN registry, and the result is sorted into the categories provided in the sections that follow. Doubts that arise wrt the nature of individual centres are usually signalled by issues using the "centre data" label.
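The parallel pass over the registry and the SIS list is done by hand, but its core is a comparison of two sets of handles that respects the overrides mentioned in section 0.3. Below is a minimal sketch under that reading; the override "CLARINSI" to "CLARIN.SI" is the example given in this note, while the function, the dictionary name and the handle "XYZ" are hypothetical.

```python
# Handles that a centre asked to have listed differently in the SIS.
SIS_HANDLE_OVERRIDES = {"CLARINSI": "CLARIN.SI"}


def reconcile(registry_handles, sis_handles):
    """Compare registry handles (after applying overrides) with the SIS list."""
    expected = {SIS_HANDLE_OVERRIDES.get(h, h) for h in registry_handles}
    listed = set(sis_handles)
    return {
        "missing_in_sis": sorted(expected - listed),
        "only_in_sis": sorted(listed - expected),
    }


registry = ["CLARINSI", "IDS", "XYZ"]  # "XYZ" is a made-up handle
sis = ["CLARIN.SI", "IDS"]
print(reconcile(registry, sis))
# prints {'missing_in_sis': ['XYZ'], 'only_in_sis': []}
```

An entry under `only_in_sis` is not necessarily an error, since centres can influence their SIS listings independently of the registry.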

0.4. Terminology

When, in what follows, a centre is said to "point to external guidelines", those guidelines are in too many cases general, top-down, coarse-grained standards recommendations that were formulated well over 10 years ago and were meant for a purpose different than informing users about centre-particular recommendations on what kinds of data the given centre can handle or is interested in handling. While such pointers are surely provided in good faith, they can at best be considered tricks for passing CTS certification. Otherwise, for practical purposes, they don't get the thing done.

Another piece of terminology: "listed in the SIS" vs "curated": some of the content in the SIS comes from rather quick import of information that was structured rather differently back when the Standards Committee worked with spreadsheets. A lot of interpretation happened on the way between spreadsheets and the SIS, justified by the hope that the centres would quickly want to fix that if they were not happy with the outcome. It later turned out that we were a non-tiny bit too hopeful about that. The "legacy" listings, not approved by the particular centres, are accompanied by a warning in red. When, on the other hand, a centre decides to hold an inputhon and submits the result to the SIS, such recommendations are considered curated and the red warning is replaced with the name(s) of the curator(s).

$${\text{ *}}$$

What follows is information on how the particular CLARIN centres publish format recommendations or how they do not publish that info while nevertheless trying to satisfy the CLARIN-internal as well as CTS-imposed requirements. Note the date stamp at the top of this note and please do not hesitate to let us know (ideally: in the comments below) if you see that some info can/should be updated or fixed.

1. Centres that point solely to external guidelines

This section lists centres that do not provide information specific to their research profiles but rather point to general and coarse-grained information provided by CLARIN quite a while ago, in most cases in 2009 (the "LRT Standards" document, which simply doesn't help and mentions obsolete standards).

Note: it is good not to be mentioned in this section.

Methodological note: it is possible that the centres below also provide their own recommendations or even point at the SIS for that -- that hopefully depends on how late after June 2024 you are reading this. If you discover that a centre should vanish from this section, please let us know in a comment below.

1.1. B-centres

Recall that B-centres are obligated (by CTS Requirement 8) to provide explicit information on what formats they are willing to process in the deposition process. The centres below instead point to general and at least partly obsolete guidelines. Amending this situation is at this point easy: deposit the relevant information directly in the SIS -- and then point to that description.

| Centre | LastChecked | LinkTarget | SourcePage |
| --- | --- | --- | --- |
| CLARIN-LV | 18-06-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf | https://repository.clarin.lv/repository/xmlui/page/faq |
| CLARIN-PL1 | 12-06-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf | https://clarin-pl.eu/dspace/page/faq#what-submissions-do-you-accept |
| ILC4CLARIN | 13-06-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf | https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/page/faq#what-submissions-do-you-accept |
| LINDAT | 13-06-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf | https://lindat.mff.cuni.cz/repository/xmlui/page/faq?locale-attribute=en#what-submissions-do-you-accept |

1.2. Non-B centres

These centres are not obligated to explicitly publish information about what formats they recommend for deposition. However, that is both useful for the users themselves, and also crucial for satisfying the relevant CLARIN KPI. Also, these centres are listed as aiming for the "B" status (click on "Type status" to sort them at the top), so at some point they will need to undergo CTS assessment -- why not be proactive in this respect.

| Centre | LastChecked | LinkTarget | SourcePage |
| --- | --- | --- | --- |
| CLARIN-LT | 13-06-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf | https://clarin.vdu.lt/xmlui/page/faq#what-submissions-do-you-accept |
| ERCC | 13-06-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf | https://clarin.eurac.edu/repository/xmlui/page/faq#what-submissions-do-you-accept |
| SADiLaR | 17-06-2024 | http://www.clarin.eu/recommendations<br>https://archive.mpi.nl/accepted-file-formats | https://sadilar.org/en/submit-a-resource/ |

1.3. Special mention: curated recommendations in the SIS but only (?) pointing to external guidelines

This category is hopefully only temporary -- I need to record the situation as it is, and will be happy to amend the entry as soon as I get information that the situation has changed. CLARINO_Bergen (B), IDS (C), SAW (B), Språkbanken (B) and OTA (C) have deposited their recommendations in the SIS (click on their names, below), so they are actually at the forefront and kudos to them, and yet, from their DSpace instances used to manage the deposition process, or in their homepages, the link to the LRT PDF (or equivalent) is still present, instead of a link to the SIS.

The way to get out of this section and into section 5 is trivial: edit the centre homepage or (more often) the DSpace "FAQ" section to replace the link to the LRT PDF with a dedicated link to the centre's section in the SIS.
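Whether a given FAQ page still carries the stock link can be checked mechanically. The sketch below scans a page's HTML for anchors whose target is the LRT PDF named in the tables of this note; it works on an HTML string that has already been fetched, and the class and function names are made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

# File name of the "LRT Standards" PDF as it appears (URL-decoded) in the links above.
LRT_PDF_NAME = "Standards for LRT-v6.pdf"


class _HrefCollector(HTMLParser):
    """Collect the href values of all <a> elements in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_lrt_pdf(html: str) -> bool:
    """Return True if any link in the page points at the LRT PDF."""
    collector = _HrefCollector()
    collector.feed(html)
    # Decode %20 etc. so that encoded and unencoded links are both caught.
    return any(LRT_PDF_NAME in unquote(href) for href in collector.hrefs)


stock_faq = '<p>See <a href="http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf">here</a>.</p>'
fixed_faq = '<p>See <a href="https://standards.clarin.eu/sis/views/view-centre.xq?id=IDS">here</a>.</p>'
print(links_to_lrt_pdf(stock_faq), links_to_lrt_pdf(fixed_faq))  # prints True False
```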

| Centre | LastChecked | LinkTarget | SourcePage |
| --- | --- | --- | --- |
| Språkbanken | 13-06-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf | https://repo.spraakbanken.gu.se/xmlui/page/faq#what-submissions-do-you-accept |

2. Centres that point to external guidelines in addition to publishing own information locally

There is nothing wrong with pointing to an external source in addition to the centre's own recommendations published on the centre's own pages, especially if the external resource brings in some extra value (see ACDH-ARCHE for an example, pointing to Archaeology Data Service recommendations). On the other hand, it is not so good to point to obsolete, unhelpful or misleading documents.

The role of this section is basically informative, though with a little request for sharing at least the positive recommendations in the SIS, to enable aggregation of this information. There seems to be no need to split the centres listed here into B- and non-B-.

| Centre | LastChecked | LinkTarget | SourcePage | Own info |
| --- | --- | --- | --- | --- |
| ACDH-ARCHE | 13-06-2024 | a.o. https://www.clarin.eu/content/standard-recommendations | https://arche.acdh.oeaw.ac.at/browser/formats-filenames-and-metadata | same |
| BBAW | 18-06-2024 | https://www.clarin-d.net/en/language-resources-and-services/user-guide | https://clarin.bbaw.de/en/repo/ | same |
| CLARIN-CH | 04-07-2024 | a.o. https://infoscience.epfl.ch/record/265349 | https://clarin-ch.ch/documentation-platform/standard-data-formats | https://clarin-ch.ch/documentation-platform/metadata-standards |
| CLARIN.SI | 16-06-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf<br>https://www.clarin.eu/content/standards-and-formats | https://www.clarin.si/repository/xmlui/page/data | same |
| DH-REP | 18-06-2024 | https://files.dnb.de/nestor/materialien/nestor_mat_08_eng.pdf | https://repository.de.dariah.eu/doc/services/data-policies.html#recommendations-and-list-of-preferred-formats | same |
| ORTOLANG | 18-06-2024 | https://www.clarin.eu/content/standard-recommendations | https://www.ortolang.fr/en/help/data-formats/ | https://facile.cines.fr/ via https://www.ortolang.fr/en/home/about/ |
| TGrep | 18-06-2024 | https://files.dnb.de/nestor/materialien/nestor_mat_08_eng.pdf | https://textgridlab.org/doc/services/data-policies.html#preferredformats | same |
| UdS | 18-06-2024 | http://www.clarin.eu/recommendations | https://fedora.clarin-d.uni-saarland.de/ressources/AcceptedFormats.en.pdf | same, via https://fedora.clarin-d.uni-saarland.de/depositors.en.html |

3. Centres that neither point anywhere nor publish their own explicit information

This set of centres should be empty, unless the centre does not offer deposition services (in which case, it shouldn't be listed here, so... this set should be empty). Please note that rather than amending the existing lack of recommendations on their own home pages, the best course of action for these centres may be to deposit the information directly in the SIS, and then point to that listing. You do one inputhon and Bob's your... list.

3.1. B-centres

"Absence of evidence is not evidence of absence", and it might be that the centres here do publish their own recommendations, in a non-obvious corner of their homepages. Please feel very welcome to post a comment below if you are able to share info on that. Note also that if the info is hidden then it's not really easily available to the depositing users, and ensuring availability of the information is part of the reason for this entire exercise.

| Centre | LastChecked | Comment | DepositionPage |
| --- | --- | --- | --- |
| CLARIN-IS | 15-06-2024 | (no info) | https://clarin.is/en/services/ |

3.2. Non-B centres

Some of these centres are listed as "aiming for B", some used to be B. All of them indicate that they provide deposition services.

| Centre | LastChecked | Comment | DepositionPage |
| --- | --- | --- | --- |
| CELR-EKK | 12-06-2024 | "all data is accepted", via Entu | https://www.keeleressursid.ee/en/services |
| IMS | 15-06-2024 | no real recommendations | https://wiki.ims.uni-stuttgart.de/extern/CLARIN-D |
| | 15-06-2024 | "please contact us" | http://clarin04.ims.uni-stuttgart.de/repo/ |
| MI | 19-06-2024 | "please contact [us]" | https://meertens.knaw.nl/meertens-collectie/research-data-management/<br>https://meertens.knaw.nl/en/archive/depositing-data_eng_/ |

4. Centres that only publish their own, local recommendations

Note that this satisfies both the CTS requirement and the KPI calculation (except the KPI calculation performed dynamically by the SIS). Unfortunately, it also ensures a gap in the SIS-derived statistics that might otherwise benefit the entire network. It would be greatly appreciated if at least the data formats recommended by the centres could make it into the SIS. The table below includes both B- and C-centres.

Note: Formally, all these centres have done a splendid job. Adding their recommendations to the SIS would be a nice bonus to ensure more accurate statistics.

| Centre | LastChecked | Linked from / Comment | InfoPage |
| --- | --- | --- | --- |
| BAS | 18-06-2024 | https://clarin.phonetik.uni-muenchen.de/BASRepository/index.php | https://www.phonetik.uni-muenchen.de/Bas/BasPolicyExternalResources_eng.pdf |
| CLARIN-DK | 18-06-2024 | https://repository.clarin.dk/repository/xmlui/page/faq#what-data-formats-are-accepted | https://repository.clarin.dk/repository/xmlui/page/formats |
| CLARIN:EL | 18-06-2024 | https://www.clarin.gr/en/services/share | https://www.clarin.gr/sites/default/files/CLARINELRecommendedFormats.pdf |
| CMU | 18-06-2024 | https://talkbank.org/ | https://talkbank.org/share/contrib.html |
| COCOON | 18-06-2024 | https://cocoon.huma-num.fr/exist/crdo/faq.htm?lang=en | https://cocoon.huma-num.fr/exist/crdo/formats.htm?lang=en |
| DANS | 18-06-2024 | https://dans.knaw.nl/en/depositing-data-manual/before-depositing_ds/ | https://dans.knaw.nl/en/file-formats/ |
| EKUT | 18-06-2024 | menu on the main page | https://talar.sfb833.uni-tuebingen.de/datamanagement/ |
| IVDNT | 17-06-2024 | access to the recommendations is not obvious | https://portal.clarin.inl.nl/doc/information_about_deposition_INT.pdf |
| LAC | 18-06-2024 | https://dch.phil-fak.uni-koeln.de/bestaende/language-archive-cologne/user-guides | https://dch.phil-fak.uni-koeln.de/bestaende/language-archive-cologne/user-guides/format-whitelist |
| MPI-PL | 18-06-2024 | https://archive.mpi.nl/tla/ + "Help" | https://archive.mpi.nl/tla/accepted-file-formats |
| TROLLing | 18-06-2024 | https://site.uit.no/dataverseno/deposit/ | https://site.uit.no/dataverseno/deposit/prepare/ |
| ZIM | 18-06-2024 | https://informationsmodellierung.uni-graz.at/en/about-the-department/research-data-repository-gams/ | https://gams.uni-graz.at/context:gams?mode=about&locale=en |

5. Centres that point at their curated recommendations in the SIS

This is where all (or most of) the centres listed above should ideally end up -- what is needed for them is to maintain the information served by the SIS and explicitly link to it. ("Ideally" from the point of view of contributing to the aggregated information; note that for centres in sections 2 and 4, this is a matter of willingness and sparing the time; they are otherwise fine from the point of view of certification and KPI calculation done by hand, rather than in the SIS).

| Centre | LastChecked | SourcePage |
| --- | --- | --- |
| CLARINO_Bergen | 19-06-2024 | https://repo.clarino.uib.no/xmlui/page/faq#what-submissions-do-you-accept |
| FIN-CLARIN | 14-06-2024 | https://www.kielipankki.fi/tuki/tekninen-muoto/<br>https://www.kielipankki.fi/tuki/korp-formaatti/ |
| IDS | 17-06-2024 | https://repos.ids-mannheim.de/reposdescription.html |
| OTA | 02-07-2024 | http://www.clarin.eu/sites/default/files/Standards%20for%20LRT-v6.pdf<br>https://llds.ling-phil.ox.ac.uk/llds/xmlui/page/faq#what-submissions-do-you-accept |
| PORTULAN | 08-07-2024 | https://portulanclarin.net/usage/#how |
| SAW | 19-06-2024 | https://repo.data.saw-leipzig.de/depositing/en |
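The categories used in sections 1-5 can be read as a simple decision rule over four yes/no observations about a centre. The sketch below is one possible reading of that rule, not the procedure actually used to build the tables; the parameter names are made up, and giving section 5 priority over the other categories is an assumption.

```python
def section_for(points_external: bool, own_local_info: bool,
                curated_in_sis: bool, links_to_sis: bool) -> str:
    """Return the section of this note under which a centre would be listed."""
    # Assumption: a centre that curates its SIS entry and links to it goes to
    # section 5, regardless of what else it publishes.
    if curated_in_sis and links_to_sis:
        return "5"
    if points_external and own_local_info:
        return "2"
    if points_external:
        # Curated in the SIS, but users are still sent to external guidelines.
        return "1.3" if curated_in_sis else "1"
    if own_local_info:
        return "4"
    return "3"


print(section_for(points_external=True, own_local_info=False,
                  curated_in_sis=True, links_to_sis=False))  # prints 1.3
```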

6. Conclusions

6.1. One conclusion that should be drawn from the picture above is that the FAQ contained in the LINDAT customisation of DSpace (the deposition system that unifies many repositories, currently) should no longer point users at the LRT PDF but

6.2. Following up on the above, the default landing page (content/standard-recommendations) should, at the top, point at the combined recommendations in the SIS. (that got handled on 17-06-2024)

6.3. Centres which provide their own extensive recommendations will hopefully be willing to share at least their recommended (as opposed to accepted and discouraged) formats, so that (a) the KPI can be properly calculated in the SIS, and (b) the statistics of popular formats are not skewed due to the lack of data coming from those centres.

bansp commented 3 years ago

Let us edit the leading note, to extend (or shrink!) the table. The table might be a good addendum to the April release, so let me set a milestone here.

TomazErjavec commented 3 years ago

I agree, interesting table, thanks!

bansp commented 3 years ago

I seem to recall Leif-Jöran pointing me to stock data-deposition guidelines for Språkbanken (pointing at the "Standards for LRT" PDF, I think), but I am unable to locate that page now at https://spraakbanken.gu.se/en

bansp commented 3 years ago

> I seem to recall Leif-Jöran pointing me to stock data-deposition guidelines for Språkbanken (pointing at the "Standards for LRT" PDF, I think), but I am unable to locate that page now at https://spraakbanken.gu.se/en

Thanks to Hanna for digging up https://repo.spraakbanken.gu.se/xmlui/page/faq#what-submissions-do-you-accept for me. I spent quite a while at the Sprakbanken site yesterday night, trying to find my way to the deposition guidelines as a "naive user coming from outside", which makes me wonder how (un)easy it is to find that. I'd be grateful for an independent check. And in the meantime, I'll update the table.

bansp commented 3 years ago

OK, the way to the deposition info at Gothenburg is through the "Tools" in the menu, then one has to navigate to the item mentioning CLARIN, and that takes them to the repository page. So I simply overlooked this route yesterday.

bansp commented 3 years ago

I've just gone through all the links and verified that the info is current. Getting the ticket exported as PDF requires a lot of tinkering in the "Inspect" box and then using custom zoom (55%) in the print window.

bansp commented 2 years ago

Update needed to mention Iceland: https://repository.clarin.is/repository/xmlui/page/faq#what-submissions-do-you-accept

Apart from that, divide the list into three rather than two, with a separate part for centres that do mention their own recommendations while also referencing the "LRT standards" document (mostly that one, because of the stock FAQ).

bansp commented 2 years ago

Costanza Navarretta has just kindly pointed me to the recommendations for CLARIN-DK: https://info.clarin.dk/en/the-clarin-dk-infrastructure/recommended-standards-and-formats/

I'll redo the table in 3 parts within daaays, I hope.

bansp commented 2 years ago

(Ah, the reason CLARIN-DK isn't mentioned above is that it doesn't point externally, and the topic of this ticket is external pointers)

bansp commented 2 years ago

While updating the info, I'm unable to access https://repo.spraakbanken.gu.se/xmlui -- making a note of that here, to check again on Monday.

bansp commented 2 years ago

I will take a snapshot of the ticket on Monday afternoon, publish the snapshot and reset the milestone to 1.1.

bansp commented 2 years ago

Posted a snapshot, moving the ticket to milestone 1.1.

bansp commented 1 year ago

The DSpace FAQ is here, I believe: https://github.com/ufal/clarin-dspace/blob/clarin/dspace-xmlui/src/main/webapp/themes/UFAL/lib/html/faq.html

bansp commented 11 months ago

I have gone through the tables above and posted short updates. In a few cases, I crossed a centre out (that's actually good!), in one case, I moved a centre from "aiming at B" to "B" (congrats!). This was a quick check, so in case you see an error or omission, please post a note here.

Overall impression after 2 years: revolution still needs to happen.

bansp commented 10 months ago

Maria Gavriilidou has sent me the following info on CLARIN:EL: