IQSS / dataverse

Open source research data repository software
http://dataverse.org

Make Dataverse produce valid DDI codebook 2.5 XML #3648

Closed jomtov closed 1 year ago

jomtov commented 7 years ago

Forwarded from the ticket: https://help.hmdc.harvard.edu/Ticket/Display.html?id=245607


Hello, I tried to validate two items exported to DDI from dataverse.harvard.edu against codebook.xsd (2.5) and got the same types of validation errors, described below for item 1 (included below the line; it should work as a well-formed XML file):

Item 1:https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BAMCSI

Item 2: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/P4JTOD

What could be done about it (other than meddling with the schema)?

Best regards,

Joakim Philipson Research Data Analyst, Ph.D., MLIS Stockholm University Library

Stockholm University SE-106 91 Stockholm Sweden

Tel: +46-8-16 29 50 Mobile: +46-72-1464702 E-mail: joakim.philipson@sub.su.se http://orcid.org/0000-0001-5699-994X

[The DDI export of item 1 was pasted here, but its XML markup was stripped in transit; only the text content survives: the dataset title "What's in a name? : Sense and Reference in biodiversity information" (doi:10.7910/DVN/BAMCSI, Harvard Dataverse, V1, 2017-01-12, Philipson, Joakim), the subject keywords, and the abstract. The validation errors are listed in the attached file below.]

dataverse_1062_philipsonErrorTypes.txt

landreev commented 1 year ago

Ouch. It's insane that we've only gotten around to fixing it now. Many of these are simply cases of our code writing elements in random order where the schema defines an `<xs:sequence>` - not that difficult to fix. There are a couple of non-trivial things, though, where decisions need to be made (what to do with the bounding boxes, for example).
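To illustrate the ordering problem: a simplified sketch of what an `xs:sequence` content model looks like. (The element names are from DDI 2.5, but this is not the actual codebook.xsd, which is considerably more involved.) Inside an `xs:sequence`, child elements must appear in exactly the declared order, which is what an exporter violates if it writes elements in whatever order the code happens to reach them:

```xml
<!-- Simplified, hypothetical sketch of a DDI 2.5-style content model.
     xs:sequence imposes a fixed child order: docDscr before stdyDscr,
     stdyDscr before fileDscr, and so on. -->
<xs:element name="codeBook">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="docDscr"  minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="stdyDscr" maxOccurs="unbounded"/>
      <xs:element ref="fileDscr" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="dataDscr" minOccurs="0" maxOccurs="unbounded"/>
      <xs:element ref="otherMat" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```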

landreev commented 1 year ago

Just to clarify a couple of things from an earlier discussion:

sizing:

* We will address the immediate issue of the bad ddi xml exports by looking specifically at what has been reported.
...
* If we find that the validator needs work, we will create a new separate issue when this is complete

"Looking specifically at what has been reported" may not easily apply. This is a very old issue, with a lot of back-and-forth (that's very hard to read), and many of the things reported earlier have already been fixed in other PRs. So I assumed that the goal of the PR was "make Dataverse produce valid DDI". (i.e., if something not explicitly mentioned here is obviously failing validation, it needed to be fixed too - it did not make sense to make a PR that would fix some things, but still produce ddi records that fail validation; especially since people have been waiting for it to be fixed since 2017).

The previously discussed automatic validation - adding code to the exporter that would validate in real time every ddi record produced, and only cache it if it passes the validation - does make sense to leave as a separate sprint-sized task. (The validation itself is not hard to add; but we'll need to figure out how to report the errors.) I have enabled the validation test in DDIExporterTest.testExportDataset() however, so, in the meantime, after we merge this PR, any developer working on the ddi exporter will be alerted if they break it by introducing something invalid, because they won't be able to build their branch.
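The validate-before-caching idea could be sketched roughly like this, using the standard `javax.xml.validation` API. This is a minimal sketch under my own assumptions, not Dataverse's actual exporter code; the class and method names are hypothetical, and the tiny inline schema in `main` merely stands in for codebook.xsd:

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.Reader;
import java.io.StringReader;

public class DdiValidationSketch {

    // Validates an exported DDI record against a schema. Returns null on
    // success, or the first validation error message on failure, so the
    // caller can decide whether to cache the export or report the problem.
    public static String firstValidationError(Reader ddiXml, Reader schemaSource) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(schemaSource));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(ddiXml));
            return null; // valid: safe to cache
        } catch (Exception e) {
            return e.getMessage(); // invalid or unparsable: do not cache
        }
    }

    public static void main(String[] args) {
        // Tiny inline schema standing in for codebook.xsd, just to
        // exercise the method; it only requires a <codeBook> root element.
        String xsd =
                "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
              + "<xs:element name='codeBook'/></xs:schema>";
        String ok  = firstValidationError(new StringReader("<codeBook/>"),  new StringReader(xsd));
        String bad = firstValidationError(new StringReader("<wrongRoot/>"), new StringReader(xsd));
        System.out.println(ok == null);   // true: document matches the schema
        System.out.println(bad != null);  // true: wrong root element is rejected
    }
}
```

A design note: returning the error message rather than throwing keeps the caching decision in the caller, which matches the "only cache it if it passes" behavior described above.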

To clarify, in the current state, the exporter in my branch is producing valid ddi xml for our control "all fields" dataset, plus all the other datasets used in our tests, and whatever I could think of to test. It does NOT guarantee that there is no possible scenario where it can still output something illegal! So, yes, it is important to add auto-validation. And, if and when somebody finds another such scenario, we will treat it as a new issue.

A couple of arbitrary decisions had to be made. I will spell them out in the PR description. My general approach was: if something does not translate from our metadata to the ddi format 1:1, just drop it and move on. We don't treat it as a goal to preserve all of our metadata when exporting DC; it's obvious that only a subset of our block fields can be exported in that format. But full fidelity isn't possible with the ddi either, now that we have multiple blocks and the application is no longer centered around quantitative social science. So, no need to sweat a lost individual field here and there.

kaczmirek commented 1 year ago

To check compatibility I use the following two validators:

  1. BASE http://oval.base-search.net/ (this shows the new error "No incremental harvesting" in 12.1. I suggest adding this validator to the validation pipeline)
  2. CESSDA https://cmv.cessda.eu/#!validation with settings Validation Gate = BASIC and Profile = CESSDA DATA CATALOGUE (CDC) DDI2.5 PROFILE - MONOLINGUAL: 1.0.4. This gives both schema violations and constraint violations. (The latter are probably not relevant for Dataverse, because the constraints of the profile can differ from what the Dataverse project wants to see, although it would be good to add the attributes and tags that are recommended at Gate = STANDARD.)

It is important to pass these two validators, because passing them can result in being included and findable in a lot of aggregators like OpenAIRE, ELIXIR, and B2FIND (https://b2find.eudat.eu/), which are all important players in Europe and with respect to the European Open Science Cloud (EOSC). Currently, we have local fixes at several Dataverse installations to pass the validators (I only looked at the ones participating in CESSDA in Europe).
landreev commented 1 year ago

@kaczmirek CESSDA (https://cmv.cessda.eu/#!validation) is my favorite validator tool as well. I made a pull request the other week (#9484, linked to this issue) that fixes the numerous schema violations in our DDI export. I recommend the CESSDA validator under "how to test" there, with the same profile you mentioned ("CESSDA DATA CATALOGUE (CDC) DDI2.5 PROFILE - MONOLINGUAL: 1.0.4").