IQSS / dataverse

Open source research data repository software
http://dataverse.org

Make Dataverse produce valid DDI codebook 2.5 XML #3648

Closed jomtov closed 1 year ago

jomtov commented 7 years ago

Forwarded from the ticket: https://help.hmdc.harvard.edu/Ticket/Display.html?id=245607


Hello, I tried to validate two items exported to DDI from dataverse.harvard.edu with codebook.xsd (2.5) and got the same types of validation errors described below for item1 (below the line, should work as a well-formed xml-file):

Item 1:https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BAMCSI

Item 2: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/P4JTOD

What could be done about it (other than meddling with the schema)?

Best regards,

Joakim Philipson Research Data Analyst, Ph.D., MLIS Stockholm University Library

Stockholm University SE-106 91 Stockholm Sweden

Tel: +46-8-16 29 50 Mobile: +46-72-1464702 E-mail: joakim.philipson@sub.su.se http://orcid.org/0000-0001-5699-994X

[Flattened DDI XML export for Item 1, abridged here: it contains the dataset title "What's in a name? : Sense and Reference in biodiversity information", doi:10.7910/DVN/BAMCSI, the Harvard Dataverse citation (Philipson, Joakim, 2017, V1), subjects (Medicine, Health and Life Sciences; Computer and Information Science), keywords (Metadata, PID system, Biodiversity, Taxonomy), the dataset abstract on names and references in biodiversity information (discussing Shakespeare, Frege, Wittgenstein, Linked Open Data, and the Global Names Architecture), and the "CC0 Waiver" terms of use.]

dataverse_1062_philipsonErrorTypes.txt

jggautier commented 7 years ago

Thanks @jomtov for moving this issue from our support system!

I thought it might be helpful to give some background on the issue, list what might need to change to make the DDI XML valid, and describe the errors.

As background for anyone else interested, the DDI xml that Dataverse generates for each dataset (and datafile) needs to follow DDI's schema, so that other repositories and applications using DDI xml can use it (e.g. during harvesting).

To answer jomtov's question, I think Dataverse's XML output would need to be corrected. Here's what I imagine will need to be adjusted to fix the errors and make the XML valid:

There are five errors here, described in the dataverse_1062_philipsonErrorTypes.txt file in jomtov's post:

1. DDI schema doesn't like "DVN" as a value for source in <verStmt source="DVN">; only "archive" and "producer" are allowed as values.

2. DDI schema doesn't like the URI attribute being called "URI":

_Attribute 'URI' is not allowed to appear in element 'keyword'._

As jomtov points out, the keyword URI is called vocabURI in Dataverse. Unless there's a reason why it's called URI in the DDI XML, I think this is as easy as changing "URI" to "vocabURI", which is okay with the schema.

<keyword vocab="term" vocabURI="http://vocabulary.org/">Metadata</keyword>

3. DDI schema doesn't like where "contact" info is placed:

<sumDscr/>
  <contact affiliation="A University" email="email@domain.com">Name</contact>

_Invalid content was found starting with element '{"ddi:codebook:2_5":contact}'. One of '{"ddi:codebook:2_5":sumDscr, "ddi:codebook:2_5":qualityStatement, "ddi:codebook:2_5":notes, "ddi:codebook:2_5":exPostEvaluation}' is expected._

The DDI schema says that sumDscr shouldn't hold things like contact info. The contact element should be under useStmt:

<useStmt>
...
    <contact affiliation="A University" email="email@domain.com">Name</contact>
...
</useStmt>
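The relocation described above can be sketched with the JDK's DOM API. This is only an illustration of the fix, not Dataverse's actual export code; the element names match the snippets above, and namespaces are omitted for brevity.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;

public class RelocateContact {
    /** Moves the first <contact> element under the first <useStmt> element. */
    public static String fix(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // The misplaced <contact> sits next to <sumDscr>, where the schema
        // doesn't allow it ...
        Node contact = doc.getElementsByTagName("contact").item(0);
        // ... so detach it and re-attach it under <useStmt>.
        Node useStmt = doc.getElementsByTagName("useStmt").item(0);
        contact.getParentNode().removeChild(contact);
        useStmt.appendChild(contact);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```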

4 and 5. DDI schema doesn't like <useStmt> being followed by a value, here the value being the license: <useStmt>CC0 Waiver</useStmt>

Two of the elements that can be nested under <useStmt> are <restrctn> and <conditions>. Either element seems appropriate to me for holding license info. The schema's descriptions of the two elements make <conditions> sound like a catchall and <restrctn> sound like the primary element to use. However, ICPSR uses <conditions> for license-like info.

Lastly, this isn't one of the five errors reported, but DDI likes <dataAccs> a level under <useStmt>. (Right now it's a level under <stdydscr>.) So the following change should fix these errors:

<useStmt>
  <dataAccs>
    <conditions>CC0 Waiver</conditions>
    <contact>...</contact>
  </dataAccs>
</useStmt>
jggautier commented 7 years ago

There may be more validation errors (since these two datasets have only some of all possible metadata). @raprasad and I talked yesterday about trying to validate all (or a greater number?) of Harvard Dataverse's DDI XML to find additional errors and make sure the DDI XML is always valid.

There was also some discussion about when and how Dataverse validates the DDI it generates, and making sure that process is working.

pdurbin commented 7 years ago

@jomtov would you be able to tell us what tools you're using to validate against a DDI 2.5 schema? I documented how to validate against DDI 2.0 using MSV (Multi Schema Validator) at http://guides.dataverse.org/en/4.6/developers/tools.html#msv but I seem to recall that DDI 2.5 is more complicated and requires multiple schema files or something. I don't think I ever figured out how to use MSV to validate DDI 2.5. Do you use some other tool? Any tips for me? Thanks!

jomtov commented 7 years ago

@pdurbin, I used the schema found in the schemaLocation of the exported XML files of the item examples above: <codeBook xmlns="ddi:codebook:2_5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5"> in the oXygen XML editor 18 with the Xerces validation engine. I don't think you need to invoke multiple schemas here; the error types are clearly described and have corresponding entries in the codebook.xsd 2.5 schema.
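For anyone without oXygen, the same kind of check can be run with the JAXP validation API that ships with the JDK (Xerces under the hood). Below is a minimal, self-contained sketch; the tiny schema mimics the verStmt "source" enumeration from codebook.xsd, it is not the real schema.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import java.io.StringReader;

public class ValidateAgainstXsd {
    /** Returns null if the XML is valid, otherwise the first validation error. */
    public static String firstError(String xsd, String xml) {
        try {
            SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = sf.newSchema(new StreamSource(new StringReader(xsd)));
            Validator v = schema.newValidator();
            // The default error handler throws on the first validation error.
            v.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (Exception e) {
            return e.getMessage();
        }
    }

    // Toy stand-in for the verStmt "source" enumeration in codebook.xsd.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        " <xs:element name='verStmt'>" +
        "  <xs:complexType>" +
        "   <xs:attribute name='source'>" +
        "    <xs:simpleType>" +
        "     <xs:restriction base='xs:string'>" +
        "      <xs:enumeration value='archive'/>" +
        "      <xs:enumeration value='producer'/>" +
        "     </xs:restriction>" +
        "    </xs:simpleType>" +
        "   </xs:attribute>" +
        "  </xs:complexType>" +
        " </xs:element>" +
        "</xs:schema>";
}
```

With the real codebook.xsd, you would point the StreamSource at the downloaded schema file instead.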

pdurbin commented 7 years ago

Ah, thanks @jomtov. Judging from its Wikipedia page, the Oxygen XML Editor is not free and open source. Bummer.

In a491cd9 I just pushed some code to demonstrate the difficulty I've seen in validating against that codebook.xsd file you mentioned, which I checked into the code base long ago when I first attempted (and failed) to get Dataverse to validate the DDI 2.5 it exports.

The failing Travis build from that commit demonstrates the error I'm seeing:

Tests in error:

testValidateXml(edu.harvard.iq.dataverse.util.xml.XmlValidatorTest): src-resolve: Cannot resolve the name 'xml:lang' to a(n) 'attribute declaration' component.

That's from https://travis-ci.org/IQSS/dataverse/builds/208627544#L3805

Does anyone have any idea how to fix this test? Here's the line that's failing: https://github.com/IQSS/dataverse/blob/a491cd941493f498c320dc79f35d430e623710c8/src/test/java/edu/harvard/iq/dataverse/util/xml/XmlValidatorTest.java#L26
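One workaround I haven't fully verified: that src-resolve error usually means the schema for the xml: namespace (xml.xsd, which codebook.xsd imports for attributes like xml:lang) couldn't be resolved. The JAXP API accepts an array of sources, so the imported schema can be supplied up front instead of being fetched. The sketch below uses toy stand-in namespaces, not the real xml.xsd/codebook.xsd:

```java
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import java.io.StringReader;

public class MultiSchema {
    // Stand-in for xml.xsd: declares a global attribute in its own namespace.
    static final String ATTR_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'" +
        "           targetNamespace='urn:attrs'>" +
        "  <xs:attribute name='lang' type='xs:string'/>" +
        "</xs:schema>";

    // Stand-in for codebook.xsd: imports that namespace without a schemaLocation.
    static final String MAIN_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' xmlns:a='urn:attrs'>" +
        "  <xs:import namespace='urn:attrs'/>" +
        "  <xs:element name='doc'>" +
        "    <xs:complexType><xs:attribute ref='a:lang'/></xs:complexType>" +
        "  </xs:element>" +
        "</xs:schema>";

    public static Schema build() throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // Passing both sources lets the import resolve against the first one,
        // so nothing has to be fetched over the network.
        return sf.newSchema(new Source[] {
            new StreamSource(new StringReader(ATTR_XSD)),
            new StreamSource(new StringReader(MAIN_XSD)) });
    }
}
```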

jomtov commented 7 years ago

Well, @pdurbin, https://www.corefiling.com/opensource/schemaValidate.html (also on GitHub) is a free online XML validator that seems to work anyway. I uploaded the codebook.xsd and one of the erroneous export items from above and validated them; they're attached here as .txt files (since .xsd and .xml are not supported by GitHub), to be 'reconverted' before use: codebook.txt dataverse_1062_Philipson_newexp2DDIcb.txt

True, the validator did not find some of the other referenced schemas, but they are not relevant here, and all the specific codebook.xsd validation errors seem to be identified anyway (scrolling down in the results):

Validation 1, 504 cvc-enumeration-valid: Value 'DVN' is not facet-valid with respect to enumeration '[archive, producer]'. It must be a value from the enumeration.
Validation 1, 504 cvc-attribute.3: The value 'DVN' of attribute 'source' on element 'verStmt' is not valid with respect to its type, '#AnonType_sourceGLOBALS'.
Validation 1, 1314 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 1402 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 1498 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 1600 cvc-complex-type.3.2.2: Attribute 'URI' is not allowed to appear in element 'keyword'.
Validation 1, 3918 cvc-complex-type.2.4.a: Invalid content was found starting with element 'contact'. One of '{"ddi:codebook:2_5":sumDscr, "ddi:codebook:2_5":qualityStatement, "ddi:codebook:2_5":notes, "ddi:codebook:2_5":exPostEvaluation}' is expected.
Validation 1, 4071 cvc-complex-type.2.4.a: Invalid content was found starting with element 'useStmt'. One of '{"ddi:codebook:2_5":method, "ddi:codebook:2_5":dataAccs, "ddi:codebook:2_5":othrStdyMat, "ddi:codebook:2_5":notes}' is expected.
Validation 1, 4091 cvc-complex-type.2.3: Element 'useStmt' cannot have character [children], because the type's content type is element-only.

Maybe this could be useful?

pdurbin commented 7 years ago

@jomtov thanks for the pointer to https://www.corefiling.com/opensource/schemaValidate.html which I just tried. It seems to work great. It's perfect for one-off validation of an XML file against a schema. To be clear, what I was trying to say in https://github.com/IQSS/dataverse/issues/3648#issuecomment-284756817 is that I'd like to teach Dataverse itself to validate XML against a schema. It works for DDI 2.0 but not DDI 2.5. I still don't understand why. For the Java developers reading this, a491cd9 is the commit I made the other day.

jggautier commented 7 years ago

Hello, I tried to validate two items exported to DDI from dataverse.harvard.edu with codebook.xsd (2.5) and got the same types of validation errors described below for item1 (below the line, should work as a well-formed xml-file):

Item 1:https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BAMCSI (direct link to dataset's DDI xml)

Hi @jomtov. Here's the corrected DDI xml for the first dataset: valid_DDIXMLforItem1.zip. At first I misinterpreted the errors you posted, but I've got it down now. It's valid as far as I can tell. The online tool you mentioned keeps timing out for me. When you get the chance, could you check to see if the corrected DDI xml is valid with the tool you use?

A while back @pdurbin posted a DDI xml file for a dataset with most of the metadata fields that Dataverse exports. That file and the corrected file (validated with "topic classification" included) are here: invalid_and_valid_DDIxml.zip. Most of the corrections were just moving elements around in the xml, but some involved changing which fields the elements go into or how many times an element can be repeated. (For example, CC0, or what's entered into Terms of Use if CC0 isn't chosen, can't go into useStmt, since that element doesn't take a value; it takes only other elements, and license metadata doesn't fit in those subelements. I moved it to the copyright element, where ICPSR and ADA put their license metadata.) These changes mean:

I'd like to rename this issue to something like "Make Dataverse produce valid DDI codebook 2.5 xml", which would involve "teaching Dataverse itself to validate" DDI xml against the codebook 2.5 schema.

pdurbin commented 7 years ago

@jomtov are you ok with renaming this issue as @jggautier suggests?

jomtov commented 7 years ago

@pdurbin and @jggautier, Yes, I am OK with the renaming suggested. (Sorry for belated answer, been on vacation off-line for a while.) Keep up the good work!

jggautier commented 7 years ago

The xml files in my earlier comment (ZIP file) don't have most of the metadata in the Terms tab, so the corrections don't take that metadata into account. Current exported DDI from Dataverse has most of the Terms metadata in the right DDI element, but just in the wrong place in the xml.

The exception is the Terms of Access metadata field - whatever's entered there is exported to DDI's dataAccs element, which shouldn't take a value (like the useStmt problem in my earlier comment). The Terms of Access field deals with file level restrictions, which may be handled differently with the upcoming work on DataTags integration, so work may need to be done to map file-level terms and access metadata to DDI.

jggautier commented 7 years ago

I wrote a doc describing what I think are most of the mapping changes needed: https://drive.google.com/open?id=1ICXRL8DP5fCGYiRyRphh_3OotNaWOOak1VmnyufBNsM

I'm pointing our ADA friends to this issue and doc, especially the part about the Terms metadata, since I think the invalid mapping has complicated their own work mapping ADA's DDI to Dataverse's for their planned migration.

pdurbin commented 7 years ago

I rewrote the XML validator in Dataverse and now have a test to validate XML we send to DataCite (it operates on a static file) and I added a FIXME to use the validator with DDI as well: https://github.com/IQSS/dataverse/blob/825332bef8fbb2de23b6fe0fe261ae0bc173194d/src/test/java/edu/harvard/iq/dataverse/util/xml/XmlValidatorTest.java#L22

mheppler commented 5 years ago

There was a recent PR submitted, related to codebooks: "739 html codebook" #6081.

jggautier commented 4 years ago

In the document about making Dataverse's DDI XML valid, I added a section about how the XML becomes invalid when depositors enter double quotes in some of Dataverse's fields (specifically any field mapped to an element attribute, e.g. Author affiliation).

I also updated the example valid DDI XML to use https in the schema location URL (https://github.com/IQSS/dataverse/issues/6553)

jggautier commented 4 years ago

Can't believe it took me this long to realize and ask about it, but in 2017 @pdurbin wrote:

To be clear, what I was trying to say in IQSS/dataverse#3648 (comment) is that I'd like to teach Dataverse itself to validate XML against a schema. It works for DDI 2.0 but not DDi 2.5. I still don't understand why.

By "it works for DDI 2.0," does that mean that Dataverse's DDI exports validate against the DDI Codebook 2.0 schema (or used to validate against the 2.0 schema back in 2017)? If so, should the DDI exports be pointing to the 2.0 schema location instead of the 2.5 schema location?

pdurbin commented 4 years ago

@jggautier in a491cd9 I had a test in Dataverse that validates against the DDI 2.0 Codebook Schema...

Screen Shot 2020-02-25 at 4 47 05 PM

... but that was in a branch called "3648-ddi-2.5-validation" that was never merged. It looks like I wrote about this a bit at https://github.com/IQSS/dataverse/issues/3648#issuecomment-284756817 and please see also the code comments above in the screenshot.

jomtov commented 4 years ago

This is an excerpt from an e-mail I sent on Jan. 26, 2020, to @scolapasta after the European Dataverse workshop in Tromsø in January 2020.

Here just a few references to the issues I mentioned then:

https://github.com/IQSS/dataverse/issues/3648#issuecomment-315192962

tried again with DDI md export of https://doi.org/10.7910/DVN/YLWCSU

and this one: https://doi.org/10.7910/DVN/F6OLFG

http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd

These are not the exact same errors as in the above issue, but still some instances of possibly unnecessary / avoidable errors.

jggautier commented 4 years ago

@jomtov, some of the easier-to-fix validation errors, including most of the five discussed in this comment, are being addressed as part of an effort to improve exporting DDI from one Dataverse repository and importing that DDI into another Dataverse repository (#6669), although it doesn't fix everything. The doc I shared a few years ago addresses all of the errors and details solutions for some of the not so easy to fix errors. I plan to update that doc once the changes in IQSS/dataverse#6669 go live.

jomtov commented 4 years ago

@jggautier, Great! Appreciate your efforts.

JingMa87 commented 4 years ago

Since this issue is from 2017, most of the 5 issues Julian mentioned were already fixed by https://github.com/IQSS/dataverse/issues/6650. I'll go over them below and note which one I'm fixing.

1. DDI schema doesn't like "DVN" as a value for source in <verStmt source="DVN">. Only "archive" and "producer" are allowed as values.

I changed the one instance of "DVN" into the default value "producer".

2. DDI schema doesn't like the URI attribute being called "URI":

_Attribute 'URI' is not allowed to appear in element 'keyword'._

This has already been fixed and merged.

3. DDI schema doesn't like where "contact" info is placed:

<sumDscr/>
  <contact affiliation="A University" email="email@domain.com">Name</contact>

This has already been fixed and merged.

4 and 5. DDI schema doesn't like <useStmt> being followed by a value, here the value being the license: <useStmt>CC0 Waiver</useStmt>

This has already been fixed and merged. Dataverse now puts the license info in the notes element and not in the useStmt element: image

Lastly, this isn't one of the five errors reported, but DDI likes <dataAccs> a level under <useStmt>. (Right now it's a level under <stdydscr>.)

According to the codebook, <dataAccs> should be under <stdydscr> so there's no problem: image

I'll also go over https://github.com/IQSS/dataverse/issues/6650 and Julian's Google doc to see if there are extra improvements I can make. I've tried online XSD validators, including the ones mentioned, but they all fail when using the 2.5 codebook XSD.

JingMa87 commented 4 years ago

@jggautier

geoBndBox I want to make changes to the Geographic Bounding Box data below, but how can I test it on my local Dataverse? I can't seem to add geo metadata.

image

distrbtr Have you concluded what you want to do with the logo URL yet? Currently, it's in the role attribute but this attribute is invalid.

jggautier commented 4 years ago

Thanks @JingMa87.

Just an FYI, I commented that since some of the issues have been fixed, I would update the Google Doc that addresses all of the validation errors and details solutions for some of the not so easy to fix errors. But I haven't found time to do that, so some of the problems described in the Google Doc have already been fixed.

jggautier commented 4 years ago

geoBndBox I want to make changes to the Geographic Bounding Box data below, but how can I test it on my local Dataverse? I can't seem to add geo metadata.

You're not able to add the geospatial metadatablock to your local Dataverse? Have you had a chance to look at the metadata customization sections in the admin guides?

distrbtr Have you concluded what you want to do with the logo URL yet? Currently, it's in the role attribute but this attribute is invalid.

I opened https://github.com/IQSS/dataverse/issues/4428 about one problem with the distributor and producer logo fields being broken URLs. @pdurbin mentioned issues with logo URLs that use http and https. But the issue doesn't mention how storing the logo URLs in the DDI export invalidates the DDI.xml. The Google Doc does. One solution mentioned in https://github.com/IQSS/dataverse/issues/4428 would involve removing logo URL metadata from the DDI export entirely, which would fix the DDI validation issues it causes. But the work of thinking through that and other solutions hasn't been prioritized.

JingMa87 commented 4 years ago

@jggautier

All four points? To me the optionality of each point of a bounding box is strange since you need all four to make the box.

image

Moreover, DDI requires a bounding box element to have exactly one occurrence of every point. The default minOccurs and maxOccurs is 1 when not specified.

image

To me it makes more sense that the four points should always be filled in when the "Geographic Bounding Box" checkbox is ticked, so the four options in the UI for the points should be removed. But this depends, of course, on the people who upload this kind of data.

image

Unlimited boxes? Also, the geoBndBox element can only occur once according to DDI. In Dataverse you can add as many as you want; does it logically make sense to add unlimited amounts then? Do researchers define multiple boxes?

image

image
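The occurrence constraints discussed above can be summed up in a short sketch. This is a paraphrase of my reading of codebook.xsd (annotations and type details omitted), not a verbatim excerpt of the schema:

```xml
<!-- Paraphrased shape of geoBndBox in codebook.xsd (not verbatim) -->
<xs:element name="geoBndBox" minOccurs="0">      <!-- at most one box per study -->
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="westBL"/>   <!-- minOccurs/maxOccurs default to 1, -->
      <xs:element ref="eastBL"/>   <!-- so each cardinal point is required -->
      <xs:element ref="southBL"/>  <!-- exactly once -->
      <xs:element ref="northBL"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```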

Conclusion So we might need both the filtering of the metadata in the database when making the DDI XML for legacy data, as well as a change in the UI for future data in order to comply with DDI.

pdurbin commented 4 years ago

To me it makes more sense that the four points should always be filled in when the "Geographic Bounding Box" checkbox is ticked, so the four options in the UI for the points should be removed.

I think this is an excellent point but it probably deserves to be in its own issue (with the great screenshots). So @JingMa87 if you feel like creating one, please go ahead.

poikilotherm commented 3 years ago

I stumbled over this while refactoring tests for IQSS/dataverse#8000.

Using XmlAssert to check for validity of a fixed version of the "non-test" DdiExporterTest.testExportDataset(), I learned that Dataverse seems to screw up some XML.

grafik

Looks like either DDI 2.5 needs a fix or we do :smile:

BPeuch commented 2 years ago

Hello everybody,

Kind of necroing this issue because we got kicked in the face recently when we learned that the CESSDA Data Catalogue (CDC) administrators had found formal errors in the DDI outputs of all data providers that rely on Dataverse… and that our data would subsequently be rejected by the catalogue's crawler and no longer appear in the CDC until the XML is fixed.

Attached below is the list of schema violations for SODHA (our production instance is still version 4.20). If I interpreted these correctly, the recurring errors are:

I wish we could contribute to the Dataverse code, but unfortunately we don't have the necessary personnel for that at the moment :(

But because putting our data out there is a big priority for us, I have to ask: do you think these issues are likely to be solved in the near future?

Sodha Schema Violations 2021-11-17.csv

jggautier commented 2 years ago

Hi @BPeuch. Ha! I don't think you necroed this issue. Since v4.20, there have been a few community contributions that have made the Dataverse software produce these exports that are closer to valid, but I know there are still lots of ways that the Dataverse software can produce invalid DDI-C XML exports.

I think this issue needs to be updated (or a new issue opened and this one closed) with information about the errors produced by the DDI exports of the latest Dataverse software version.

jggautier commented 2 years ago

I'm glad my reply was somewhat helpful. It's interesting to learn that the CESSDA Data Catalogue works this way.

I wouldn't be able to answer your main question about if this could be solved in the near future. Maybe it could be a topic for a community call?

BPeuch commented 2 years ago

Indeed @jggautier that sounds like a great topic for a community call :) But I think in the meantime we will probably start working on this soon because our need is urgent. We will share the results of our work like we have already done with the SUPER DADA script.

qqmyers commented 2 years ago

FWIW: For Sciences Po, I ran the CESSDA validator on recent outputs (i.e. ~v5.9+) and, although I was focused on i18n-related warnings, I don't recall other failures. That's not to say they aren't there, but a) there are definitely improvements past v4.20, specifically related to i18n/xml:lang tags, and b) it would probably be worthwhile to verify that your issues still exist before you spend time.

jggautier commented 2 years ago

Yeah I agree, and that's how I think issues with the validation of Dataverse's DDI exports have been resolved, right? Specific issues are fixed as they block certain projects? And Dataverse doesn't check if the DDI XML it's told to import, harvest or export is valid against the schema.

But it sounded to me like the CESSDA Data Catalogue does check, and doesn't try to read the XML if it doesn't completely pass validation. So even if the errors are unrelated to the metadata that CESSDA could index, those errors would have to be resolved before CESSDA would try to crawl the DDI. That's what I found interesting, but do I have the wrong model of how CESSDA works?

BPeuch commented 2 years ago

Indeed, there are two 'books of rules' we must look at here.

The CESSDA Validator tests metadata records against the particular selection of optional DDI elements that CESSDA representatives have agreed upon (and which has been documented in several DDI profiles) to ensure consistency across metadata records from several providers.

However, in late 2021, the CESSDA Main Office announced that the DDI output of CESSDA service providers who rely on Dataverse was faulty in terms of mandatory DDI requirements. In other words, regardless of which optional elements CESSDA wants to see in metadata records, according to the Main Office, Dataverse violates several core DDI rules.

For example, as I wrote a few posts ago, the hard DDI rule is that the @source attribute of the element docDscr/citation/verStmt must have either 'archive' or 'producer' as a value. This is not a choice from CESSDA; it is the essential DDI grammar.

As far as I know, the CESSDA Validator does not check XML-DDI consistency; it only checks whether a metadata record validates against whichever CESSDA profile has been selected.

qqmyers commented 2 years ago

@BPeuch - thanks for the clarification. There are definitely other installations that want to have data included in CESSDA's results, so if you can make progress, that's great. If making code changes is the issue, getting to a clear set of changes would help with prioritizing. (I.e. for the verStmt needing archive or producer - which one? Can the community agree on one or does it need to be configurable, etc.) Your list above is probably a good start and either raising those points in a community email and/or proposing a specific solution for them to see if there are community concerns would probably be useful. A PR is even better, but I expect these sorts of questions could come up in review if they haven't been discussed. I'll try to raise the awareness of these issues in the community calls too - it's possible someone else is already looking into them as well and just hadn't found this issue.

BPeuch commented 2 years ago

We have just noticed something here at SODHA while updating a Dataverse test instance to version 5.10.1. It seems like the value encoded in GUI element "Subject" always appears twice in DDI: once as a "keyword" element with an "xml:lang" attribute, then as a second "keyword" element without that attribute, as shown below:


double


See for metadata examples:

jggautier commented 2 years ago

Thanks @BPeuch. I saw the same thing. I don't think this would make the DDI export invalid against the schema (I don't think an XML validator will complain about the repetition). But it is a problem because the repeated value would be repeated when used to create a record in a Dataverse repository or other repository, right? Would a Dataverse repository or other system know to ignore one of the repeated values, and which one to ignore?
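A downstream consumer could at least detect this duplication before indexing. A rough sketch using only the JDK's DOM API (it assumes un-namespaced keyword elements, as in the exports above, and is not part of any existing Dataverse or CESSDA code):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.util.HashSet;
import java.util.Set;

public class KeywordDupes {
    /** Returns keyword values that appear both with and without xml:lang. */
    public static Set<String> duplicated(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Set<String> withLang = new HashSet<>();
        Set<String> withoutLang = new HashSet<>();
        NodeList kws = doc.getElementsByTagName("keyword");
        for (int i = 0; i < kws.getLength(); i++) {
            Element kw = (Element) kws.item(i);
            String text = kw.getTextContent().trim();
            if (kw.hasAttribute("xml:lang")) withLang.add(text);
            else withoutLang.add(text);
        }
        withLang.retainAll(withoutLang);  // keep only values present in both sets
        return withLang;
    }
}
```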

The only other GitHub issue I could find that might be related is the closed issue at https://github.com/IQSS/dataverse/issues/8210 and a comment about internationalization support. I'm wondering what the DDI-C export would look like coming from a repository that has enabled internationalization support. Would the keyword element without the xml:lang attribute not be there? I don't think the Harvard Dataverse Repository or the Demo Dataverse has it enabled, although I don't know much about how this works. Do you know if that SODHA test repository has it enabled?

jggautier commented 2 years ago

Actually, I think Scholars Portal, running a forked version of the Dataverse software's v5.8.3 release, has internationalization "enabled," but its DDI-C exports have the same problem, e.g. https://dataverse.scholarsportal.info/api/datasets/export?exporter=ddi&persistentId=doi:10.5683/SP3/MRPMTM

pdurbin commented 2 years ago

The following issue and 9d0eeff might be related to the issue with Subject:

BPeuch commented 2 years ago

Thanks for the feedback @jggautier :) I think you're right: I don't suppose even a very stringent XML validator would mind this. Of course, it is an issue because it means one extra line per dataset (since Subject is always mandatory in non-forked Dataverse) and for instance, with Harvard Dataverse, this means 150,000 extra lines!

But we are also faced with language/internationalization problems here and I wonder if this issue might not be related to it. We are still gathering data — I don't want to blame Dataverse when the issue might actually be purely at our level 🔌🛠️

BPeuch commented 2 years ago

@jggautier This was a bit short on my part. The thing is, we have developed a script to make our metadata compliant with the CESSDA DDI profiles. This involved adding xml:lang="en" ("en" as the default value, because we did not have the time to investigate how to detect languages; the point was to get our metadata into the CESSDA Data Catalogue). But now we are studying how the version 5.7 new feature for specifying metadata language works. We will likely provide more feedback once we get it!

BPeuch commented 2 years ago

An update on that last problem!

We weren't sure exactly where Dataverse retrieved the language parameter for the xml:lang attribute. Then we found out it was the language parameter at the server level (is that correct?).

It used to indicate French, so we changed it to "en_US.UTF-8".

And now, somehow, the element duplication problem is apparently solved. 🤔

It seems so when comparing for instance the metadata of a dataset before we changed the parameter and those of a dataset after the change (link to the respective datasets' metadata included):

Before the change (duplication)

before

After the change (no more duplication)

after

pdurbin commented 2 years ago

@siacus @mreekie @qqmyers @sekmiller and I just met to talk about harvesting, especially in the context of the NIH OTA grant:

I brought up this issue (#3648) as an important one to fix. To me this issue is about making sure our DDI is valid according to the spec. (DDI is one of the formats that can be used in OAI-PMH.) The community seems to be very interested in this issue. There are validation tools that can help.

Here are some related issues:

mreekie commented 2 years ago

Outcome from today:

mreekie commented 1 year ago

Grooming Note:

mreekie commented 1 year ago

For tomorrow's discussion - From @landreev - slack

mreekie commented 1 year ago

priority review with Stefano

mreekie commented 1 year ago

Sizing:

Discussion:

Outcome:

This is relatively straightforward unless we hit an unknown unknown.

size: ~80~


We hit the end of the sizing session with still some discussion going on. We will revisit this as the first item to be sized in our next session.

qqmyers commented 1 year ago

FWIW: edu.harvard.iq.dataverse.util.xml.XMLValidator.validateXmlSchema(String fileToValidate, URL schemaToValidateAgainst) might be useful.

mreekie commented 1 year ago

sizing: