Validation Issue: correct XML file is marked as wrong

ttrippel commented 5 years ago

The curation module says for the XML validation section: Invalid Records: https_talar_sfb833_uni_tuebingen_de_8443_erdora_rest_SFB833_A02_Gedichte_20Emily_20Dickinson_Gedichtkorpus_20Emily_20Dickinson

This refers to the metadata available at https://talar.sfb833.uni-tuebingen.de:8443/erdora/displaycmdi?path=%2FSFB833%2FA02%2FGedichte+Emily+Dickinson%2FGedichtkorpus+Emily+Dickinson%2FFID_129.cmdi.xml

However, this file is valid - at least according to two XML parsers. It would be helpful to provide the complete validation error - if any - to verify that there is a problem.

ttrippel commented 5 years ago

The error is listed on https://curate.acdh.oeaw.ac.at/#!ResultView/collection//Tubingen_Archive_of_Language_Resources_TALAR_

One way of getting the record that is listed as erroneous is via curl:

curl https://talar.sfb833.uni-tuebingen.de:8443/erdora/rest/oai?verb=GetRecord\&identifier=https://talar.sfb833.uni-tuebingen.de:8443/erdora/rest/SFB833/A02/Gedichte%20Emily%20Dickinson/Gedichtkorpus%20Emily%20Dickinson\&metadataPrefix=cmdi

I validated with Xerces and Saxon within Oxygen. Any other possible source of error in the curation module? Or any other ideas what may be wrong?

kosarko commented 5 years ago

I guess I am seeing a false positive too: https://curate.acdh.oeaw.ac.at/#!ResultView/instance/url/https://vlo.clarin.eu/data/clarin/results/cmdi/LINDAT_CLARIN_digital_library_at_the_Institute_of_Formal_and_Applied_Linguistics_UFAL_Faculty_of_Mathematics_and_Physics_Charles_University/oai_lindat_mff_cuni_cz_11234_5_CESILKO.xml The error is:

xml-validation	ERROR	line: 1, col: 512 - cvc-elt.1: Cannot find the declaration of element 'cmd:CMD'.

Xerces does not complain and this seems like some kind of a namespace/xsd mixup. VLO does some magic to the namespaces and schemalocation on harvest (adding /1 or /1.x/, I think that's due to cmdiv2). But if I grab the file from VLO and use what's in schemalocation, it works; if I extract the cmdi from our oai endpoint and again validate against what's in schemalocation, it works too.

@ttrippel You get a bit more detail if you upload the particular cmdi file, see instances tab.

ttrippel commented 5 years ago

You are right, @kosarko , in my instance it reports an "invalid character", which is not invalid in XML or HTTP terms. Still a case of a false positive. WRT your issue: your file is a CMDI 1.1 file (https://lindat.mff.cuni.cz/repository/oai/cite?metadataPrefix=cmdi&handle=11234/5-CESILKO); the VLO converts it automatically to CMDI 1.2 internally. You might want to switch to CMDI 1.2, which is especially beneficial if you are using multiple profiles and providing CMDI via OAI-PMH (else you have to watch out for the namespace binding, etc.)

wowasa commented 5 years ago

Dear Gentlemen, sorry for the late reply - I hadn't gotten any notification for this issue.

We have a bug in curation 1.3 (the version currently in production) which occurs under certain conditions for files provided in CMDI 1.1 format. As you might know, the harvester automatically converts old CMDI 1.1 into current CMDI 1.2, which we then use for populating the VLO and for generating the collection reports in curation. Apparently the profile cache of curation 1.3 mixes up CMDI 1.1 and CMDI 1.2 profiles in such a way that a CMDI file is validated against a profile of the wrong version. I'm currently testing curation 2.0, which should fix the bug since it doesn't use any CMDI 1.1 profiles anymore. Uploaded CMDI 1.1 files will be transformed automatically into CMDI 1.2 and the analysis is done on this basis, since it best reflects what the importer is doing. The release is planned either for next week or the second week of January 2019.

wowasa commented 5 years ago

curation 2.0, which should solve the issue, is in production now

kosarko commented 5 years ago

@wowasa Works for me.

ttrippel commented 5 years ago

https://curate.acdh.oeaw.ac.at/#!ResultView/collection//Tubingen_Archive_of_Language_Resources_TALAR_ still shows 1 validation error, for a record that is not invalid according to my parsers. It still says illegal character, but I am positive that there is no illegal character. According to the error message, the illegal character is supposed to be somewhere in http://hdl.handle.net/11022/0000-0000-2C97-5@Bauer et al. - The Two Coeval Come.pdf</cmd:ResourceRef> To me it looks like a problem of handling the space character within a URL, which does not have to be percent encoded (%20), but often is by applications. So according to the specification the file is syntactically correct, but it seems that whitespace handling in xs:anyURI is a problem for the curation module. As some other URIs in the collection also contain whitespace, I have no explanation for what happens here.

wowasa commented 5 years ago

@ttrippel : I will have a look for the cause and answer asap. It reminds me that we have an internal issue to add the validation result to the report. This will be a feature of curation 2.1

wowasa commented 5 years ago

@ttrippel : according to the log, the validation error is not an error you can reproduce locally with a schema validation but a validation error from the linkchecker, a Java program which is used as an API in the process of collection report generation. Less technically speaking: I have to discuss with my colleagues next week whether we should treat a URL with spaces as valid, since obviously the Java class which sends a request to the URL doesn't treat it as a valid case. That would mean we would have to URL-encode it before sending the request.
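For illustration, URL-encoding the path before sending the request could look roughly like the following. This is a minimal Python sketch, not the linkchecker's Java code; the helper name `encode_path` and the chosen set of safe characters are assumptions made for this example.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_path(url: str) -> str:
    """Percent-encode characters in the path that may not appear literally.

    Hypothetical helper: "%" is kept in the safe set so that escapes which
    are already present (e.g. %20) are not encoded a second time.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%@:-._~!$&'()*+,;=")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

url = "http://hdl.handle.net/11022/0000-0000-2C97-5@Bauer et al. - The Two Coeval Come.pdf"
print(encode_path(url))
```

Only the blanks change (each becomes %20); the rest of the reference is passed through untouched.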

twagoo commented 5 years ago

@ttrippel this is what the VLO imports and what is evaluated by the curation module : https://vlo.clarin.eu/data/clarin/results/cmdi/Tubingen_Archive_of_Language_Resources_TALAR_/https_talar_sfb833_uni_tuebingen_de_8443_erdora_rest_SFB833_A02_Gedichte_20Emily_20Dickinson_Gedichtkorpus_20Emily_20Dickinson.xml

In oXygen, when I try to validate the document I get this:

System ID: /Users/twagoo/Desktop/https_talar_sfb833_uni_tuebingen_de_8443_erdora_rest_SFB833_A02_Gedichte_20Emily_20Dickinson_Gedichtkorpus_20Emily_20Dickinson.xml
Main validation file: /Users/twagoo/Desktop/https_talar_sfb833_uni_tuebingen_de_8443_erdora_rest_SFB833_A02_Gedichte_20Emily_20Dickinson_Gedichtkorpus_20Emily_20Dickinson.xml
Engine name: oXygen
Severity: error
Description: There is no schema or DTD associated with the document. You can create an association either with the Associate Schema action or configuring in the Options the Preferences/Document Type Association list, or by creating a Validation Scenario.

Indeed there is no schema location specified. This might be the cause of the error reported by the curation module. But I do not know enough about the workings of the curation module to say for sure.

Also, I cannot access the original file via its self link (http://hdl.handle.net/11022/0000-0000-2C97-5 - see VLO record page) so I cannot say for sure where the problem originates.

ttrippel commented 5 years ago

Sorry, the stylesheet transformation had a problem. I have returned to the previous XSLT, so http://hdl.handle.net/11022/0000-0000-2C97-5 should show up in your browser now. Oxygen validates that file; your browser just did not show it before. You might have seen a blank page, but the source code was the valid CMDI file.

I don't understand why the VLO-mirror does not contain a schema reference. AFAIK the OAI-PMH harvest results in a valid XML document including all namespaces; I'll check this again for the whole thing, but for the record in the report it still seems to be the problem @wowasa analysed: handling of blanks by the library. BTW I fully agree that blanks should be avoided. But with existing data there might be a challenge.

twagoo commented 5 years ago

@ttrippel this is from the raw response from the Tübingen OAI-PMH endpoint (full response: oai-repsonse.xml.gz):

        <ns4:record>
            <ns4:header>
                <ns4:identifier>https://talar.sfb833.uni-tuebingen.de:8443/erdora/rest/SFB833/A02/Gedichte%20Emily%20Dickinson/Gedichtkorpus%20Emily%20Dickinson</ns4:identifier>
                <ns4:datestamp>2017-03-14T11:54:51Z</ns4:datestamp>
                <ns4:setSpec>TuebingenHostedResources</ns4:setSpec>
                <ns4:setSpec>SFB833Tuebingen</ns4:setSpec>
            </ns4:header>
            <ns4:metadata>
                <ns7:CMD CMDVersion="1.2">
                    <cmd:Header xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns:cmdp="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176122">
                        <cmd:MdCreator>cmdicast</cmd:MdCreator>
                        <cmd:MdCreationDate>2012-10-11</cmd:MdCreationDate>
                        <cmd:MdSelfLink>http://hdl.handle.net/11022/0000-0000-2C97-5</cmd:MdSelfLink>
                        <cmd:MdProfile>clarin.eu:cr1:p_1527668176126</cmd:MdProfile>
                        <cmd:MdCollectionDisplayName>Tübingen Language Resources</cmd:MdCollectionDisplayName>
    </cmd:Header>
                    <cmd:Resources xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns:cmdp="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176122">
....

I believe that the harvester extracts the verbatim content so the schema location must be embedded in the record but I could be mistaken. I hope that @menzowindhouwer can shed some light on this.

menzowindhouwer commented 5 years ago

The OAI harvester action that takes the CMD record out of the OAI envelope indeed doesn't copy over any XSD-related attributes from the envelope, i.e., it expects them to be on the <cmd:CMD> root. We might do so, although in this case the xsi:schemaLocation does contain a lot that isn't relevant for one specific record.

<ns4:OAI-PMH ... xsi:schemaLocation="
  http://www.clarin.eu/cmd/1 https://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd
  http://www.clarin.eu/cmd/ http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/profiles/clarin.eu:cr1:p_1320657629644/xsd
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176122 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176122/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176123 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176123/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176124 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176124/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176125 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176125/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176126 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176126/1.2/xsd   
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176127 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176127/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176128 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176128/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1524652309872 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1524652309872/1.2/xsd   
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1524652309874 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1524652309874/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1524652309875 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1524652309875/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1524652309876 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1524652309876/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1524652309877 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1524652309877/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1524652309878 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1524652309878/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1288172614026 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1288172614026/1.2/xsd  
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1288172614023 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1288172614023/1.2/xsd 
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1320657629644 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1320657629644/1.2/xsd 
  http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd 
  http://www.openarchives.org/OAI/2.0/oai_dc/  http://www.openarchives.org/OAI/2.0/oai_dc.xsd   
  http://purl.org/dc/elements/1.1/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd     
  http://www.loc.gov/MARC21/slim http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">...</ns4:OAI-PMH>

The example record that Twan used would just need:

<ns7:CMD CMDVersion="1.2" xsi:schemaLocation="
  http://www.clarin.eu/cmd/1 https://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd 
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176126 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176126/1.2/xsd 
">...</ns7:CMD>
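Reducing the envelope's xsi:schemaLocation to the pairs relevant for one record could be sketched as follows. This is only an illustration in Python, not the harvester's code; the function names are hypothetical, and the envelope namespace is the one shown in the listing above.

```python
def parse_schema_location(value: str) -> dict[str, str]:
    """Split an xsi:schemaLocation value into {namespace: schema URL} pairs."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("schemaLocation must contain namespace/location pairs")
    return dict(zip(tokens[0::2], tokens[1::2]))

def relevant_pairs(pairs: dict[str, str], profile_id: str) -> dict[str, str]:
    """Keep only the CMD envelope namespace and the record's own profile namespace."""
    envelope_ns = "http://www.clarin.eu/cmd/1"
    profile_ns = f"{envelope_ns}/profiles/{profile_id}"
    return {ns: loc for ns, loc in pairs.items() if ns in (envelope_ns, profile_ns)}

envelope_value = """
  http://www.clarin.eu/cmd/1 https://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd
  http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1527668176126 https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.1/profiles/clarin.eu:cr1:p_1527668176126/1.2/xsd
  http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd
"""
kept = relevant_pairs(parse_schema_location(envelope_value), "clarin.eu:cr1:p_1527668176126")
print(kept)
```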

But this seems to be an unrelated problem to the URI encoding messages, which stem from the Link Checker (so I understand from Wolfgang).

But if the curation module does need a xsi:schemaLocation on the CMD record we should create an issue on the OAI harvester. Or does it determine the schema itself based on the MdProfile?

wowasa commented 5 years ago

The vlo-importer extracts the profile id in the first place from the MdProfile element and in the second place from schemaLocation. The schema is then taken from a location specified in the vlo-configuration by profile id. So does the curation module - everything else would be a bug. So I don't see why we're discussing the missing schema location, since this is not the cause of Thorsten's issue. The question is: how should the linkchecker process links which contain blanks? At the moment the Java class used throws an exception, which could be avoided by URL-encoding.
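The lookup order described here (MdProfile first, schemaLocation as fallback) might be sketched like this. It is a Python illustration only, not the vlo-importer's implementation; the function name and the regular expression for profile ids are assumptions based on the ids visible in this thread.

```python
import re
import xml.etree.ElementTree as ET

CMD_NS = "http://www.clarin.eu/cmd/1"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
# Assumed shape of a profile id, modelled on ids such as clarin.eu:cr1:p_1527668176126
PROFILE_ID = re.compile(r"clarin\.eu:cr1:p_\d+")

def find_profile_id(cmdi_xml: str):
    """Return the profile id from MdProfile, else from xsi:schemaLocation, else None."""
    root = ET.fromstring(cmdi_xml)
    md_profile = root.find(f"{{{CMD_NS}}}Header/{{{CMD_NS}}}MdProfile")
    if md_profile is not None and md_profile.text and md_profile.text.strip():
        return md_profile.text.strip()
    # Fallback: take the first profile id mentioned in the root's schemaLocation
    match = PROFILE_ID.search(root.get(f"{{{XSI_NS}}}schemaLocation", ""))
    return match.group(0) if match else None
```

With the id in hand, the schema would then be resolved from the configured location rather than from whatever URL the record itself names.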

twagoo commented 5 years ago

@wowasa it's a separate issue indeed but I initially suspected that that would be the cause of the validation error. If this is not the issue, we still have a case of unexplained schema validation error - I don't see how this relates to the behaviour of the link checker.

To be sure: the following section of the report is fully based on the XSD validation on the side of the curation module and the original scope of this issue, right?

XML Validation Section

Number of Records: 413

Number of valid Records: 412

Ratio valid Records: 0.9976
Invalid Records:

    https_talar_sfb833_uni_tuebingen_de_8443_erdora_rest_SFB833_A02_Gedichte_20Emily_20Dickinson_Gedichtkorpus_20Emily_20Dickinson

twagoo commented 5 years ago

Instance validation via https://curate.acdh.oeaw.ac.at/#!ResultView/instance/url/https://talar.sfb833.uni-tuebingen.de:8443/erdora/cmdi/SFB833/A02/Gedichte%2520Emily%2520Dickinson/Gedichtkorpus%2520Emily%2520Dickinson

does give me

https:--talar.sfb833.uni-tuebingen.de:8443-erdora-cmdi-SFB833-A02-Gedichte%20Emily%20Dickinson-Gedichtkorpus%20Emily%20Dickinson 
Illegal character in path at index 50:
http://hdl.handle.net/11022/0000-0000-2C97-5@Bauer et al. - The Two Coeval Come.pdf

The first bit: https:--talar.sfb833.uni-tuebingen.de:8443-erdora-.... seems very strange to me. All forward slashes are replaced with hyphens. The error message looks like the message of a java.net.URISyntaxException, which makes sense.

To verify that this is not a core issue with the curation module, I tried with a random CMDI from HZSK which didn't lead to the same error.

EDIT: the error does refer to the space character indeed. I still suspect that this is a different issue from the one originally reported.
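The index in the message can be checked by hand. A small Python sketch (the helper name and the character set are assumptions for this example; the set is the RFC 3986 unreserved and reserved characters plus "%"):

```python
import string

# Characters that RFC 3986 allows to appear literally somewhere in a URI
# (unreserved + reserved), plus "%" for percent-escapes.
URI_CHARS = set(string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%")

def first_illegal_index(uri: str) -> int:
    """Return the index of the first character outside URI_CHARS, or -1 if none."""
    for index, char in enumerate(uri):
        if char not in URI_CHARS:
            return index
    return -1

url = "http://hdl.handle.net/11022/0000-0000-2C97-5@Bauer et al. - The Two Coeval Come.pdf"
print(first_illegal_index(url))  # 50, the blank after "Bauer"
```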

coy123 commented 5 years ago

The current implementation of the linkchecker uses URL encoding already. It first tries the URL as it is and if it throws an exception, it URL encodes the query parameters (anything that comes after ?) and tries again. This change was done 1-2 months ago so the links may have not been checked again since then but it will be checked in the future.

So to sum up, any space that comes after ? in the URL will be encoded. If the space comes before, then the URL is considered broken and I see no problem with that.
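The behaviour summed up above could be sketched as follows. This is a Python illustration of the described strategy, not the linkchecker's actual Java code; the function name is hypothetical.

```python
from urllib.parse import quote

def encode_query_only(url: str) -> str:
    """Percent-encode only what comes after the first "?"; leave the rest as is."""
    head, sep, query = url.partition("?")
    if not sep:
        return url  # no query part, nothing to encode
    # Keep "=", "&" and existing "%" escapes intact
    return head + sep + quote(query, safe="=&%")

print(encode_query_only("http://example.org/search?q=two words"))
print(encode_query_only("http://example.org/two words.pdf"))
```

The first URL gets its blank encoded as %20; the second is returned unchanged, so a blank in the path would still be reported as broken.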

wowasa commented 5 years ago

ok - then I was probably wrong

twagoo commented 5 years ago

Indeed, it seems that URLs should not have spaces in the path (see e.g. here). However it should still be possible to carry out an instance evaluation in the curation module on CMDI records that carry such a broken reference. Which doesn't work for this case. My suggestion would be to make a separate issue for that.

Concerning the problem covered by this issue: would it be possible to find the precise XML validation error that the curation module runs into for this record (as displayed on the collection page) somewhere?

wowasa commented 5 years ago

@twagoo : not yet but it's an internal issue already. @coy123 : I just checked it again. The error is thrown in line 106 of the HttpLinkChecker at the attempt to instantiate the class HttpHead with the URL. On this level the URL is not encoded. I'm going to create a bug report for this issue

wowasa commented 5 years ago

apparently we had two bugs:

the linkchecker threw an unhandled exception because of the blanks in one of the URLs
the module isolated the profile id from the schema location and used the schema from http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/{PROFILE_ID}/xsd for validation instead of the specified schema.

Both bugs are fixed in curation 2.2 which I'm currently testing. I will deploy it on production in the course of this week.

wowasa commented 5 years ago

curation 2.2 is in production now.

wowasa commented 5 years ago

can we close this issue???

twagoo commented 5 years ago

Two files are still marked as invalid (see report). From the collection report:

<xml-validation-section>
        <totNumOfRecords>420</totNumOfRecords>
        <totNumOfValidRecords>418</totNumOfValidRecords>
        <ratioOfValidRecords>0.9952</ratioOfValidRecords>
        <invalid-records>
            <record name="https_talar_sfb833_uni_tuebingen_de_8443_erdora_rest_SFB833_A02_Gedichte_20Emily_20Dickinson_Gedichtkorpus_20Emily_20Dickinson">
                <issue>line: 39, col: 49 - cvc-complex-type.2.4.c: The matching wildcard is strict, but no declaration can be found for element 'cmdp:TextCorpusProfile'.</issue>
            </record>
            <record name="https_talar_sfb833_uni_tuebingen_de_8443_erdora_rest_SFB833_A03_Compositional_Model_Sem_en_vector_rep">
                <issue>line: 476, col: 27 - Key 'PayloadResourceRef' with value 'readme0000-0007-CFDC-9' not found for identity constraint of element 'CMD'.</issue>
                <issue>line: 476, col: 27 - cvc-id.1: There is no ID/IDREF binding for IDREF 'readme0000-0007-CFDC-9'.</issue>
            </record>
        </invalid-records>
    </xml-validation-section>
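For anyone who wants to pull the issues out of such a report programmatically, a minimal Python sketch follows. The element names are the ones in the excerpt above; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def summarize_validation(section_xml: str):
    """Return (ratio of valid records, {record name: [issue texts]})."""
    section = ET.fromstring(section_xml)
    total = int(section.findtext("totNumOfRecords"))
    valid = int(section.findtext("totNumOfValidRecords"))
    issues = {
        record.get("name"): [issue.text for issue in record.findall("issue")]
        for record in section.iter("record")
    }
    return round(valid / total, 4), issues
```

For the excerpt above this gives 418 / 420, which rounds to the reported 0.9952.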

Do we understand where these come from?

ttrippel commented 5 years ago

From my point of view we may close it. The two open issues: one is a new dataset that indeed had a problem which I did not see before (the ID/IDREF issue); the other document now also validates in the curation module, so if the issue does not reappear after the next harvest, I am fine.

clarin-eric / curation-dashboard

Validation Issue: correct XML file is marked as wrong #31