DOREMUS-ANR / marc2rdf

Converter from UNIMARC/INTERMARC to RDF using the DOREMUS model
Apache License 2.0

Incomplete label on MARC parsing #11

Closed pasqLisena closed 8 years ago

pasqLisena commented 8 years ago

I am parsing this file from the Philharmonie: https://drive.google.com/file/d/0ByLy_xr6iMAYb2NHaW5LcFdYQUE/view?usp=sharing

The genre Poème symphonique is parsed as Po. Other strings with diacritics seem to have no problems (e.g. the description, which also contains 'poème symphonique', is parsed correctly).

You can see it by adding a

System.out.println(field);

in PF22_SelfContainedExpression.java#L479

rtroncy commented 8 years ago

The string Poème is present twice in this XML record: on lines 70 and 123. The file seems to be properly encoded in UTF-8. Does the parsing have a problem in both cases? Is the XML parser we use correctly configured to parse UTF-8 documents?

pasqLisena commented 8 years ago

Does the parsing have a problem in both cases?

Weirdly not. This is the output file that you obtain: https://drive.google.com/file/d/0ByLy_xr6iMAYSHpQVDFvN3cxZXM/view?usp=sharing

rtroncy commented 8 years ago

This is indeed very mysterious. Do we have other examples that show the same problem? Can we try to re-encode '0788075.xml' in case it contains some non-printable characters?

Regarding the RDF output, why do we have:

 <http://data.doremus.org/Self_Contained_Expression/F22/UUID>
    mus:U12_has_genre   [ cidoc-crm:P1_is_identified_by "musique contemporaine"@fr ] ;

instead of:

 <http://data.doremus.org/Self_Contained_Expression/F22/UUID>
    mus:U12_has_genre   <http://data.doremus.org/vocabulary/genre/CODE> ;

pasqLisena commented 8 years ago

Regarding the RDF output, why do we have: [...] instead of: [...]

We do not have that label in the vocabulary (I put the same example in this comment).

rtroncy commented 8 years ago

I see! This is because we have not yet processed the genre controlled vocabularies used by the Philharmonie, nor integrated them with our main genre SKOS thesaurus. @marie-ototoi Can you please help us and let us know which controlled vocabulary you are using for musical genres at the Philharmonie?

pasqLisena commented 8 years ago

The XML reader was splitting the text into two lines, "Po" and "ème". I have fixed this behavior.
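
For readers who hit the same symptom: the thread does not say which parser the XML reader uses or how the fix was made, so the following is only a sketch of one common cause, assuming a SAX-style reader. A SAX parser may deliver a single text node through several `characters()` calls, and a handler that keeps only one chunk ends up with a truncated value such as "Po". The class name `SubfieldTextDemo`, the handler, and the `subfield` element name are hypothetical and chosen for illustration; they are not taken from marc2rdf.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SubfieldTextDemo {

    /** Hypothetical handler: collects the complete text of every <subfield> element. */
    static class SubfieldHandler extends DefaultHandler {
        final List<String> values = new ArrayList<>();
        private StringBuilder buffer; // non-null only while inside a <subfield>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("subfield".equals(qName)) {
                buffer = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may call this several times for one text node,
            // so each chunk is appended instead of replacing the previous one.
            if (buffer != null) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("subfield".equals(qName)) {
                values.add(buffer.toString());
                buffer = null;
            }
        }
    }

    /** Parses the given XML string and returns the text of each <subfield>. */
    static List<String> readSubfields(String xml) throws Exception {
        SubfieldHandler handler = new SubfieldHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<record><subfield code=\"a\">Po\u00e8me symphonique</subfield></record>";
        System.out.println(readSubfields(xml));
    }
}
```

Buffering between `startElement` and `endElement` makes the result independent of how the parser chooses to chunk the text, which can otherwise vary with buffer boundaries and character entities.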