Closed: pasqLisena closed this issue 8 years ago
The string "Poème" is present twice in this XML record: on lines 70 and 123. The file seems to be properly encoded in UTF-8. Does the parsing have a problem in both cases? Is the XML parser correctly configured to parse UTF-8 documents?
Does the parsing have a problem in both cases?
Weirdly not. This is the output file that you obtain: https://drive.google.com/file/d/0ByLy_xr6iMAYSHpQVDFvN3cxZXM/view?usp=sharing
This is indeed very mysterious. Do we have other examples where the problem occurs too? Can we try to re-encode '0788075.xml' in case it contains non-printable characters?
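One way to check the file for non-printable characters before re-encoding it is to scan the decoded text for control and format code points. The following is only a minimal sketch, assuming Java 11 or later; `ControlCharScanner` and its methods are hypothetical names and not part of the project. Reading the file strictly as UTF-8 also surfaces malformed byte sequences, since `Files.readString` throws an `IOException` in that case.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the project): reports control and
// format characters that are invisible in most editors.
class ControlCharScanner {

    // Returns one "line:column U+XXXX" entry per suspicious code point.
    static List<String> scan(String text) {
        List<String> hits = new ArrayList<>();
        int line = 1;
        int col = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\n') {          // a line feed starts a new line
                line++;
                col = 0;
                continue;
            }
            col++;
            boolean ordinaryWhitespace = cp == '\t' || cp == '\r';
            int type = Character.getType(cp);
            if (!ordinaryWhitespace
                    && (type == Character.CONTROL || type == Character.FORMAT)) {
                hits.add(String.format("%d:%d U+%04X", line, col, cp));
            }
        }
        return hits;
    }

    // Decodes the file strictly as UTF-8; malformed input raises an IOException.
    static List<String> scanFile(Path path) throws IOException {
        return scan(Files.readString(path, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ControlCharScanner 0788075.xml
        scanFile(Path.of(args[0])).forEach(System.out::println);
    }
}
```

An empty result would suggest that hidden characters are not the cause and that the problem lies in how the text is read.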
Regarding the RDF output, why do we have:
<http://data.doremus.org/Self_Contained_Expression/F22/UUID>
mus:U12_has_genre [ cidoc-crm:P1_is_identified_by "musique contemporaine"@fr ] ;
instead of:
<http://data.doremus.org/Self_Contained_Expression/F22/UUID>
mus:U12_has_genre <http://data.doremus.org/vocabulary/genre/CODE> ;
Regarding the RDF output, why do we have: [...] instead of: [...]
We do not have that label in the vocabulary (I put the same example in this comment).
I see! This is because we have not yet processed the genre controlled vocabularies used by the Philharmonie, nor integrated them with our main genre SKOS thesaurus. @marie-ototoi Can you please help us and let us know which controlled vocabulary you are using for the musical genre at the Philharmonie?
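To illustrate the behavior discussed above, here is a rough sketch of a label lookup that emits a vocabulary URI when the label is known and falls back to a labelled blank node otherwise. It does not show how the converter actually does this: `GenreResolver`, its label-to-URI map, and the `example.org` URI in the usage are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch (not the converter's real code): maps a genre label
// to a vocabulary URI, or falls back to a blank node carrying the label.
class GenreResolver {
    private final Map<String, String> labelToUri = new HashMap<>();

    GenreResolver(Map<String, String> vocabulary) {
        // Normalize keys once so lookups ignore case and outer whitespace.
        for (Map.Entry<String, String> e : vocabulary.entrySet()) {
            labelToUri.put(normalize(e.getKey()), e.getValue());
        }
    }

    private static String normalize(String label) {
        return label.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the Turtle object for mus:U12_has_genre.
    String toTurtleObject(String label, String lang) {
        String uri = labelToUri.get(normalize(label));
        if (uri != null) {
            return "<" + uri + ">";
        }
        // Unknown label: keep it as a literal inside a blank node.
        String escaped = label.replace("\\", "\\\\").replace("\"", "\\\"");
        return "[ cidoc-crm:P1_is_identified_by \"" + escaped + "\"@" + lang + " ]";
    }

    public static void main(String[] args) {
        GenreResolver resolver = new GenreResolver(
                Map.of("symphonic poem", "http://example.org/genre/symphonic_poem"));
        System.out.println(resolver.toTurtleObject("Symphonic poem", "en"));
        System.out.println(resolver.toTurtleObject("musique contemporaine", "fr"));
    }
}
```

Once the Philharmonie vocabulary is integrated, more labels would resolve to URIs and the blank-node fallback would appear less often.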
The XML reader was splitting the string into two parts, "Po" and "ème". I fixed this behavior.
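For context, a split like this is typical of SAX-style parsing, where `characters()` may be called more than once for a single text node. Whether that was the exact cause here is an assumption, and the actual fix is not shown in this thread. The sketch below only illustrates the usual remedy, which is to append each chunk to a buffer and read it when the element ends. `GenreTextHandler` and the `<genre>` element name are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <genre> element.
class GenreTextHandler extends DefaultHandler {
    final List<String> genres = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inGenre = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("genre".equals(qName)) {
            inGenre = true;
            buffer.setLength(0);   // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text node in several chunks,
        // so append each chunk instead of overwriting the value.
        if (inGenre) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("genre".equals(qName)) {
            genres.add(buffer.toString().trim());
            inGenre = false;
        }
    }

    // Parses an XML string as UTF-8 bytes and returns the collected genres.
    static List<String> parse(String xml) throws Exception {
        GenreTextHandler handler = new GenreTextHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.genres;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<record><genre>Poème symphonique</genre></record>"));
    }
}
```

A handler that assigns the latest chunk instead of appending it keeps only part of the text, which would match the truncated "Po" reported here if the chunks were handled separately.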
I am parsing this file from the Philharmonie: https://drive.google.com/file/d/0ByLy_xr6iMAYb2NHaW5LcFdYQUE/view?usp=sharing
The genre "Poème symphonique" is parsed as "Po". Other strings with diacritics seem to have no problems (e.g. the description, which also contains 'poème symphonique', is parsed correctly). You can see it by adding a
in PF22_SelfContainedExpression.java#L479