DOREMUS-ANR / itema3converter

Converter from ITEMA3 xml to DOREMUS rdf
Apache License 2.0

Catalogue statement without space #19

Closed pasqLisena closed 6 years ago

pasqLisena commented 6 years ago

I am correcting what I said at our general meeting about extracting catalogue numbers from itema3 titles: this was done correctly for a number of resources (many simply have no catalogue number at all). However, there are still some instances where it apparently was not done. Here is an example from the new itema3 delivery:

<http://data.doremus.org/expression/c4a77640-84ba-31e7-abf0-89e52e317480>
       a                           efrbroo:F22_Self-Contained_Expression ;
       rdfs:comment                "pour voix et piano, transcription non identifi?e pour septuor ? cordes"@fr ;
       rdfs:label                  "5 Lieder op 41 TrV195" ;
       mus:U12_has_genre           <http://data.doremus.org/vocabulary/itema3/genre/musdoc/51> , <http://data.doremus.org/vocabulary/itema3/genre/musdoc/116> ;
       mus:U17_has_opus_statement  <http://data.doremus.org/expression/c4a77640-84ba-31e7-abf0-89e52e317480/opus/41> ;
       mus:U19_has_style           [ a           mus:M19_Style ;
                                     rdfs:label  "septuor"
                                   ] ;
       mus:U19_has_style           [ a           mus:M19_Style ;
                                     rdfs:label  "musique de chambre"
                                   ] ;
       ecrm:P102_has_title         "5 Lieder op 41 TrV195" ;
       ecrm:P3_has_note            "pour voix et piano, transcription non identifi?e pour septuor ? cordes"@fr ;
       dcterms:identifier          "m20069639" .

The catalogue number is TrV195 (as identified by the BnF).

-- issue by @kgtodorov

pasqLisena commented 6 years ago

The problem here is the missing space (TrV 195 or TrV.195 would have been recognised correctly). I will check whether handling these cases has any bad effects on the rest.

Do you have other examples that do not involve the missing space? Just to be sure that we fully fix the bug.
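
To illustrate the missing-space case, here is a minimal Python sketch, not the converter's actual code. It assumes a small hypothetical list of catalogue abbreviations (`CATALOGUE_CODES`) and a hypothetical helper `find_catalogue`; the real converter relies on the DOREMUS vocabularies. The separator between the abbreviation and the number is made optional, so "TrV195", "TrV 195" and "TrV.195" are all accepted.

```python
import re

# Hypothetical list of catalogue abbreviations, for illustration only.
CATALOGUE_CODES = ["TrV", "BWV", "KV", "K"]

# Try longer abbreviations first so that "KV" is preferred over "K".
_codes = "|".join(
    re.escape(code) for code in sorted(CATALOGUE_CODES, key=len, reverse=True)
)

# Abbreviation, then an optional separator (dots and/or whitespace), then the number.
CATALOGUE_RE = re.compile(rf"\b({_codes})[.\s]*(\d+[a-z]?)\b")


def find_catalogue(title):
    """Return (abbreviation, number) for the first catalogue statement, or None."""
    match = CATALOGUE_RE.search(title)
    return (match.group(1), match.group(2)) if match else None


for title in ("5 Lieder op 41 TrV195", "5 Lieder op 41 TrV 195", "5 Lieder op 41"):
    print(title, "->", find_catalogue(title))
```

Restricting the match to known abbreviations is one way to limit side effects on titles that merely contain letters followed by digits.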

pierrechoffe commented 6 years ago

as an aside, if I'm not wrong, the following

ecrm:P3_has_note "pour voix et piano, transcription non identifi?e pour septuor ? cordes"@fr

should use property u67_has_subtitle (or u72_has_title_note ?), not p3_has_note

plus, what do we do with unrecognized special characters (é and à transformed into ?)

rtroncy commented 6 years ago

plus, what do we do with unrecognized special characters (é and à transformed into ?)

Those characters are handled correctly in the data. Konstantin copy-pasted from a UI that does not display them properly. @pierrechoffe Don't trust the UI, trust the data :-)

rtroncy commented 6 years ago

Regarding the choice of the most appropriate property, this is Martine's decision. ecrm:P3 is indeed generic. I'm not sure how much we already use mus:U72 or mus:U67. @pasqLisena Can you ask Martine what the mapping should be? Or you may also have an opinion on what should be used.

pasqLisena commented 6 years ago

That text is in the OMU_DESCRIPTION field in the source file (that can potentially contain other kinds of comment).

The ecrm:P3 property is the only one which (even if generic) fits all the possible contents.

pierrechoffe commented 6 years ago

The ecrm:P3 property is the only one which (even if generic) fits all the possible contents.

Yep, I suspected that: RF uses the same field for different contents, so it has to be a generic prop indeed. Do you have any idea of how much u67 and u72 are used at PP and BnF?

rtroncy commented 6 years ago

@pierrechoffe Once we load the new dumps into the triple store, SPARQL will be your friend for getting counts and usage information.

pasqLisena commented 6 years ago

Other examples:

pasqLisena commented 6 years ago

The latest commit resolves the last 3 examples (by updating the vocabularies).