RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

StringIndexOutOfBoundsException error with rmlmapper-6.0.0 #192

Closed: JBPressac closed this issue 1 year ago

JBPressac commented 1 year ago

Hello, using rmlmapper-6.0.0-r363-all.jar with the following RML file, I get the error message below, whereas rmlmapper-5.0.0-r362-all.jar generates the triples successfully:

The error message:

09:37:19.737 [main] ERROR be.ugent.rml.cli.Main               .main(436) - begin 0, end -1, length 9
java.lang.StringIndexOutOfBoundsException: begin 0, end -1, length 9
        at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
        at java.base/java.lang.String.substring(String.java:1874)
        at be.ugent.rml.cli.Main.main(Main.java:197)
        at be.ugent.rml.cli.Main.main(Main.java:45)

The RML file:

prefixes:
 # The validity of these prefixes, and of the property and class URIs used, remains to be checked
 # For example, crm:P98i_was_born or crm:P98_was_born?
 bibo: http://purl.org/ontology/bibo/
 bnf-onto: http://data.bnf.fr/ontology/bnf-onto/
 crm: http://www.cidoc-crm.org/cidoc-crm/
 crm-sup: https://ontome.net/ns/sdh-crm-supplement/
 foaf: http://xmlns.com/foaf/0.1/
 frbroo: http://iflastandards.info/ns/fr/frbr/frbroo/
 grel: http://users.ugent.be/~bjdmeest/function/grel.ttl# 
 idlab-fn: http://example.com/idlab/function/
 idref: https://www.idref.fr/
 isni: http://isni.org/ontology#
 lexvo: http://lexvo.org/id/iso639-3/
 owl: http://www.w3.org/2002/07/owl#  
 prelib: http://mshb.huma-num.fr/prelib/
 sdh: https://ontome.net/ns/sdhss/
 sdh-int: https://ontome.net/ns/intellectual-literary-life
 sdh-so: https://ontome.net/ns/social-life-specific/
 wd: http://www.wikidata.org/entity/ 
 xsd: http://www.w3.org/2001/XMLSchema#

variables:
  access: &dbHost http://localhost/database
  type: &dbType mysql
  referenceFormulation: &csvReferenceFormulation csv
  credentials: &credentials
    username: $(_username)
    password: $(_password)

sources:
 prelib_airegeographique:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query : SELECT * FROM prelib_airegeographique; 
 prelib_appellationpersonne:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query : > 
   SELECT * FROM prelib_appellationpersonne 
   JOIN prelib_langue ON prelib_langue.id = prelib_appellationpersonne.langue_id; 
 prelib_collectif:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query : SELECT * FROM prelib_collectif;
 prelib_collectifecritoeuvre:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query: >
   SELECT (CASE WHEN date_ecriture REGEXP '^\\d+$' THEN date_ecriture END) AS annee, 
   prelib_collectifecritoeuvre.* FROM prelib_collectifecritoeuvre 
   JOIN prelib_fonctionecritoeuvre ON prelib_fonctionecritoeuvre.id = prelib_collectifecritoeuvre.fonction_id  
   ORDER BY prelib_collectifecritoeuvre.id
 prelib_ecritoeuvre:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation
  # date_ecriture is of type VARCHAR and sometimes contains years, sometimes year ranges, sometimes dates in French format,
  # sometimes comments. For now, we only retrieve the entries that match the REGEXP '^\\d+$'.
  # Since prelib_ecritoeuvre contains relations with roles other than authors, date_ecriture should be considered
  # as the date the role was exercised rather than the date of writing strictly speaking.
  # TODO: Consider simply removing this information from the triples.
  # Creates the following fields:
  # - annee
  query: >
   SELECT (CASE WHEN date_ecriture REGEXP '^\\d+$' THEN date_ecriture END) AS annee, 
   prelib_ecritoeuvre.* FROM prelib_ecritoeuvre 
   JOIN prelib_fonctionecritoeuvre ON prelib_fonctionecritoeuvre.id = prelib_ecritoeuvre.fonction_id  
   ORDER BY prelib_ecritoeuvre.id   
 prelib_editeoeuvre:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation
  query : SELECT * FROM prelib_editeoeuvre;
 prelib_editerevue:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation
  query : SELECT * FROM prelib_editerevue;  
 prelib_edition:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation
  query : SELECT * FROM prelib_edition;
 prelib_oeuvre:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query : SELECT * FROM prelib_oeuvre;
 prelib_oeuvreedition:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query : SELECT * FROM prelib_oeuvreedition;  
 prelib_oeuvrelangue:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query : > 
   SELECT * FROM prelib_oeuvrelangue JOIN prelib_langue 
   WHERE prelib_langue.id = prelib_oeuvrelangue.langue_id;
 prelib_paraitdans:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query : SELECT * FROM prelib_paraitdans;    
 prelib_personne:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation
  # Creates the following fields:
  # - date_naissance = full date of birth, yyyy-mm-dd or yyyy
  # - date_naissance_dtype = date or gYear
  # - likewise for date_deces and date_deces_dtype
  # - FRBNF = BnF record number
  # - gender = Male or Female
  query: >
   SELECT (CASE WHEN (jour_naissance IS NOT NULL AND mois_naissance IS NOT NULL AND annee_naissance IS NOT NULL) 
   THEN CONCAT_WS('-', LPAD(`annee_naissance`, 4, 0), LPAD(`mois_naissance`, 2, 0), LPAD(`jour_naissance`, 2, 0)) 
   WHEN (annee_naissance IS NOT NULL) THEN annee_naissance END) AS date_naissance, 
   (CASE WHEN (jour_naissance IS NOT NULL AND mois_naissance IS NOT NULL AND annee_naissance IS NOT NULL) 
   THEN 'date' WHEN (annee_naissance IS NOT NULL) THEN 'gYear' END) AS date_naissance_dtype, 
   (CASE WHEN (jour_deces IS NOT NULL AND mois_deces IS NOT NULL AND annee_deces IS NOT NULL) 
   THEN CONCAT_WS('-', LPAD(`annee_deces`, 4, 0), LPAD(`mois_deces`, 2, 0), LPAD(`jour_deces`, 2, 0)) 
   WHEN (annee_deces IS NOT NULL) THEN annee_deces END) AS date_deces, 
   (CASE WHEN (jour_deces IS NOT NULL AND mois_deces IS NOT NULL AND annee_deces IS NOT NULL) 
   THEN 'date' WHEN (annee_deces IS NOT NULL) THEN 'gYear' END) AS date_deces_dtype,   
   SUBSTRING(ark_bnf, 9, CHAR_LENGTH(ark_bnf) - 9) AS FRBNF, 
   IF(sexe = 'M', REPLACE(sexe, 'M', 'male'),IF(sexe = 'F', REPLACE(sexe, 'F', 'female'),'')) AS gender, 
   prelib_personne.* FROM prelib_personne ORDER BY id ;
 prelib_profession:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation
  query : SELECT * FROM prelib_profession;
 prelib_revue:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation
  query : SELECT * FROM prelib_revue;
 prelib_ville:
  access: *dbHost
  credentials: *credentials
  type: *dbType
  referenceFormulation: *csvReferenceFormulation   
  query : SELECT * FROM prelib_ville;    

# TODO:
# - Put the triples into graphs, generate N-Quads
# - Modify PRELIB to add the Geonames identifiers of the cities
# - Edit PRELIB to move the 'intitule' field of prelib_profession into qualite_id (prelib_qualiteparticipecollectif)
# - Express the information in prelib_ecritoeuvre with the ontologies used by the BnF and Abes

# Shouldn't we distinguish the PRELIB record from what the record describes, as IdRef
# and the BnF do? E.g. on data.bnf.fr: DESCRIBE <http://data.bnf.fr/ark:/12148/cb16623239r#about>
# which includes the triple:
# <http://data.bnf.fr/ark:/12148/cb16623239r> <http://xmlns.com/foaf/0.1/focus> <http://data.bnf.fr/ark:/12148/cb16623239r#about>

mappings:
# Inspired by data.bnf.fr: DESCRIBE <http://data.bnf.fr/ark:/12148/cb16623239r#about>
# See also Théodore Hersart de la Villemarqué http://data.bnf.fr/ark:/12148/cb119103377#about and http://www.idref.fr/026956845/id
# See also Michel Serres http://www.idref.fr/02713329X/id
# Issues: https://github.com/RMLio/rmlmapper-java/issues/159 (RML Mapper does not take empty VARCHAR columns into account)
 PersonneMapping:
  sources: prelib_personne
  subjects: prelib:personne/$(id)
  predicateobjects:
   - [a, crm:E21_Person]
   - [rdfs:label, $(nom_usuel)]
   - [foaf:name, $(nom_usuel)]
   - [foaf:familyName, $(nom_etat_civil)]
   - [foaf:givenName, $(prenom_etat_civil)]
   - [bnf-onto:FRBNF, $(FRBNF)]
   - [isni:identifierValid, $(isni)]
   - [owl:sameAs, http://www.wikidata.org/entity/$(wikidata)~iri]
   - [owl:sameAs, http://www.idref.fr/$(idref)~iri]
   - [owl:sameAs, http://data.bnf.fr/ark:/$(ark_bnf)#foaf:Person~iri]
   - [owl:sameAs, http://isni.org/isni/$(isni)~iri]
   - [owl:sameAs, http://viaf.org/viaf/$(viaf)~iri]
   - [crm-sup:P20_same_as_URI, http://www.wikidata.org/entity/$(wikidata)~iri]
   - [crm-sup:P20_same_as_URI, http://www.idref.fr/$(idref)~iri]
   - [crm-sup:P20_same_as_URI, http://data.bnf.fr/ark:/$(ark_bnf)#foaf:Person~iri]
   - [crm-sup:P20_same_as_URI, http://isni.org/isni/$(isni)~iri]
   - [crm-sup:P20_same_as_URI, http://viaf.org/viaf/$(viaf)~iri]
   - [foaf:gender, $(gender), en~lang]
   - p: crm:P1_is_identified_by
     o:
      - mapping: AppellationPersonneMapping
        condition:
         function: equal
         parameters:
          - [str1, $(id)]
          - [str2, $(personne_id)]
   - p: crm:P98_was_born
     o:
      - mapping: BirthMapping
        condition:
         function: equal
         parameters:
          - [str1, $(id)]
          - [str2, $(id)]
   - p: crm:P100_died_in 
     o:
      - mapping: DeathMapping
        condition:
         function: equal
         parameters:
          - [str1, $(id)]
          - [str2, $(id)]           

 # With Vincent and Francesco's proposal, there is no distinction between preferred forms and rejected forms
 # blank nodes (anonymous resources)
 AppellationPersonneMapping:
  sources: prelib_appellationpersonne
  predicateobjects:
  - [a, crm:E41_Appellation]
  - [crm-sup:P21_has_value, $(appellation), $(code_iso_639_3)~lang]

 AppellationRetenuePersonneMapping:
  sources: prelib_appellationpersonne
  subjects: prelib:personne/$(personne_id)
  predicateobjects:
   - [skos:prefLabel, $(appellation), $(code_iso_639_3)~lang]
  condition:
    function: idlab-fn:equal
    parameters:
     - [grel:valueParameter, $(forme)]
     - [grel:valueParameter2, "2"]

 AppellationRejetteePersonneMapping:
  sources: prelib_appellationpersonne
  subjects: prelib:personne/$(personne_id)
  predicateobjects:
   - [skos:altLabel, $(appellation), $(code_iso_639_3)~lang]
  condition:
    function: idlab-fn:equal
    parameters:
     - [grel:valueParameter, $(forme)]
     - [grel:valueParameter2, "1"]

 DeathMapping:
  sources: prelib_personne
  predicateobjects:
   - [a, crm:E69_Death]
   - p: crm:P7_took_place_at
     o:
      - mapping: VilleMapping
        condition:
         function: equal
         parameters:
          - [str1, $(ville_deces_id)]
          - [str2, $(id)]
   - p: crm:P4_has_time_span
     o:
      - mapping: TimeSpanDeathMapping
        condition:
         function: equal
         parameters:
          - [str1, $(id)]
          - [str2, $(id)]    

 BirthMapping:
  sources: prelib_personne
  predicateobjects:
   - [a, crm:E67_Birth]
   - p: crm:P7_took_place_at
     o:
      - mapping: VilleMapping
        condition:
         function: equal
         parameters:
          - [str1, $(ville_naissance_id)]
          - [str2, $(id)]
   - p: crm:P4_has_time_span
     o:
      - mapping: TimeSpanBirthMapping
        condition:
         function: equal
         parameters:
          - [str1, $(id)]
          - [str2, $(id)] 

 # yarrrml-parser does not take into account the use of a reference to build the datatype
 # issue created: https://github.com/RMLio/yarrrml-parser/issues/162#issuecomment-1104892337
 TimeSpanBirthMapping:
  sources: prelib_personne
  predicateobjects:
   - [a, crm:E52_Time-Span]
   - p: crm:P82_at_some_time_within
     o: $(date_naissance)
     datatype: xsd:gYear
     condition:
      function: equal
      parameters:
       - [str1, $(date_naissance_dtype)]
       - [str2, 'gYear']
   - p: crm:P82_at_some_time_within
     o: $(date_naissance)
     datatype: xsd:date
     condition:
      function: equal
      parameters:
       - [str1, $(date_naissance_dtype)]
       - [str2, 'date']

 TimeSpanDeathMapping:
  sources: prelib_personne
  predicateobjects:
   - [a, crm:E52_Time-Span]
   - p: crm:P82_at_some_time_within
     o: $(date_deces)
     datatype: xsd:gYear
     condition:
      function: equal
      parameters:
       - [str1, $(date_deces_dtype)]
       - [str2, 'gYear']
   - p: crm:P82_at_some_time_within
     o: $(date_deces)
     datatype: xsd:date
     condition:
      function: equal
      parameters:
       - [str1, $(date_deces_dtype)]
       - [str2, 'date']

 CollectifMapping:
  sources: prelib_collectif
  subjects: prelib:collectif/$(id)
  predicateobjects:
   - [a, crm:E74_Group]
   - [rdfs:label, $(nom), fr~lang]

 EditionMapping:
  sources: prelib_edition
  subjects: prelib:edition/$(id)
  predicateobjects:
   - [a, frbroo:F3_Manifestation_Product_Type]
   - [rdfs:label, $(titre)]
   - [crm:P102_has_title, $(titre)]
   - [owl:sameAs, http://www.wikidata.org/entity/$(wikidata)~iri]
   - [owl:sameAs, http://www.sudoc.org/$(sudoc)~iri]
   - [owl:sameAs, http://data.bnf.fr/ark:/$(ark_bnf)~iri]
   - [crm-sup:P20_same_as_URI, http://www.wikidata.org/entity/$(wikidata)~iri]
   - [crm-sup:P20_same_as_URI, http://www.sudoc.org/$(sudoc)~iri]
   - [crm-sup:P20_same_as_URI, http://data.bnf.fr/ark:/$(ark_bnf)~iri]
   - [bibo:isbn10, $(isbn_10)]
   - [bibo:isbn13, $(isbn_13)]   

 VilleMapping:
  sources: prelib_ville
  subjects: prelib:ville/$(id)
  predicateobjects:
   - [a, sdh:C13_Geographical_Place]
   - [rdfs:label, $(nom), fr~lang]
   - [owl:sameAs, wd:$(wikidata)~iri]
   - [crm-sup:P20_same_as_URI, wd:$(wikidata)~iri]       

Thank you,

bjdmeest commented 1 year ago

Dear @JBPressac, there's a small regression in v6 that causes problems with command-line arguments that use relative paths without a leading ./. For example, -m mapping.rml.ttl could fail, whereas -m ./mapping.rml.ttl should work. A fix is on the way, but in the meantime I hope this helps.
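
For illustration, here is a minimal sketch of how this kind of regression can arise. The code at Main.java:197 is not shown in this thread, so the class, the method names and the file name output.nt are all hypothetical. The sketch only assumes that a directory is derived by cutting the path at its last '/', which is consistent with the reported "begin 0, end -1" message.

```java
public class RelativePathDemo {

    // Hypothetical naive helper: assumes the path always contains a '/'.
    static String naiveParentDir(String path) {
        int slash = path.lastIndexOf('/');   // -1 when the path has no '/'
        return path.substring(0, slash);     // substring(0, -1) throws StringIndexOutOfBoundsException
    }

    // Hypothetical defensive helper: a bare file name is treated as living in the current directory.
    static String safeParentDir(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? "." : path.substring(0, slash);
    }

    public static void main(String[] args) {
        // With a leading "./" both helpers find a '/' and succeed.
        System.out.println(naiveParentDir("./output.nt"));
        System.out.println(safeParentDir("output.nt"));

        // Without it, the naive helper fails the same way the report describes.
        try {
            naiveParentDir("output.nt");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive helper failed: " + e.getMessage());
        }
    }
}
```

Under that assumption, prefixing every path argument with ./ works around the problem because each path then contains at least one '/'.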

JBPressac commented 1 year ago

@bjdmeest thank you for your suggestion. Alas, it does not solve my problem.

bjdmeest commented 1 year ago

Hmm, have you tried that with all CLI path arguments, including, e.g., the output path? If so, could you send us the CLI arguments you're using?

JBPressac commented 1 year ago

OK, you are right, I forgot to apply the ./ to the output path, sorry.