RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high-quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

CSV With Byte Order Mark: Accessing First Column's Name #171

Closed tobiasschweizer closed 2 years ago

tobiasschweizer commented 2 years ago

Hi there,

I am working with a CSV file and ran into a problem when trying to reference the first column by its name.

CSV: https://data.snf.ch/Exportcsv/GrantWithAbstracts.csv

mapping.ttl:

@prefix csvw: <http://www.w3.org/ns/csvw#> .
@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix schema: <http://schema.org/>.
@prefix wgs84_pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>.
@prefix gn: <http://www.geonames.org/ontology#>.
@prefix carml: <http://carml.taxonic.com/carml/> .
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#> .
@prefix grel: <http://users.ugent.be/~bjdmeest/function/grel.ttl#> .
@prefix fno: <https://w3id.org/function/ontology#> .
@base <http://example.com/ns#>.

<#LogicalSourceGrant> a rml:BaseSource ;
  rml:source <#CSVW_sourceGrant> ;
  rml:referenceFormulation ql:CSV .

<#CSVW_sourceGrant> a csvw:Table;
   csvw:url "GrantWithAbstracts.csv" ;
   csvw:dialect [ a csvw:Dialect;
       csvw:delimiter ";"
   ] .

<#ProjectMapping> a rr:TriplesMap;
  rml:logicalSource <#LogicalSourceGrant> ;

  rr:subjectMap [
    rr:template "http://snf.ch/project/{GrantNumber}";
    rr:class schema:ResearchProject
  ] ;

  rr:predicateObjectMap [
    rr:predicate schema:description ;
    rr:objectMap [
      rml:reference "Abstract" # first column's name
    ]
  ] .

Running java -jar rmlmapper-5.0.0-r362-all.jar -m mapping.ttl -s jsonld returns:

09:34:31.754 [main] ERROR be.ugent.rml.cli.Main .main(393) - Mapping for Abstract not found, expected one of [LaySummaryLead_En, MainDiscipline_Level1, GrantNumberString, Keywords, ResponsibleApplicantName, GrantNumber, CallFullTitle, LaySummaryLead_It, MainDiscipline_Level2, AmountGrantedAllSets, Institute, CallDecisionYear, LaySummary_Fr, FundingInstrumentReporting, MainDiscipline, LaySummaryLead_Fr, AllDisciplines, LaySummaryLead_De, Title, FundingInstrumentLevel1, TitleEnglish, LaySummary_De, Abstract, ResearchInstitution, EffectiveGrantEndDate, FundingInstrumentPublished, MainDisciplineNumber, InstituteCountry, LaySummary_En, State, LaySummary_It, CallEndDate, EffectiveGrantStartDate]

Encoding, as reported by file -I GrantWithAbstracts.csv:

GrantWithAbstracts.csv: text/plain; charset=utf-8

I looked at the file using hexdump -c GrantWithAbstracts.csv | less:

0000000 <EF> <BB> <BF> A b s t r a c t ; A l l D
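
The same check can be scripted instead of reading the hexdump by eye. This is a minimal Python sketch, not part of RMLMapper; the helper name `starts_with_utf8_bom` is made up for illustration, and the sample bytes are copied from the hexdump line above.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes begin with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# First bytes of the file as shown in the hexdump above.
sample = b"\xef\xbb\xbfAbstract;AllD"

print(starts_with_utf8_bom(sample))            # True
print(starts_with_utf8_bom(b"Abstract;AllD"))  # False
```

For a file on disk, reading the first three bytes is enough, e.g. `starts_with_utf8_bom(open("GrantWithAbstracts.csv", "rb").read(3))`.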

I figured that this is a byte order mark (BOM). I understand that the presence of a BOM is not required:

Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM.

(https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8)

Is it possible that the BOM is (mistakenly) somehow part of the first column's name (and thus "Abstract" is not found)?
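
To illustrate what I suspect is happening, here is a small Python sketch. It is not how RMLMapper reads CSV files (I have not checked its code); it only shows that a generic CSV reader which decodes the file as plain UTF-8 keeps the BOM as an invisible character in front of the first header. The two-column sample and its data row are made up; only the column names and the ";" delimiter come from my file.

```python
import csv
import io

# Made-up sample: UTF-8 BOM, then a header row with two of the real column names.
raw = b"\xef\xbb\xbfAbstract;GrantNumber\r\nSome abstract text;12345\r\n"


def read_header(data: bytes, encoding: str) -> list:
    """Decode the bytes with the given encoding and return the CSV header row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text, delimiter=";"))


plain = read_header(raw, "utf-8")         # BOM stays in the first column name
stripped = read_header(raw, "utf-8-sig")  # "utf-8-sig" drops a leading BOM

print(repr(plain[0]))          # '\ufeffAbstract'
print("Abstract" in plain)     # False
print("Abstract" in stripped)  # True
```

This would also be consistent with the error message above: "Abstract" appears in the list of expected names, but if that entry actually starts with U+FEFF, it prints identically and still does not match the reference.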

Thanks a lot for any hint.

DylanVanAssche commented 2 years ago

Found a fix, will be included in next release, thanks for reporting!

tobiasschweizer commented 2 years ago

"Found a fix, will be included in next release, thanks for reporting!"

That's great. Thank you very much for looking into this.