NCEAS / adc-disciplines

Discipline taxonomy derived from re3data/DFG subject classification

identifiers with spurious unicode characters #2

Closed by mbjones 2 years ago

mbjones commented 2 years ago

The R script that generates ADCAT.ttl does so with a function that creates the class names for each of the terms. Those are drawn from the integer identifiers in the CSV file, which are then padded to create the class names, like so:

odo:ADCAT_00013
    a owl:Class ;
    rdfs:label "Biochemistry" ;
    rdfs:subClassOf odo:ADCAT_00011 .

The padding works fine for identifiers of 10 and above, but for identifiers 0 to 9 it inserts a spurious space (serialized as the unicode escape \u0020) and mangles the prefixed URI format, as follows:

<ADCAT_000\u00202>
    a owl:Class ;
    rdfs:label "Humanities" ;
    rdfs:subClassOf odo:ADCAT_00001 .

Interestingly, the parent reference in the subclass triple seems to be created fine. Not sure what's up here. @amoeba, I would love to get your review of this if you have a minute.

mbjones commented 2 years ago

NM, figured it out -- it was because I was formatting the id with a %s format string, when I should have been treating it as a number and formatting it with %d. Simple change, and the TTL file now looks correct. Committed in sha 005e9f6fe0971c1dd9be756badbf47a706daed82.
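
For anyone hitting the same thing, here is a rough sketch of the difference. It is written in Python rather than R, and the function names and the exact format strings are guesses for illustration only (the real ones are in the R script and the commit above). The assumption is that the id was padded to width 2 with a string conversion, which pads with a space, whereas an integer conversion with a zero flag pads with zeros.

```python
# Hypothetical reconstruction of the padding bug; not the actual R code.

def class_name_buggy(term_id):
    # Assumed original behaviour: "%2s" right-aligns the value in a
    # 2-character field and pads with a SPACE, so single-digit ids
    # end up with a space inside the class name.
    return "ADCAT_000%2s" % term_id


def class_name_fixed(term_id):
    # "%05d" formats the value as an integer and zero-pads it to
    # 5 digits, so every id yields a valid local name.
    return "ADCAT_%05d" % term_id


if __name__ == "__main__":
    for term_id in (2, 13):
        print(repr(class_name_buggy(term_id)), repr(class_name_fixed(term_id)))
    # 'ADCAT_000 2' 'ADCAT_00002'
    # 'ADCAT_00013' 'ADCAT_00013'
```

The embedded space is what the serializer writes out as \u0020, and because a name containing a space is not a valid prefixed name, the term falls back to the <...> form instead of odo:ADCAT_00002.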