anuzzolese / pyrml

pyRML is a Python-based engine for processing RML files. The RDF Mapping Language (RML) is a mapping language for expressing customized mapping rules from heterogeneous data structures and serializations to the RDF data model. RML is defined as a superset of the W3C-standardized mapping language R2RML, extending its applicability and scope by adding support for data in other structured formats.
Apache License 2.0

Input file path and type not abstracted from rml mapping #10

Open henrieglesorotos opened 1 year ago

henrieglesorotos commented 1 year ago

Currently, the input file can't be parameterised via the CLI or the API. It is hardcoded into the mapping file, e.g.:

rml:logicalSource [ 
    rml:source "./examples/artists/Artist.csv" ;
    rml:referenceFormulation ql:CSV
  ]

It would be more flexible to be able to provide this as a parameter.

henrieglesorotos commented 1 year ago

Reckon it's something we could work on, @anuzzolese? Also, are there any tests?

anuzzolese commented 1 year ago

Hi @henrieglesorotos, if I have understood the problem correctly, I would say that it is already implemented in some form (maybe not the best solution, but we can discuss improvements). In fact, pyrml supports the parametrisation of RML mapping files by relying on Jinja2.

RML files processed by pyrml can accept parameters using Jinja2 syntax, e.g.:

rml:logicalSource [ 
    rml:source "{{ source_file }}";
    rml:referenceFormulation ql:CSV
  ]

Then, when you instantiate your mapper in the Python code, you can do something like this:

from pyrml import RMLConverter
from rdflib import Graph

rml_map_file: str = '/path_to_your_rml'

# Map each parameter used in the RML file (here 'source_file') to its actual value.
# Named template_vars rather than vars to avoid shadowing the Python builtin.
template_vars = {'source_file': './examples/artists/Artist.csv'}

rml_mapper: RMLConverter = RMLConverter.get_instance()
g: Graph = rml_mapper.convert(rml_map_file, template_vars=template_vars)
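
Since the original request was about passing the input file from the command line, here is a minimal sketch of how a wrapper script might collect such parameters and hand them to `convert`. The `--var KEY=VALUE` flag and the helper names `parse_template_vars`, `build_parser` and `main` are assumptions made for illustration; they are not existing pyrml options. The pyrml call itself is left as a comment so that the sketch runs with the standard library alone.

```python
import argparse
from typing import Dict, List, Optional


def parse_template_vars(pairs: List[str]) -> Dict[str, str]:
    """Turn ['key=value', ...] into {'key': 'value', ...}."""
    result: Dict[str, str] = {}
    for pair in pairs:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        result[key] = value
    return result


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Hypothetical wrapper around RMLConverter"
    )
    parser.add_argument("mapping", help="path to the RML mapping template")
    # --var is a hypothetical flag, not an existing pyrml option.
    parser.add_argument(
        "--var",
        action="append",
        default=[],
        metavar="KEY=VALUE",
        help="template variable; may be given several times",
    )
    return parser


def main(argv: Optional[List[str]] = None) -> Dict[str, str]:
    args = build_parser().parse_args(argv)
    template_vars = parse_template_vars(args.var)
    # With pyrml installed, the dictionary would be passed on as in the
    # snippet above:
    #   RMLConverter.get_instance().convert(args.mapping, template_vars=template_vars)
    return template_vars
```

Under these assumptions, calling `main(["mapping.ttl", "--var", "source_file=./examples/artists/Artist.csv"])` yields the same dictionary as the hand-written one above.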

henrieglesorotos commented 1 year ago

This is excellent news! Can we add this to the docs? Also, shall we create some simple tests if they don't exist?

anuzzolese commented 1 year ago

Yes, contributing documentation and how-to guides would be most helpful.

henrieglesorotos commented 1 year ago

@anuzzolese

Having some issues. See example below:

We have some pre-existing RML rules in mapping.ttl:

@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix fnml: <http://semweb.mmlab.be/ns/fnml#>.
@prefix fno: <https://w3id.org/function/ontology#>.
@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#>.
@prefix void: <http://rdfs.org/ns/void#>.
@prefix dc: <http://purl.org/dc/terms/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql: <http://semweb.mmlab.be/ns/ql#>.
@prefix : <http://mapping.example.com/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix industries: <https://data.beamery.com/naics/2022/industries/>.

:rules_000 a void:Dataset.
:source_000 a rml:LogicalSource;
    rml:source "input.json";
    rml:iterator "$";
    rml:referenceFormulation ql:JSONPath.
:rules_000 void:exampleResource :map_Concept_000.
:map_Concept_000 rml:logicalSource :source_000;
    a rr:TriplesMap;
    rdfs:label "Concept".
:s_000 a rr:SubjectMap.
:map_Concept_000 rr:subjectMap :s_000.
:s_000 rr:template "https://data.beamery.com/naics/2022/industries/{NAICS22}#this";
    rr:graphMap :gm_000.
:gm_000 a rr:GraphMap;
    rr:template "https://data.beamery.com/naics/2022/industries/{NAICS22}".
:pom_000 a rr:PredicateObjectMap.
:map_Concept_000 rr:predicateObjectMap :pom_000.
:pm_000 a rr:PredicateMap.
:pom_000 rr:predicateMap :pm_000.
:pm_000 rr:constant skos:example.
:pom_000 rr:objectMap :om_000.
:om_000 a rr:ObjectMap;
    rml:reference "Index Item Description";
    rr:termType rr:Literal;
    rml:languageMap :language_000.
:language_000 rr:constant "en".

Input file: input.json

{"NAICS22":"315990","Index Item Description":"Hats, cloth, cut and sewn from purchased fabric (except apparel contractors)"}

I am getting:

python converter.py -o test.ttl mapping.ttl
Traceback (most recent call last):
  File "/Users/henrieglesorotos/repos/pyrml/converter.py", line 65, in <module>
    PyrmlCMDTool().do_map()
  File "/Users/henrieglesorotos/repos/pyrml/converter.py", line 34, in do_map
    g = rml_converter.convert(self.__args.input, self.__args.m)
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_mapper.py", line 131, in convert
    triple_mappings = RMLParser.parse(rml_mapping)
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_mapper.py", line 46, in parse
    return TripleMappings.from_rdf(g)
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_core.py", line 1586, in from_rdf
    return set([TripleMappings.__build(g, row) for row in qres])
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_core.py", line 1586, in <listcomp>
    return set([TripleMappings.__build(g, row) for row in qres])
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_core.py", line 1594, in __build
    predicate_object_maps = PredicateObjectMap.from_rdf(g, row.tm)
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_core.py", line 752, in from_rdf
    return list(map(lmbd(g), qres))
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_core.py", line 751, in <lambda>
    lmbd = lambda graph : lambda row :  PredicateObjectMap.__build(graph, row)
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_core.py", line 758, in __build
    predicates = PredicateBuilder.build(g, row.pom)
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_core.py", line 669, in build
    predicates += PredicateMap.from_rdf(g, predicate_ref)
  File "/Users/henrieglesorotos/repos/pyrml/pyrml/pyrml_core.py", line 629, in from_rdf
    pm = PredicateMap(row.tripleMap, row.map, row.termType, row.predicateMap)
  File "/Users/henrieglesorotos/repos/pyrml/venv/lib/python3.9/site-packages/rdflib/query.py", line 124, in __getattr__
    raise AttributeError(name)
AttributeError: tripleMap

Any ideas?
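
For reference, here is a small sketch of the output this mapping is expected to produce for the single record in input.json, worked out by hand rather than by pyrml. It expands the two `rr:template` strings from the mapping using only the standard library; the helper name `expand_template` is made up for this illustration, and backslash-escaped braces in templates are not handled.

```python
import json
import re
from urllib.parse import quote

# Copied from the mapping above.
SUBJECT_TEMPLATE = "https://data.beamery.com/naics/2022/industries/{NAICS22}#this"
GRAPH_TEMPLATE = "https://data.beamery.com/naics/2022/industries/{NAICS22}"
OBJECT_REFERENCE = "Index Item Description"
PREDICATE = "http://www.w3.org/2004/02/skos/core#example"


def expand_template(template: str, record: dict) -> str:
    """Replace each {name} placeholder with the percent-encoded record value."""

    def substitute(match: "re.Match[str]") -> str:
        value = str(record[match.group(1)])
        # Approximates the IRI-safe encoding that R2RML asks for in templates.
        return quote(value, safe="")

    return re.sub(r"\{([^{}]+)\}", substitute, template)


record = json.loads(
    '{"NAICS22":"315990","Index Item Description":'
    '"Hats, cloth, cut and sewn from purchased fabric (except apparel contractors)"}'
)

subject = expand_template(SUBJECT_TEMPLATE, record)
graph = expand_template(GRAPH_TEMPLATE, record)
literal = record[OBJECT_REFERENCE]

# One quad: subject, constant predicate, "en"-tagged literal, named graph.
print(f'<{subject}> <{PREDICATE}> "{literal}"@en <{graph}> .')
```

As I read the mapping, that single quad is the whole expected result, so the traceback points at a failure while parsing the mapping rather than at anything in the input data.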

henrieglesorotos commented 1 year ago

Btw, we generally work in YARRRML since it's simpler, and then convert using https://github.com/RMLio/yarrrml-parser

henrieglesorotos commented 1 year ago

FYI:

python --version == 3.9.0

pip freeze

click==8.1.7
decorator==5.1.1
Flask==2.2.2
importlib-metadata==6.8.0
isodate==0.6.1
itsdangerous==2.1.2
Jinja2==3.1.2
jsonpath-ng==1.5.3
lark-parser==0.12.0
MarkupSafe==2.1.3
numpy==1.23.4
pandas==1.5.1
ply==3.11
pyparsing==3.1.1
pyrml==0.3.0
python-dateutil==2.8.2
python-slugify==7.0.0
pytz==2023.3.post1
rdflib==6.2.0
shortuuid==1.0.9
six==1.16.0
SPARQLWrapper==2.0.0
text-unidecode==1.3
Unidecode==1.3.7
werkzeug==3.0.1
zipp==3.17.0

henrieglesorotos commented 1 year ago

Did you manage to replicate this, @anuzzolese?