RMLio / RMLStreamer

The RMLStreamer executes RML rules to generate high-quality Linked Data from multiple, originally (semi-)structured data sources in a streaming way.
http://rml.io/
MIT License

Support for functions? #16

Open micheldumontier opened 4 years ago

micheldumontier commented 4 years ago

Is there currently any support for functions? If so, is there documentation for this? If not, will there be, and when? Thanks!

ghsnd commented 4 years ago

Dear prof. Dumontier, support for functions is on the roadmap. We have no strict deadline in mind but, depending on resources, we're targeting this summer.

ghsnd commented 4 years ago

However, we do have a preliminary version, which still needs work and is not thoroughly tested yet. What exactly do you need?

micheldumontier commented 4 years ago

Oh, that's cool. I would very much like to be your alpha tester :)

I need, in order of importance:

  1. join (to make URIs from one or more fields)
  2. split (to break multivalued strings with a separator)
  3. uuid (to generate unique URIs)
  4. toLowerCase
  5. toUpperCase
  6. trim
  7. regex (to extract specific substrings)
  8. date/datetime (to generate date/time provenance of processing)
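For concreteness, here is a plain-Java sketch of what these requested transformations would do. The class and method names are purely illustrative, not RMLStreamer or GREL API; toLowerCase, toUpperCase, and trim are already built into java.lang.String.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RequestedFunctions {
    // 1) join: concatenate fields onto a base to form a URI
    static String join(String base, String... fields) {
        return base + String.join("_", fields);
    }

    // 2) split: break a multivalued string on a separator
    static String[] split(String value, String sep) {
        return value.split(Pattern.quote(sep));
    }

    // 3) uuid: random identifier for minting unique URIs
    static String uuid() {
        return UUID.randomUUID().toString();
    }

    // 7) regex: extract a specific substring (first capture group)
    static String regex(String value, String pattern) {
        Matcher m = Pattern.compile(pattern).matcher(value);
        return m.find() ? m.group(1) : null;
    }

    // 8) date/datetime: timestamp for processing provenance
    static String now() {
        return ZonedDateTime.now().format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(join("https://identifiers.org/pubmed:", "15165820"));
        for (String id : split("654924; 654925", ";")) System.out.println(id.trim());
        System.out.println(regex("GO:0046782", "GO:(\\d+)"));
    }
}
```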

mielvds commented 4 years ago

Any updates @ghsnd? Sign me up as an alpha user as well ;)

ghsnd commented 3 years ago

Hi, support for functions is available with the new release. Included functions:

  1. join (prob_array_join) (although making uris from fields can also be done by using templates)
  2. split (grel:string_split)
  3. uuid (idlab-fn:random)
  4. toLowerCase (grel:toLowerCase)
  5. toUpperCase (grel:toUpperCase)
  6. trim (grel:string_trim)

Other functions can be added by following these instructions.
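As a rough sketch of what such an added function looks like: in the RML functions framework, a custom function is typically a plain Java (static) method in a jar, paired with an FnO description in a Turtle file that maps a function IRI to the class and method. The class, method, and IRI below are hypothetical examples, not part of any shipped library.

```java
// Hypothetical custom function class. The function agent invokes the
// static method via reflection; the accompanying FnO Turtle description
// would bind a function IRI (e.g. an idsf: term) to this class and method.
public class MyFunctions {

    // Example: capitalize the first character, lowercase the rest.
    // Declared in the ttl as taking one string parameter and
    // returning a string.
    public static String titleCase(String value) {
        if (value == null || value.isEmpty()) return value;
        return Character.toUpperCase(value.charAt(0))
                + value.substring(1).toLowerCase();
    }
}
```

The jar containing this class and the ttl describing it are then placed where the tool loads functions from, per the instructions linked above.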

vemonet commented 3 years ago

Hi @ghsnd, we tried the RMLStreamer with mappings using the grel:string_split function on a CSV file, and it did not work.

We used the latest release 2.1.1 of RMLStreamer.jar with the Flink image supported by this release (we reused the same image found in your docker-compose.yml file at the 2.1.1 tag).

The YARRRML file we use:

```yaml
prefixes:
  rdf: "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  rdfs: "http://www.w3.org/2000/01/rdf-schema#"
  xsd: "http://www.w3.org/2001/XMLSchema#"
  grel: "http://users.ugent.be/~bjdmeest/function/grel.ttl#"
  idlab: "http://example.com/idlab/function/"
  idsf: "https://w3id.org/um/ids/rmlfunctions.ttl#"
  pubmed: "https://identifiers.org/pubmed:"
  drugbank: "https://identifiers.org/drugbank:"
  mesh: "https://identifiers.org/mesh:"
  uniprot: "https://identifiers.org/uniprot:"
  omim: "https://identifiers.org/mim:"
  schema: "https://schema.org/"
  sio: "http://semanticscience.org/resource/"
  bio2kg: "https://w3id.org/bio2kg/data/"
  ncbigene: "https://identifiers.org/ncbigene:"
  ncbitaxon: "http://purl.org/obo/owl/NCBITaxon#"

mappings:
  proteins:
    sources:
      - ['iproclass.csv~csv']
    s: uniprot:$(UniProtKB accession)
    po:
      - [a, sio:Protein]
      - [sio:hasProvider, bio2kg:graph/iproclass~iri]
      - [sio:affects, ncbitaxon:$(NCBI taxonomy)~iri]  # 9606
      - p: sio:isSupportedBy
        o:
            function: grel:string_split
            parameters:
                - [grel:p_string_sep, ";"]
                - [grel:valueParameter, $(PubMed)]
```

Here is a sample of the CSV file:

```csv
UniProtKB accession,UniProtKB ID,EntrezGene,RefSeq,NCBI GI number,PDB,Pfam,GO,PIRSF,IPI,UniRef100,UniRef90,UniRef50,UniParc,PIR-PSD accession,NCBI taxonomy,MIM,UniGene,Ensembl,PubMed ID,EMBL GenBank DDBJ,EMBL protein_id
"Q6GZX4","001R_FRG3G","2947773","YP_031579.1","81941549; 49237298","","PF04947","GO:0046782","","","UniRef100_Q6GZX4","UniRef90_Q6GZX4","UniRef50_Q6GZX4","UPI00003B0FD4","","654924; 654925","","","","15165820","AY548484","AAT09660.1"
"Q6GZX3","002L_FRG3G","2947774","YP_031580.1","49237299; 81941548","","PF03003","GO:0033644; GO:0016021","","","UniRef100_Q6GZX3","UniRef90_Q6GZX3","UniRef50_Q6GZX3","UPI00003B0FD5","","654924; 654925; 654926","","","","15165820","AY548484","AAT09661.1"
```

No errors are raised; all the regular predicate-object maps are generated except the one with the function.

The split function of those mappings works with the RMLmapper (you can try it directly here: https://rml.io/yarrrml/matey/#)

We run the RMLStreamer using the Flink CLI:

```shell
/opt/flink/bin/flink run -p 8 -c io.rml.framework.Main /mnt/RMLStreamer.jar toFile -m /mnt/mapping.rml.ttl -o /mnt/output.nt --job-name "RMLStreamer job"
```

Is there anything we need to do to make the functions work?

Note that we also tried to add custom functions following the documentation (either by adding the jar and ttl files to the right folders, or by recompiling RMLStreamer.jar), but the RMLStreamer could not find the functions we added (the same custom functions work with the RMLmapper).