carml / carml-jar

A CLI for CARML
MIT License

Add File Iteration Option #64

Open tobiasschweizer opened 1 year ago

tobiasschweizer commented 1 year ago

Hi there,

I would like to propose a feature request for a file iteration option.

Use case

I have a use case with roughly 35'000 XML source files that I want to convert to RDF.

Approaches Taken With CARML JAR

So far, I have taken two approaches:

  1. Call the CARML JAR once per input file: this means calling the JAR > 35'000 times. Each invocation pays a fixed startup overhead, regardless of the size or complexity of the XML file being mapped. Therefore, this is not an efficient way to do it.
  2. Generate a virtual XML file with a virtual root element that can be used in the mapping's iterators. Example: (echo "<projects>"; cat cordis/project-rcn-*.xml ; echo "</projects>") | java -jar carml-jar-1.0.0-SNAPSHOT-0.4.4.jar map -m rml-cordis_mapping.ttl -of ttl -p schema=http://schema.org/. This approach is quite performant, but it requires changing the mapping's rml:iterators (to /projects), and the virtual root element does not exist in the actual XML sources. Debugging is also hard when an error occurs. Another issue is validation, which is quite straightforward when applied to multiple single RDF outputs, but not to one huge output graph.

Feature Request

It would be convenient if CARML JAR accepted an input directory option -id and iterated over the files contained in that directory. This feature would also work for JSON input files, but would probably not make sense for CSV files. I think it would also be convenient if this could be achieved using carml:Stream in the mapping itself. This way, CARML JAR would completely control the handling of the input.

Ideally, there would also be a corresponding output directory option -od that the results would be written to (as single files), so that there is a one-to-one relation between input XML/JSON files and output RDF files. I would find this convenient because invalid graphs (as individual files) could be inspected separately and would not block the whole process.

Example of usage: java -jar carml-jar-1.0.0-SNAPSHOT-0.4.4.jar map -m rml-cordis-mapping.ttl -of ttl -p schema=http://schema.org/ -id cordis -od out

Mapping:

<#LogicalSourceProject> a rml:BaseSource ;
    rml:source [
        a carml:Stream
    ] ;
    rml:referenceFormulation ql:XPath ;
    rml:iterator "/project" .

If an error occurs, the name of the offending input XML/JSON file should be indicated as well. These files could then be removed from the input directory and looked at separately.
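Until such an option exists, the requested behavior (one output file per input, with failing inputs named and set aside) can be approximated with a shell loop. This is only a sketch: map_each and failed.log are illustrative names, not part of carml-jar, and the converter command is passed in as arguments so that the carml-jar invocation from the example above can be substituted.

```shell
#!/bin/sh
# Sketch: iterate an input directory, write one output per input file,
# and record the names of inputs that fail so they can be inspected
# separately. `map_each` and `failed.log` are made-up names.
map_each() {
  indir=$1; outdir=$2; shift 2   # remaining args: the converter command
  mkdir -p "$outdir"
  : > "$outdir/failed.log"
  for f in "$indir"/*.xml; do
    base=$(basename "$f" .xml)
    # e.g. the converter command could be:
    #   java -jar carml-jar-1.0.0-SNAPSHOT-0.4.4.jar map \
    #     -m rml-cordis-mapping.ttl -of ttl -p schema=http://schema.org/
    if ! "$@" < "$f" > "$outdir/$base.ttl" 2>>"$outdir/failed.log"; then
      echo "$f" >> "$outdir/failed.log"   # name the failing input file
    fi
  done
}
```

Note that this still pays the per-invocation startup cost of approach 1, since it starts one process per file; a built-in -id option could process all files in a single JVM, which is why native support would still be preferable.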

Let me know if this description is clear and where I could help.

pmaria commented 1 year ago

@tobiasschweizer thanks for this. The description is clear. I have to think a bit on how to best approach this. Will get back to you on that.

ktk commented 1 year ago

@tobiasschweizer just FYI we processed 50 million XML files using our https://github.com/zazuko/carml-service/. It is very fast that way and we then have pipelines that simply process mapping & data and send it to the service. Let me know when you need some boilerplate, you still have some hours in a contract with us :)

pmaria commented 1 year ago

@ktk that's pretty cool!

tobiasschweizer commented 1 year ago

@ktk nice to hear from you, it has been a while :-)

I am aware of the existence of https://github.com/zazuko/carml-service/. We talked about this back in July together with Bart. 50 million XML files is an impressive number. I am still an RML dwarf.

This is why I haven't used https://github.com/zazuko/carml-service/ so far:

  1. There is no automatic update of the carml-service when a new version of CARML is available. Currently, it uses CARML 0.4.1 (https://github.com/zazuko/carml-service/commit/d0842fb27694563019802e2706d1ebf29e303e4f). Ideally, there would be an automated process that makes a new release once a new CARML version is available.
  2. The effort of setting up and maintaining a Tomcat server to run the WAR.

If we could find a solution for 1., I would be happy to try 2. :-)

In any case, I would be happy if someone could review our current pipeline.