Open tobiasschweizer opened 1 year ago
@tobiasschweizer thanks for this. The description is clear. I have to think a bit about how best to approach this. Will get back to you on that.
@tobiasschweizer just FYI we processed 50 million XML files using our https://github.com/zazuko/carml-service/. It is very fast that way and we then have pipelines that simply process mapping & data and send it to the service. Let me know when you need some boilerplate, you still have some hours in a contract with us :)
@ktk that's pretty cool!
@ktk nice to hear from you, it has been a while :-)
I am aware of the existence of https://github.com/zazuko/carml-service/. We talked about this back in July together with Bart. 50 million XML files is an impressive number. I am still an RML dwarf.
This is why I haven't used https://github.com/zazuko/carml-service/ so far:
If we could find a solution for 1., I would be happy to try 2. :-)
In any case, I would be happy if someone could review our current pipeline.
Hi there,
I would like to propose a feature request for a file iteration option.
Use case
I have a use case with roughly 35'000 XML source files that I want to convert to RDF.
Approaches Taken With CARML JAR
So far, I have taken two approaches:
(echo "<projects>"; cat cordis/project-rcn-*.xml ; echo "</projects>") | java -jar carml-jar-1.0.0-SNAPSHOT-0.4.4.jar map -m rml-cordis_mapping.ttl -of ttl -p schema=http://schema.org/
This approach is quite performant, but it requires a change in the mapping's `rml:iterator`s (to `/projects`), and this virtual root element does not exist in the actual XML sources. Debugging is also hard in case an error occurs. Another issue is validation, which is quite straightforward if it can be applied to multiple single RDF outputs (instead of one huge output graph).

Feature Request
It would be convenient if CARML JAR accepted an input directory option `-id` and iterated over the files contained in this directory. This feature would also work for JSON input files, but would probably not make sense for CSV files. I think it would also be convenient if this could be achieved using `carml:Stream` in the mapping itself. This way, CARML JAR would completely control the handling of the input.

Ideally, there would be a corresponding output directory option `-od` where the results would be written to (as single files). There would thus be a one-to-one relation between input XML/JSON files and output RDF files. I would find this convenient because invalid graphs (as individual files) could be looked at separately and would not block the whole process.

Example of usage:
java -jar carml-jar-1.0.0-SNAPSHOT-0.4.4.jar map -m rml-cordis-mapping.ttl -of ttl -p schema=http://schema.org/ -id cordis -od out
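To make the intended `-id`/`-od` semantics concrete, here is a rough shell sketch of the per-file behaviour these options would provide, assuming (as the concatenation command above suggests) that the JAR reads a single XML document from stdin and writes RDF to stdout. Running one JVM per file like this is exactly the overhead a built-in iteration would avoid:

```sh
# Sketch of the intended one-to-one input/output behaviour;
# file names and output layout are assumptions, not carml-jar options.
mkdir -p out
for f in cordis/project-rcn-*.xml; do
  java -jar carml-jar-1.0.0-SNAPSHOT-0.4.4.jar map \
    -m rml-cordis-mapping.ttl -of ttl -p schema=http://schema.org/ \
    < "$f" > "out/$(basename "$f" .xml).ttl" \
  || echo "failed: $f" >&2   # keep going; failing files can be inspected separately
done
```

With per-file input, the mapping's `rml:iterator` stays relative to each file's own root element (no virtual `<projects>` wrapper needed), and a failing file does not block the rest of the run.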
Mapping:
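The snippet below is only an illustration, not the actual `rml-cordis-mapping.ttl`; the prefixes and the iterator are placeholders. It sketches what a logical source declared with `carml:Stream` could look like, leaving the choice of the concrete input document to CARML JAR:

```turtle
@prefix rml:   <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:    <http://semweb.mmlab.be/ns/ql#> .
@prefix carml: <http://carml.taxonic.com/carml/> .

<#ProjectMapping>
  rml:logicalSource [
    # carml:Stream: the input document is supplied by the caller
    # (here: CARML JAR iterating over the files in the -id directory)
    # instead of a fixed rml:source file reference.
    rml:source [ a carml:Stream ] ;
    rml:referenceFormulation ql:XPath ;
    # placeholder iterator, relative to each file's own root element
    rml:iterator "/project"
  ] .
```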
If an error occurs, the name of the input XML/JSON file should be indicated too. Then these files could be removed from the input directory and looked at separately.
Let me know if this description is clear and where I could help.