RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high-quality Linked Data from multiple, originally (semi-)structured data sources
http://rml.io
MIT License

rmlmapper should consume inputs and produce outputs in a streaming fashion whenever possible #126

Closed IanEmmons closed 2 years ago

IanEmmons commented 3 years ago

I am using rmlmapper to translate some very large CSV files into RDF. These files have hundreds of thousands of rows. If I try to translate them directly, rmlmapper runs out of memory and dies. I have worked around this for now by splitting my CSV files into smaller files. However, there is no reason why I should need to do this. Because rmlmapper must translate a CSV file one row at a time, with the translation of each row independent of any other row, there is no reason why rmlmapper cannot read one row at a time from the file, translate it, and then read the next row. (The common CSV parsers all have this capability.) This behavior is what I mean when I say "consume inputs in a streaming fashion." However, rmlmapper instead appears to read the entire CSV into memory at once. The exact same improvement could easily be made when consuming a query result from a relational database.

I realize that the same input streaming cannot work when consuming XML (and probably JSON), because XPath expressions in the RML can refer to any part of the XML document at any time.

Similar concerns apply to output. Ideally, output options should be available that allow rmlmapper to write each RDF statement as it is generated, rather than keeping all statements in memory until the end of the translation. When invoked via the command line interface, this could be accomplished by appending each statement to an N-Triples or N-Quads file. (A useful optimization there is to use Java's ZipOutputStream to write an N-Triples file inside a zip file.) When invoked as an in-process library, the calling code could supply a triple sink to rmlmapper, so that rmlmapper could pass each triple to the sink as it is generated, and the calling code could decide what to do with it.
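To make the triple-sink idea concrete, here is a minimal sketch of the zip-file variant. It assumes RDF4J's Rio writer API; the class name ZippedNTriplesSink, its accept method, and the wiring are invented for illustration, not rmlmapper's actual code:

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.RDFHandler;
import org.eclipse.rdf4j.rio.Rio;

public class ZippedNTriplesSink implements AutoCloseable {
   private final ZipOutputStream zip;
   private final RDFHandler writer;

   public ZippedNTriplesSink(String zipPath, String entryName) throws Exception {
      OutputStream os = Files.newOutputStream(Paths.get(zipPath));
      zip = new ZipOutputStream(os);
      zip.putNextEntry(new ZipEntry(entryName));
      // Rio's N-Triples writer serializes each statement as it arrives,
      // so nothing accumulates in memory.
      writer = Rio.createWriter(RDFFormat.NTRIPLES, zip);
      writer.startRDF();
   }

   // The mapper would call this once per generated statement instead of
   // adding the statement to an in-memory model.
   public void accept(Statement statement) {
      writer.handleStatement(statement);
   }

   @Override
   public void close() throws Exception {
      writer.endRDF();
      zip.closeEntry();
      zip.close();
   }
}

The same class would double as the in-process sink: the caller constructs it, hands it to the mapper, and the mapper never holds more than one statement at a time.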

These enhancements would improve the scalability and performance of rmlmapper considerably.

RichDijk commented 3 years ago

Have you tried the SDM-RDFizer (https://github.com/SDM-TIB/SDM-RDFizer)? It might be an intermediate solution.

IanEmmons commented 3 years ago

I have not tried that yet. I have for now worked around this issue by preprocessing my CSV files to split them into smaller files, and this does work. Thanks for the pointer to RDFizer, though. (I also found CARML, but I haven't tried that one yet either.)

DylanVanAssche commented 3 years ago

@IanEmmons Thanks for your interest in RML!

We're aware of this issue; that's why we created the RMLStreamer, which handles huge datasets in a streaming and scalable fashion! You can find it here: https://github.com/RMLio/RMLStreamer

Can you give it a shot to see if that works for your use case?

IanEmmons commented 3 years ago

It will take me quite some time to re-tool around alternate tools.

Further, my deployment environment has restrictive policies that require time-consuming approval of new libraries and tools before I can use them. So, I would much prefer to stick with rmlmapper if possible. The specific improvement I have in mind, here, should not be hard to implement. Assuming you use commons-csv to parse CSV files, the change would be from this:

try (
   InputStream is = ...;
   CSVParser parser = ...;
) {
   List<CSVRecord> records = parser.getRecords(); // reads the entire file into memory
   for (CSVRecord record : records) {
      processOneCsvRow(record);
   }
}

to this:

try (
   InputStream is = ...;
   CSVParser parser = ...;
) {
   for (CSVRecord record : parser) { // pulls one record at a time from the stream
      processOneCsvRow(record);
   }
}

Of course, the actual enhancement is undoubtedly a bit more complex, but even so this doesn't seem to be a heavy lift, and the benefit is large.
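For completeness, here is a self-contained version of the streaming variant, assuming Apache Commons CSV; the file name and the println body are placeholders, not rmlmapper code:

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class StreamingCsvDemo {
   public static void main(String[] args) throws Exception {
      try (
         Reader reader = Files.newBufferedReader(Paths.get("large-input.csv"));
         // Treat the first row as the header so columns can be fetched by name.
         CSVParser parser = CSVFormat.DEFAULT.builder()
               .setHeader()
               .setSkipHeaderRecord(true)
               .build()
               .parse(reader);
      ) {
         // CSVParser implements Iterable<CSVRecord>: each iteration pulls a
         // single record from the underlying Reader, so memory use stays flat
         // no matter how many rows the file has.
         for (CSVRecord record : parser) {
            System.out.println(record.get(0));
         }
      }
   }
}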

DylanVanAssche commented 3 years ago

@IanEmmons Thanks for your suggestion!

Unfortunately, streaming was not the main goal of the RMLMapper, and this is reflected in its architecture: features such as joins and other operations need access to more than one record at a time. To overcome these limitations, we created the RMLStreamer, which can handle GBs of data without any problem. It is also available as a Docker container. If you can run this container, you only need a Kafka client to interact with the data. Would that help your use case?

If not, we suggest you keep your original approach of splitting your CSV files into smaller ones.

no-response[bot] commented 3 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

CyberDaedalus00 commented 3 years ago

@IanEmmons - we've forked this repository and made a modification where, through the use of a particular LogicalSource, we can get input via stdin (on Windows and UN*X) or via a stream. In our case, we've integrated the forked RMLMapper into a custom NiFi RML mapping processor so that it can be used in workflows we've developed.

I'd be happy to share the JAR file and the LogicalSource definition we defined to see if it would meet your needs.

IanEmmons commented 3 years ago

That sounds interesting. Unfortunately, I'm swamped just now and not in a good space to try an entirely new architecture. I will keep the above possibilities in mind for the next time I come up for air.

DylanVanAssche commented 2 years ago

@CyberDaedalus00

I'd be happy to share the JAR file and the LogicalSource definition we defined to see if it would meet your needs.

Feel free to share the Logical Source definition; we are always interested!

IanEmmons commented 2 years ago

@CyberDaedalus00 I suspect that your stream-based LogicalSource would not really solve the core issue. It's great to hook rmlmapper up to a stream, but if it's going to read the stream all the way to the end before it starts mapping, then it will still consume too much memory for very large CSV files. The point of this request was that rmlmapper could consume one CSV record (or SQL result set row) at a time and translate it, then consume the next row, because the translation of any one row does not depend on other rows. (This same property does not hold for XML sources, because an XPath expression can access any part of the entire XML document.)

no-response[bot] commented 2 years ago

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.