IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

streamsx.fastcsv: reading files fast #76

Closed: hildrum closed this issue 7 years ago

hildrum commented 8 years ago

I'd like to contribute a fast file reader for CSV files. It only works over primitive data types, but it is about five or six times faster than the standard toolkit FileSource. It has approximately 20 tests associated with it.

The operator name is CSVFileSource. The namespace could be either com.ibm.streamsx.fastcsv or com.ibm.streamsx.adapters.fastcsv.

I also have a companion CSVFileSink, but it's only about 70% faster than the standard FileSink and so may not be as useful.

I would like to make this available for public use, but I won't be able to maintain it.

leongor commented 8 years ago

+1

hildrum commented 8 years ago

I've created a repository under my own account, streamsx.fastcsv so you can see what is being offered. If IBMStreams decides to take it, it can either be forked into IBMStreams, or I can put those files in a new repository under IBMStreams directly.

ddebrunner commented 8 years ago

Can the decoding be separated from the file reading? Then the code could go into streamsx.transform.

hildrum commented 8 years ago

@ddebrunner Potentially, yes, but in my experiments you lose quite a bit of throughput doing that, even when the operators are fused.

hildrum commented 8 years ago

Alternate proposal: Move the fastcsv parse to streamsx.transform. Since it doesn't transform its input, it's not a perfect match for streamsx.transform, but that seems a better option than creating a new repository.

leongor commented 8 years ago

As I understand it, CSVFileSource performs two tasks: reading a file and parsing it as CSV. Did you try to measure where most of the performance benefit comes from: the reading or the parsing?

hildrum commented 8 years ago

@leongor It saves time in the parsing by using routines like strtod instead of myfile >> adouble. The current version actually uses the same file-reading mechanism as the standard file source.

I used a more direct mechanism to read the file in an earlier version, and it was a little faster IIRC, but it didn't handle compression, which is why I use what's there now.
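
To make the strtod point above concrete, here is a minimal, self-contained C++ sketch (not taken from the operator's actual code) contrasting stream extraction with strtod-based parsing of a line of comma-separated doubles:

```cpp
#include <cstdlib>   // std::strtod
#include <sstream>
#include <string>
#include <vector>

// Stream-style extraction: convenient, but every >> goes through
// locale and stream-state machinery, adding per-field overhead.
std::vector<double> parseWithStreams(const std::string& line) {
    std::vector<double> values;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, ',')) {
        std::istringstream f(field);
        double d = 0.0;
        f >> d;
        values.push_back(d);
    }
    return values;
}

// strtod-style parsing: convert numbers directly out of the raw
// buffer in a single pass, advancing past each delimiter.
std::vector<double> parseWithStrtod(const std::string& line) {
    std::vector<double> values;
    const char* p = line.c_str();
    char* end = nullptr;
    while (*p != '\0') {
        double d = std::strtod(p, &end);
        if (end == p) break;  // stop when no number can be parsed (empty field, trailing junk)
        values.push_back(d);
        p = (*end == ',') ? end + 1 : end;  // skip the delimiter if present
    }
    return values;
}
```

The stream version pays per-field overhead for stream and locale handling, while the strtod version walks the character buffer once, which is typically where most of the parsing speedup in this style of reader comes from.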

leongor commented 8 years ago

If it's only about the parsing part, then AdaptiveParser handles the CSV format very effectively. It's based on boost::spirit, one of the fastest C++ parser frameworks.
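
For reference, here is a minimal Boost.Spirit (Qi) sketch of parsing a CSV line of doubles; this is generic Spirit usage for illustration, not the AdaptiveParser's actual implementation:

```cpp
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    namespace qi = boost::spirit::qi;

    std::string line = "3.14,2.71,1.41";
    std::vector<double> values;

    // double_ % ',' parses a comma-separated list of doubles
    // into the vector in a single pass over the input.
    std::string::const_iterator it = line.begin(), end = line.end();
    bool ok = qi::parse(it, end, qi::double_ % ',', values);

    if (ok && it == end) {
        for (double d : values) std::cout << d << '\n';
    }
    return 0;
}
```

Because the grammar is a C++ expression compiled into inline code, there is no run-time grammar interpretation, which is a large part of why Spirit-based parsers tend to be fast.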

hildrum commented 8 years ago

It would be interesting to do a performance comparison.

In my experiments, there's a substantial performance hit separating the file reading from the parsing. The best thing for performance might be to make a file source based on the AdaptiveParser.

hildrum commented 8 years ago

Closing this issue; I will contribute CSVFileSource to streamsx.transform. If AdaptiveParser is as fast or faster, it can be removed.

ddebrunner commented 8 years ago

> Move the fastcsv parse to streamsx.transform. Since it doesn't transform its input

It does transform, from a line of text to an SPL tuple (I assume).

I think it should just be a CSV parser in streamsx.transform, not a file source as well. If there's a performance issue in pushing data from FileSource to a downstream parser, then that needs to be addressed. Having multiple file sources potentially leads to confusion and inconsistency, especially if they have different levels of support for consistent regions, etc.

chanskw commented 8 years ago

@hildrum I think this is an interesting and useful contribution here.

Looking at the discussions, I think there is some more work and experimentation we should do around the contribution before we can decide where this should go. For example, how much of a penalty do we have to pay by splitting the parsing from the file reading? AdaptiveParser can do something similar, so how does the performance compare between the two?

I propose that we create a new repository called streamsx.incubationhub. The goal of the project would be to take in contributions like this, where we may not be ready to spin off a new repository, and to provide a place for developers to innovate, experiment, and share ideas.

When we think we are ready, then we can take the code from the incubation hub and spin off a new project, or integrate it with an existing project.

Thoughts? Thanks!

petenicholls commented 8 years ago

I think having an incubation repository, or something along those lines, where prototype operators, operators written for a POC, or very task-specific operators can be shared would be useful. I had discussed putting this particular operator in the samples (as a sample of a faster CSV file source operator), but it is not really a sample; it is more of an experimental operator focused on faster ingest of CSV files. I think as Streams progresses we may see more and more of these very specific, one-off operators that people create, which could be the basis for a future operator or toolkit but are not quite fully developed yet.

+1 to streamsx.incubationhub (although I think there might be a better name for this; I just can't come up with it right now)

hildrum commented 8 years ago

@ddebrunner I can't separate it out into a parser before tomorrow. After tomorrow, I won't have the ability to develop on Streams.

As far as efficiency is concerned, this post shows that FileSource takes less than half the time of FileSource + Parse. That's not precisely the setup that needs to be tested here (because it does a tuple copy between the FileSource and the Parse), but it does suggest that there would be a performance gap between FileSource and FileSource + Parse even without the copy. I don't know whether you count that as a performance problem or not.

leongor commented 8 years ago

+1 to streamsx.incubationhub

ddebrunner commented 8 years ago

@hildrum Not sure there's a rush to put it into streamsx.transform as a single operator. Will it remain available in your fork? Then maybe someone else will have the itch to separate it out.

Or maybe you leave it as a pull request against streamsx.transform.

Sorry to hear you won't be able to contribute going forward.

brandtol commented 7 years ago

Performance improvements from fastcsv and the OCL CSV parser (based on Kris and Rasheed's work) have been incorporated into the CSVParse operator in the TEDA toolkit, which is much faster than the stock SPL FileSource in csv mode. So I would recommend closing this issue.

https://www.ibm.com/support/knowledgecenter/SSCRJU_4.2.0/com.ibm.streams.toolkits.doc/spldoc/dita/tk$com.ibm.streams.teda/op$com.ibm.streams.teda.parser.text$CSVParse.html

hildrum commented 7 years ago

Glad to hear that made it in. I'm closing.