hildrum closed this issue 7 years ago
+1
I've created a repository under my own account, streamsx.fastcsv so you can see what is being offered. If IBMStreams decides to take it, it can either be forked into IBMStreams, or I can put those files in a new repository under IBMStreams directly.
Can the decoding be separated from the file reading? Then the code could go into streamsx.transform
@ddebrunner Potentially, yes, but in my experiments you lose quite a bit of throughput doing that, even when the operators are fused.
Alternate proposal: move the fastcsv parser to streamsx.transform. Since it doesn't transform its input, it's not a perfect match for streamsx.transform, but that seems a better option than creating a new repository.
As I understand it, CSVFileSource performs two tasks: reading a file and parsing it as CSV. Did you measure where most of the performance benefit comes from: the reading or the parsing?
@leongor It saves time in the parsing by using routines like `strtod` instead of `myfile >> adouble`. The current version actually uses the same file-reading mechanism as the standard file source.
I used a more direct mechanism to read the file in an earlier version, and it was a little faster IIRC, but it didn't handle compression, and so that's why I use what's there now.
If it's only about the parsing part, then AdaptiveParser handles the CSV format very effectively. It's based on boost::spirit, one of the fastest C++ parser frameworks.
It would be interesting to do a performance comparison.
In my experiments, there's a substantial performance hit separating the file reading from the parsing. The best thing for performance might be to make a file source based on the AdaptiveParser.
Closing this issue, and will contribute `CSVFileSource` to streamsx.transform. If AdaptiveParser is as fast or faster, it can be removed.
"Move the fastcsv parse to streamsx.transform. Since it doesn't transform its input"
It does transform, from a line of text to an SPL tuple (I assume).
I think it should just be a CSV parser in streamsx.transform, not a file source as well. If there's a performance issue in pushing data from FileSource to a downstream parser, then that needs to be addressed. Having multiple file sources potentially leads to confusion and inconsistency, especially if they have different levels of support for consistent region, etc.
@hildrum I think this is an interesting and useful contribution here.
Looking at the discussions, I think there is some more work and experimentation we should do around the contribution before we can decide where this should go. For example, how much of a penalty do we pay by splitting the parsing from reading the file? AdaptiveParser can perform something similar, so how does the performance compare between the two?
I propose that we create a new repository called streamsx.incubationhub. The goal of the project is to take in contributions like this where we may not be ready to spin off a new repository, but provide a place for developers to innovate, experiment and share ideas.
When we think we are ready, then we can take the code from the incubation hub and spin off a new project, or integrate it with an existing project.
Thoughts? Thanks!
I think having an incubation repository, or something along those lines, where prototype operators, operators written for a POC, or very task-specific operators can be shared would be useful. I had discussed putting this particular operator in the samples (as a sample of a faster CSV file source operator), but it is not really a sample; it is more of an experimental operator focused on faster ingest of CSV files. I think as Streams progresses we may see more and more of these very specific, one-off operators that people create, which could be the basis for a future operator/toolkit but are not quite fully developed yet.
+1 to streamsx.incubationhub (although I think there might be a better name for this I just can't come up with it right now)
@ddebrunner I can't separate it out into a parser before tomorrow. After tomorrow, I won't have the ability to develop on Streams.
As far as the efficiency is concerned, this post shows that `FileSource` alone takes less than half the time of `FileSource` + `Parse`. That's not precisely the setup that needs to be tested here (because it does a tuple copy between the `FileSource` and the `Parse`), but it does suggest that there would be a performance gap between `FileSource` and `FileSource` + `Parse` even without the copy. I don't know if you count that as a performance problem or not.
+1 to streamsx.incubationhub
@hildrum Not sure there's a rush to put it into streamsx.transform as a single operator. Will it remain available in your fork? Then maybe someone else will have the itch to separate it.
Or maybe you leave it as a pull request against streamsx.transform.
Sorry to hear you won't be able to contribute going forward.
Performance improvements from fastcsv and the OCL CSV parser (based on Kris's and Rasheed's work) have been incorporated into the CSVParse operator in the TEDA toolkit, which is much faster than the stock SPL FileSource in CSV mode. So I would recommend closing this issue.
Glad to hear that made it in. I'm closing.
I'd like to contribute a fast file reader for CSV files. It only works over primitive data types, but it is about five or six times faster than the standard toolkit `FileSource`. It has approximately 20 tests associated with it. The operator name is `CSVFileSource`. The namespace could be either `com.ibm.streamsx.fastcsv` or `com.ibm.streamsx.adapters.fastcsv`. I also have a companion `CSVFileSink`, but it's only about 70% faster than the standard `FileSink` and so may not be as useful. I would like to make this available for public use, but I won't be able to maintain it.