New method to parse CSV straight to query or array of arrays for performance on large files

bdw429s commented 1 year ago

I found a few major inefficiencies in the code that maybe can get fixed in a new release. The biggest was that all the data gets read into memory basically 4 times at once.

The entire CSV is read into a string instead of using the Java methods to stream the file line by line
Then the java creates objects that represent the data
Then the CF code creates arrays of arrays to represent the data
Then the CF code populates a query object to represent the data

[section moved to separate issue]

We can reduce those 4 bullets above to just the outer 2 by

streaming the file line by line instead of reading the entire file front. This is mostly for memory savings.
directly creating the query object, adding each row as we go

I would create a new method for this as in doing so, we lose the ability to know

the max columns ahead of time
detecting the column types ahead of time

Here's what it looks like to directly build out a query object as we go, defaulting all column types to varchar and assuming all rows will have the same column count:

    query function parseFromFileToQuery( required string path, required any format ){
        getFileHelper()
            .throwErrorIFfileNotExists( arguments.path )
            .throwErrorIFnotCsvOrTextFile( arguments.path );

        try {
            var parser = library().createJavaObject( "org.apache.commons.csv.CSVParser" )
                .parse(
                    createObject( 'java', 'java.io.File' ).init( arguments.path ),
                    createObject( 'java', 'java.nio.charset.Charset' ).forName( 'UTF-8' ),
                    format
                );
                var it = parser.iterator();
                var colList = arrayToList( it.next().values() );
                var qryResult = queryNew( colList, colList.listMap( (i)=>'varchar' ) );
                while( it.hasNext() ) {
                    qryResult.addRow( it.next().values() );
                }
        } finally {
            parser.close();
        }
        return qryResult;
    }

And this can be made even faster and use less memory on very large files by making a version that skips the query object entirely (which is pretty inefficient to populate with millions of rows in Lucee) and just create raw arrays:

    struct function parseFromFileToQuery( required string path, required any format ){
        getFileHelper()
            .throwErrorIFfileNotExists( arguments.path )
            .throwErrorIFnotCsvOrTextFile( arguments.path );

        try {
            var parser = library().createJavaObject( "org.apache.commons.csv.CSVParser" )
                .parse(
                    createObject( 'java', 'java.io.File' ).init( arguments.path ),
                    createObject( 'java', 'java.nio.charset.Charset' ).forName( 'UTF-8' ),
                    format.withTrailingDelimiter()
                );
                var it = parser.iterator();
                // make optional if there are no column headers
                var colList = arrayToList( it.next().values() );
                var result = {
                    "columns": colList,
                    "data" : []
                };
                while( it.hasNext() ) {
                    result.data.append( it.next().values() );
                }
        } finally {
            parser.close();
        }
        return result;
    }

In my 600Meg 6 million row test file, the query version takes around 2 minutes, but the raw arrays only take around 40 seconds on Lucee 5.4.

cfsimplicity commented 1 year ago

@bdw429s I've had a first go at implementing this with readCsv(). See the tests for how it works.

For now I've just stuck to returning an array (of arrays), or nothing. Queries are problematic as you say, especially cross-engine, and that's largely why csvToQuery() is so poor in its currently implementation.

Please give this a go and let me know where it could be improved. Thanks!

bdw429s commented 1 year ago

@cfsimplicity I just reviewed the new readCsv() stuff and it looks really good! I'll test it here in a bit, but the only thing I'd ask to changes at this point is that it is still really convenient to have the first row read as headers automatically when needed. That allows me to simply loop over the data without need to skip anything or slice up the array and a quick data.len() represents the actual length of the data.

Would you consider flag to mark the first row as headers and change the return format to something like

{
  "columns" : [ "col1", "col2" ],
  "data" : [
    [ "val1", "val2" ],
    [ "val3", "val4" ]
  ]
}

Another huge benifit of this, is this is the default representation of a query object when JSON serialized in CFML which provides people with a dead easy path to convert their data into a proper query if they want like so:

deserializeJSON( jsonRepresentationOfCSV, false )

When there is no header row, the columns array could simply be empty, or you could provide two execute methods, one for headers and one without so the one without always just returns the data directly.

cfsimplicity commented 1 year ago

@bdw429s Commons CSV does support skipping the header row so you could do that using the withHeader() and withSkipHeaderRecord() options. But I agree it's a common need and should be easy to do with a single flag so I've added .withFirstRowIsHeader()

https://github.com/cfsimplicity/spreadsheet-cfml/blob/708465bf32a3ba083ce661be0bf09a1bbe4a0b89/test/specs/readCsv.cfm#L51

This also handles cases where you are skipping rows at the top but still want to auto-detect the header from the first non-skipped row.

https://github.com/cfsimplicity/spreadsheet-cfml/blob/708465bf32a3ba083ce661be0bf09a1bbe4a0b89/test/specs/readCsv.cfm#L62

Excellent idea regarding making the return format DeserializeJson() friendly. Done. I've gone for consistency so you will always get a struct, but the columns value will be empty if no headers are specified/detected.

bdw429s commented 1 year ago

Thanks!

JamoCA commented 1 year ago

When returning structs that may be serialized to json (or dumped), I've started using case-insensitive ordered structs so that I could prioritize certain keys at the top of the output for debugging purposes. (Ordered structs are supported in ACF2016+.) This approach makes debugging easier. I haven't noticed any performance difference in any of our projects or libraries. (Without using an ordered struct, I'm not sure how key order is determined; alphabetic or random? I believe that it changes depending upon which CFML platform/version and function being used.)

Are you able to use ordered structs so that "columns" is always explicitly the first key?

var result = [
    "columns": colList,
    "data" : []
];

cfsimplicity commented 1 year ago

@JamoCA Good point James: https://github.com/cfsimplicity/spreadsheet-cfml/commit/d1db0e55c2f3c01a516f6f48ef0ed3b718bd52dc

cfsimplicity / spreadsheet-cfml

New method to parse CSV straight to query or array of arrays for performance on large files #340