cfsimplicity / spreadsheet-cfml

Standalone library for working with spreadsheets and CSV in CFML
MIT License
126 stars 35 forks source link

Reading CSV: New method provide listener UDF that runs once for each row #341

Closed bdw429s closed 10 months ago

bdw429s commented 10 months ago

Just spitballing here, but if I have a massive file and I need to run some logic for each row, but I DON'T need any data structure created that represents all the data at once, make a method that accepts a closure and invokes it once for every row in the CSV file for row-by-row pricessing. In theory, the file could be as many gigs as you want but we'd never have more than one row in memory at a time (save what the garbage collector hadn't removed)

The code could look something like this (untested):


    void function ExecuteClosurePerRow( required string path, required any format, required function udf ){
        getFileHelper()
            .throwErrorIFfileNotExists( arguments.path )
            .throwErrorIFnotCsvOrTextFile( arguments.path );

        try {
            var parser = library().createJavaObject( "org.apache.commons.csv.CSVParser" )
                .parse(
                    createObject( 'java', 'java.io.File' ).init( arguments.path ),
                    createObject( 'java', 'java.nio.charset.Charset' ).forName( 'UTF-8' ),
                    format
                );
                var it = parser.iterator();
                while( it.hasNext() ) {
                    udf( it.next().values() )
                }
        } finally {
            parser.close();
        }
    }

If there is a header row, we could grab the headers first and

This would truly allow for processing of an unlimited number of rows with little memory overhead.

bdw429s commented 10 months ago

Here is a simple version that passes a struct to each UDF call. It seems to perform about as well as passing the array, but would be easier to use. I'm also showing passing the current row number to each call as well,

    void function ExecuteClosurePerRow( required string path, required any format, required function udf ){
        getFileHelper()
            .throwErrorIFfileNotExists( arguments.path )
            .throwErrorIFnotCsvOrTextFile( arguments.path );

        try {
            var parser = library().createJavaObject( "org.apache.commons.csv.CSVParser" )
                .parse(
                    createObject( 'java', 'java.io.File' ).init( arguments.path ),
                    createObject( 'java', 'java.nio.charset.Charset' ).forName( 'UTF-8' ),
                    format
                );
                var it = parser.iterator();
                // make optional if there are no column headers
                var colList = it.next().values();
                var numCols = arrayLen( colList );
                while( it.hasNext() ) {
                    var row = {};
                    var values = it.next().values();
                    for( var i=1; i <= numCols; i++ ){
                        row[ colList[ i ] ] = values[i];
                    }
                    udf( row, record.getRecordNumber() )
                }
        } finally {
            parser.close();
        }
    }

Now, extra credit would be to create a pool of threads to process these in parallel :) I can show you how to easily do this with the parallel flag of array.each() :)

bdw429s commented 10 months ago

It seems to perform about as well as passing the array,

Nope, I need to take that back! I guess I didn't test it correctly but under no circumstances should we loop over each value on a massive CSV, lol. I go right back to the CPU thrashing OOM errors when I do that. For massive files, the only sustainable approach is to pass the native array directly back and make the user dereference it themselves.

cfsimplicity commented 10 months ago

A potential issue here is that, once again, the row .values() passed to the UDF are a native java arrays. This means member array functions and possibly other operations in the UDF are likely to fail (even on Lucee in my testing). In other words, care will have to be taken with how the UDF is written.

cfsimplicity commented 10 months ago

https://github.com/cfsimplicity/spreadsheet-cfml/blob/e0b0f38d72e489de3fb3037f80cd7f14f7723f7b/test/specs/readCsv.cfm#L120

https://github.com/cfsimplicity/spreadsheet-cfml/blob/e0b0f38d72e489de3fb3037f80cd7f14f7723f7b/test/specs/readCsv.cfm#L137

bdw429s commented 10 months ago

the row .values() passed to the UDF are a native java arrays. This means member array functions and possibly other operations in the UDF are likely to fail (even on Lucee in my testing)

Interesting-- I have yet to come across a scenario in which Lucee fails with a native Java array. Obviously, member functions don't work, but any arrayFunction just wraps it in a "real" Lucee Array behind the scenes from what I've seen. I actually wish Lucee would just use it directly (which is possible to do) but I'm sure wrapping it is simpler for Lucee behind the scenes.

cfsimplicity commented 10 months ago

Yes, it's just member functions that fail in Lucee, e.g. values.map(). But ArrayMap() works with the native array in both Lucee and ACF. Just something devs will need to watch out for when passing closures. I tend to use member functions by default unless I hit an issue.

bdw429s commented 10 months ago

Right then, we're on the same page. Sadly, I've stopped using member function by default in many cases-- specifically for simple values with the issues both CF engines have had. Stuff like

simpleValue.len()

wouldn't work in the past if simpleValue was something like a boolean. Lucee and Adobe have improved in this area, but it's still not 100% (as the arrays show). I'm quite careful now to only use member functions when I know for certain the source of the incoming variable. The same issue exists when using Java Maps directly, which can be passed to a function argument of type Struct, but still don't have struct member functions.

cfsimplicity commented 10 months ago

The current row number will also be passed into the UDF as the second parameter. https://github.com/cfsimplicity/spreadsheet-cfml/blob/1908d084fe4fa440d2ce4a687033866a4971e882/test/specs/readCsv.cfm#L137