Here is a simple version that passes a struct to each UDF call. It seems to perform about as well as passing the array, but would be easier to use. I'm also showing how to pass the current row number to each call.
void function ExecuteClosurePerRow( required string path, required any format, required function udf ){
    getFileHelper()
        .throwErrorIFfileNotExists( arguments.path )
        .throwErrorIFnotCsvOrTextFile( arguments.path );
    try {
        var parser = library().createJavaObject( "org.apache.commons.csv.CSVParser" )
            .parse(
                createObject( 'java', 'java.io.File' ).init( arguments.path ),
                createObject( 'java', 'java.nio.charset.Charset' ).forName( 'UTF-8' ),
                format
            );
        var it = parser.iterator();
        // grab the header row to use as struct keys (make optional if there are no column headers)
        var colList = it.next().values();
        var numCols = arrayLen( colList );
        while( it.hasNext() ){
            var record = it.next();
            var values = record.values();
            var row = {};
            for( var i = 1; i <= numCols; i++ ){
                row[ colList[ i ] ] = values[ i ];
            }
            udf( row, record.getRecordNumber() );
        }
    }
    finally {
        // parser may not exist if parse() itself failed
        if( !isNull( parser ) ) parser.close();
    }
}
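For reference, a rough usage sketch (untested; it assumes the method ends up exposed on the library, uses commons-csv's RFC4180 format constant, and the closure body and "name" column are purely illustrative):

var format = createObject( 'java', 'org.apache.commons.csv.CSVFormat' ).RFC4180;
ExecuteClosurePerRow(
    path = expandPath( 'data.csv' ),
    format = format,
    udf = function( row, rowNumber ){
        // "row" arrives as a struct keyed by the header names
        writeOutput( "Row #rowNumber#: #row[ 'name' ]#<br>" );
    }
);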
Now, extra credit would be to create a pool of threads to process these in parallel :) I can show you how to easily do this with the parallel flag of array.each()
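For what it's worth, a very rough sketch of that batching idea (untested; the batch size, thread count, and reuse of the it/udf names from the snippet above are illustrative, and the four-argument arrayEach() signature shown is Lucee's):

var batch = [];
var batchSize = 500;
while( it.hasNext() ){
    batch.append( it.next().values() );
    if( batch.len() == batchSize || !it.hasNext() ){
        // parallel = true, maxThreads = 4
        // note: the index arrayEach passes to the closure is batch-relative, not the CSV row number
        ArrayEach( batch, udf, true, 4 );
        batch = [];
    }
}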
> It seems to perform about as well as passing the array
Nope, I need to take that back! I guess I didn't test it correctly, but under no circumstances should we loop over each value on a massive CSV, lol. I go right back to the CPU-thrashing OOM errors when I do that. For massive files, the only sustainable approach is to pass the native array directly back and make the user dereference it themselves.
A potential issue here is that, once again, the values() arrays passed to the UDF for each row are native Java arrays. This means member array functions and possibly other operations in the UDF are likely to fail (even on Lucee, in my testing). In other words, care will have to be taken with how the UDF is written.
> the values() arrays passed to the UDF for each row are native Java arrays. This means member array functions and possibly other operations in the UDF are likely to fail (even on Lucee, in my testing)
Interesting-- I have yet to come across a scenario in which Lucee fails with a native Java array. Obviously, member functions don't work, but any headless array function just wraps it in a "real" Lucee array behind the scenes from what I've seen. I actually wish Lucee would just use it directly (which is possible to do), but I'm sure wrapping it is simpler for Lucee internally.
Yes, it's just member functions that fail in Lucee, e.g. values.map(). But ArrayMap() works with the native array in both Lucee and ACF. Just something devs will need to watch out for when passing closures. I tend to use member functions by default unless I hit an issue.
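A quick illustration of the difference (untested sketch; javaCast is just one way to get a native String[] to test with):

// create a native Java String[] similar to what record.values() returns
var javaArray = javaCast( 'string[]', [ 'a', 'b', 'c' ] );

// headless function works: the engine wraps the native array behind the scenes
var upper = ArrayMap( javaArray, function( item ){
    return UCase( item );
} );

// member function on the native array is likely to error on both engines
// javaArray.map( function( item ){ return UCase( item ); } );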
Right then, we're on the same page. Sadly, I've stopped using member functions by default in many cases-- specifically for simple values, given the issues both CF engines have had. Stuff like simpleValue.len() wouldn't work in the past if simpleValue was something like a boolean. Lucee and Adobe have improved in this area, but it's still not 100% (as the arrays show). I'm quite careful now to only use member functions when I know for certain the source of the incoming variable. The same issue exists when using Java Maps directly, which can be passed to a function argument of type Struct but still don't have struct member functions.
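A small sketch of that last point (untested; the function and key names are made up):

void function showMapBehaviour( required struct data ){
    // the underlying Java API is still available on the native Map
    writeOutput( data.containsKey( 'name' ) );
    // but CFML struct member functions may fail here, e.g.
    // data.keyExists( 'name' );
}

var javaMap = createObject( 'java', 'java.util.HashMap' ).init();
javaMap.put( 'name', 'test' );
showMapBehaviour( javaMap ); // accepted by the struct-typed argument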
The current row number will also be passed into the UDF as the second parameter. https://github.com/cfsimplicity/spreadsheet-cfml/blob/1908d084fe4fa440d2ce4a687033866a4971e882/test/specs/readCsv.cfm#L137
Just spitballing here, but say I have a massive file and I need to run some logic for each row, but I DON'T need any data structure created that represents all the data at once. We could make a method that accepts a closure and invokes it once for every row in the CSV file, for row-by-row processing. In theory, the file could be as many gigs as you want, but we'd never have more than one row in memory at a time (save whatever the garbage collector hadn't yet removed).
The code could look something like this (untested):
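(Filling in a rough sketch of that idea, assuming the library wraps Apache Commons CSV's CSVParser; the method and argument names here are just illustrative.)

void function processEachRow( required string path, required any format, required function udf ){
    var parser = createObject( 'java', 'org.apache.commons.csv.CSVParser' )
        .parse(
            createObject( 'java', 'java.io.File' ).init( arguments.path ),
            createObject( 'java', 'java.nio.charset.Charset' ).forName( 'UTF-8' ),
            arguments.format
        );
    try {
        var it = parser.iterator();
        while( it.hasNext() ){
            var record = it.next();
            // hand the native values array straight to the caller's closure; only one row is in memory at a time
            udf( record.values(), record.getRecordNumber() );
        }
    }
    finally {
        parser.close();
    }
}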
If there is a header row, we could grab the headers first and make them available to the closure as well.
This would truly allow for processing of an unlimited number of rows with little memory overhead.