cfsimplicity / spreadsheet-cfml

Standalone library for working with spreadsheets and CSV in CFML
MIT License

Import 800MB csv file with 200+ columns #303

Closed: Daemach closed this issue 9 months ago

Daemach commented 2 years ago

Julian, do you have a method that will stream a large CSV file in a fashion that would allow me to import the data into a database line by line or in chunks? I don't need all the columns, only about 20 of the 200 that exist in the csv file.

cfsimplicity commented 2 years ago

Short answer: no.

But your question has made me realize the csv parsing could be improved. Apache Commons CSV does in fact support record-by-record parsing as well as up-front loading into memory, which is what I've been doing with csvToQuery().

I've switched to record-wise parsing hoping to see a dramatic improvement, but sadly it was only modest: about 15% in terms of speed.

Obviously csvToQuery() still has to generate a query in memory so it wouldn't be able to handle the sort of file you're talking about anyway.
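To illustrate why: a csvToQuery() call is along these lines (check the docs for the exact argument names, and adjust the instantiation path to your install), and the entire result ends up in a single CFML query object in memory.

spreadsheet = New spreadsheet.Spreadsheet();
// Parses the whole file and builds a query object holding every row and column
data = spreadsheet.csvToQuery( filepath="/path/to/your/file.csv", firstRowIsHeader=true );
// For an 800MB source file that query alone would overwhelm a typical heap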

I did try using the Commons CSV library directly to iterate over a large file (100 columns x 10K rows) and it ran for a while without memory errors, but seemed to be very processor intensive and eventually timed out.

For your use case I would look at your database's bulk import features. I've no experience with doing that but I'm pretty sure CSV imports are a common requirement.
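If you're on SQL Server, for instance, I believe a BULK INSERT along these lines would hand the parsing work to the database. The table name, file path and datasource below are placeholders, and the SQL Server service account needs read access to the file, so treat this as a sketch only:

// Sketch: let SQL Server parse and load the file itself (placeholders throughout)
queryExecute( "
	BULK INSERT dbo.target_table
	FROM 'C:\data\import.csv'
	WITH (
		FIELDTERMINATOR = ',',
		ROWTERMINATOR = '\n',
		FIRSTROW = 2 -- skip the header row
	)
", {}, { datasource: "myDatasource" } );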

cfsimplicity commented 2 years ago

Actually, I've just discovered that the underwhelming performance and timeouts I was seeing were due to the nested column iterator syntax I was using with Commons CSV. Replacing it with a straightforward for-loop over the array of row values speeds things up dramatically.
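For reference, the slow pattern was roughly this shape (a reconstruction rather than the exact code), with a second iterator nested inside the record loop; recordIterator here stands for the parser iterator used in the full example below:

// Reconstruction of the slower nested-iterator pattern (illustrative only)
while( recordIterator.hasNext() ){
	record = recordIterator.next();
	columnIterator = record.iterator(); // CSVRecord is Iterable, so each record exposes its own iterator
	while( columnIterator.hasNext() ){
		value = columnIterator.next();
		// ...build up the query row from each value...
	}
}

Looping over the array returned by record.toList() instead turned out to be far quicker.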

I was able to run csvToQuery() on a 200 column x 10K row CSV (20MB) in under 10 seconds on a Lucee instance with 1GB of heap. It failed on a memory-starved 128MB instance but that's to be expected given that, as I say, it's constructing a query object in memory.

I also tried using Commons CSV directly on that low memory instance to avoid building the query, and I was able to read in the 20MB/200 column CSV and write it out again line by line in about a second!

csvFilePath = "/path/to/your/file.csv";
// Grab the RFC4180 preset format from the commons-csv jar bundled with the library
format = CreateObject( "java", "org.apache.commons.csv.CSVFormat", ExpandPath( "/spreadsheet/lib/commons-csv-1.9.0.jar" ) )[ JavaCast( "string", "RFC4180" ) ];
fileInput = CreateObject( "java", "java.io.FileReader" ).init( csvFilePath );
parser = format.parse( fileInput );
iterator = parser.iterator();
outputFilePath = "/path/to/your/output.txt";
outputFile = FileOpen( outputFilePath, "write" );
// Each record is read from disk on demand rather than the whole file being loaded up front
while( iterator.hasNext() ){
  record = iterator.next();
  for( column in record.toList() ){
    FileWriteLine( outputFile, column );
  }
}
// Close the parser (which closes the reader) and the output file when done
parser.close();
FileClose( outputFile );

If you are sending the data straight to the DB rather than loading it in memory then this approach may well work for you, although I still think your DB's bulk insert tools are likely to be more efficient than inserting one row at a time.
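For instance, the write-out loop above could become an insert per record, keeping only the columns you need. The column positions, table name and datasource are placeholders here, and batching several rows per statement would be kinder to the database than one insert per row:

// Sketch: insert selected columns from each record as it is read (placeholders throughout)
while( iterator.hasNext() ){
	record = iterator.next();
	queryExecute( "
		INSERT INTO dbo.target_table ( customer_id, order_date, amount )
		VALUES ( :customerId, :orderDate, :amount )
	",
	{
		customerId: record.get( JavaCast( "int", 0 ) ), // Commons CSV column positions are zero-based
		orderDate: record.get( JavaCast( "int", 3 ) ),
		amount: record.get( JavaCast( "int", 17 ) )
	},
	{ datasource: "myDatasource" } );
}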

Daemach commented 2 years ago

Excellent information! In my case, it would be firing off threaded database inserts.

BCP is SQL Server's bulk loader. It's swift, but it's not intelligent about text qualifiers. I tried that first.
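Using the parser iterator from your example above, the threading I have in mind is roughly this shape, chunking rows and handing each chunk to a cfthread. The table, columns, datasource and chunk size are placeholders, and in practice the number of concurrent threads would need capping:

// Rough sketch of chunked, threaded inserts (placeholders throughout)
chunk = [];
threadNames = [];
while( iterator.hasNext() ){
	record = iterator.next();
	// keep only the columns actually needed; positions are zero-based in Commons CSV
	chunk.append( [ record.get( JavaCast( "int", 0 ) ), record.get( JavaCast( "int", 3 ) ) ] );
	if( chunk.len() == 500 || !iterator.hasNext() ){
		threadName = "insertChunk" & ( threadNames.len() + 1 );
		threadNames.append( threadName );
		// custom attributes (rows) are copied into the thread's attributes scope
		thread name=threadName rows=chunk {
			for( row in attributes.rows ){
				queryExecute(
					"INSERT INTO dbo.target_table ( col_a, col_b ) VALUES ( :a, :b )",
					{ a: row[ 1 ], b: row[ 2 ] },
					{ datasource: "myDatasource" }
				);
			}
		}
		chunk = [];
	}
}
thread action="join" name=threadNames.toList(); // wait for all insert threads to finish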

JamoCA commented 2 years ago

Regarding importing lots of CSV data into MSSQL: I also use BCP and had to import a 256MB CSV file on demand. The data was primarily numeric and comma-delimited without any text qualifiers. In order to use MSSQL BCP efficiently, I used NirSoft CSVFileView via the command line (using CFExecute) to intelligently convert it from a comma-delimited file to a tab-delimited one. (This process was a lot easier than praying that ACF could process it without choking.) This step took a minute or so to perform, but it was crucial to the automated BCP import procedure that we use. Once we had a TSV file, the import only took a second or two to complete.
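For anyone following along, the bcp import step can be driven from CFML with something like this (the server, database, table and file path are placeholders, and the flags are worth checking against your SQL Server version):

// Sketch: run SQL Server's bcp utility against the converted TSV (placeholders throughout)
// Assumes bcp is on the PATH; -T uses a trusted connection, -c means character data, which defaults to tab-delimited fields
cfexecute(
	name = "bcp",
	arguments = "myDatabase.dbo.target_table in C:\data\import.tsv -S mySqlServer -T -c",
	timeout = 300,
	variable = "bcpOutput"
);
writeOutput( bcpOutput );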

bdw429s commented 1 year ago

@Daemach I just ran into some similar issues with a 600 meg file having 32 columns. A process that previously crashed with out-of-memory errors after 10+ minutes of CPU thrashing and a full heap was able to generate a raw array of arrays in about 40 seconds, with much less heap usage, after a few changes (see the related ticket below).

bdw429s commented 1 year ago

Related ticket with some test code I tried https://github.com/cfsimplicity/spreadsheet-cfml/issues/340

cfsimplicity commented 1 year ago

@Daemach Would you be able to try out the new readCsv() method that Brad has helped with? It's focused on solving the performance issues you and he have mentioned. It's currently on the develop branch. The tests show you how to use it. Thanks.

https://github.com/cfsimplicity/spreadsheet-cfml/blob/develop/test/specs/readCsv.cfm

Daemach commented 9 months ago

I'm sorry about the late reply. I did try this, and it works great. Thank you!