cfsimplicity / spreadsheet-cfml

Standalone library for working with spreadsheets and CSV in CFML
MIT License
127 stars 36 forks source link

Write csv files #221

Closed Daemach closed 3 years ago

Daemach commented 3 years ago

Can you add an option to write csv files in addition to xls/xlsx?

cfsimplicity commented 3 years ago

I'm a bit reluctant tbh. It's specifically a spreadsheet library rather than a general data conversion tool and I'm concerned about mission creep.

Having said that, I have allowed csvToQuery() which doesn't involve spreadsheet objects, and you're not the first to ask for this.

Let me think about it. Meantime here's some code that will convert a query to csv.

Daemach commented 3 years ago

I understand. I see csv as another spreadsheet format, which is why excel adds itself as the default opener for .csv when it's installed. A csv file is just columns and rows without formatting info or formulas.

Thanks for the code!

On Thu, Feb 18, 2021 at 1:31 PM Julian Halliwell notifications@github.com wrote:

I'm a bit reluctant tbh. It's specifically a spreadsheet library rather than a general data conversion tool and I'm concerned about mission creep.

Having said that, I have allowed csvToQuery() https://github.com/cfsimplicity/lucee-spreadsheet/wiki/csvToQuery which doesn't involve spreadsheet objects, and you're not the first to ask for this.

Let me think about it. Meantime here's some code that will convert a query to csv https://gist.github.com/cfsimplicity/e1b5a4f1db6b54b77163f8f03f88f346.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/cfsimplicity/lucee-spreadsheet/issues/221#issuecomment-781648198, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC7Y5YKIHPBZR5PPSQE52YTS7WBMNANCNFSM4X27HBBA .

JamoCA commented 3 years ago

CSV can be tricky. When opened using Excel, it can really mess with date & numeric values is they aren't formatted and/or escaped correctly. If any values have a carriage return, it will need to be stripped or Excel may not open it correctly. CSV is a loose spec and only simple CSV files can be read by ColdFusion and Excel.

You can have carriage returns in CSV if quoted. You can also have variable column lengths. Some quoted string column data may contain escape quotation marks. We've had some data feeds that are like this and the only solution that we've found to read them correctly was OpenCSV. I haven't used OpenCSV for writing CSV yet, but I should upgrade our JAR and experiment.

cfsimplicity commented 3 years ago

@Daemach You're right, CSV and spreadsheets are closely related and the library does support some CSV functionality already.

In fact there's a downloadCsvFromFile method I'd forgotten about which can export to CSV. This uses the existing ability toread() in a spreadsheet file as CSV.

To achieve that, there's long been a private queryToCsv() method which did a fairly unsophisticated conversion just using CFML. It got round the reserved character issues by double-quoting all values (escaping double-quotes). Obviously that means larger CSV files than necessary.

For reading CSV, the library uses Commons CSV (rather than OpenCSV), which I've found does a pretty good job. @JamoCA I didn't think of using it for writing either, but I've had a go now and it seems to be good at that too. Not only does it handle the escaping etc in a more intelligent way, it also seems a lot faster than my crude writer (I saw a 4x speed increase with 5 columns x 1 million rows).

So queryToCsv() now uses Commons CSV and I've made it public. I've also added writeToCsv( required workbook, required filepath, overwrite=false, delimiter="," ) for that specific use-case.

These are the tests for queryToCsv() I have so far. If there's anything else I should be covering, James, please let me know.

The changes are in the develop branch.

sebgmc commented 3 years ago

Sounds great! Thnx 4 maintaining this Julian! Sebastiaan

Op za 20 feb. 2021 12:02 schreef Julian Halliwell <notifications@github.com

:

@Daemach https://github.com/Daemach You're right, CSV and spreadsheets are closely related and the library does support some CSV functionality already.

In fact there's a downloadCsvFromFile https://github.com/cfsimplicity/lucee-spreadsheet/wiki/downloadCsvFromFilemethod I'd forgotten about which can export to CSV. This uses the existing ability toread() in a spreadsheet file as CSV.

To achieve that, there's long been a private queryToCsv() method which did a fairly unsophisticated conversion just using CFML. It got round the reserved character issues by double-quoting all values (escaping double-quotes). Obviously that means larger CSV files than necessary.

For reading CSV, the library uses Commons CSV (rather than OpenCSV), which I've found does a pretty good job. @JamoCA https://github.com/JamoCA I didn't think of using it for writing either, but I've had a go now and it seems to be good at that too. Not only does it handle the escaping etc in a more intelligent way, it also seems a lot faster than my crude writer (I saw a 4x speed increase with 5 columns x 1 million rows).

So queryToCsv() now uses Commons CSV and I've made it public. I've also added writeToCsv( required workbook, required filepath, overwrite=false, delimiter="," ) for that specific use-case.

These are the tests for queryToCsv() https://github.com/cfsimplicity/lucee-spreadsheet/blob/develop/test/specs/queryToCsv.cfm I have so far. If there's anything else I should be covering, James, please let me know.

The changes are in the develop branch.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cfsimplicity/lucee-spreadsheet/issues/221#issuecomment-782606942, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7XQMXI5GSPB77H4MIRL5DS76JFBANCNFSM4X27HBBA .

cfsimplicity commented 3 years ago

PS: @Daemach Please could you test the develop branch? And @sebgmc if possible?

Thanks.

EDIT: the signatures for the new methods are:

queryToCsv( required query query, boolean includeHeaderRow=false, string delimiter="," )

writeToCsv( required workbook, required string filepath, boolean overwrite=false, string delimiter="," )
JamoCA commented 3 years ago

I LIKE IT! I was up late last night updating my OpenCSV library from v2.3 to 5.3. There was a lot of breaking changes and I went to bed before I was able to get the "write" function to work.

Here's the result from my tab-delimited MLS feed file with 9 records. (smallish sample)

OpenCSV Read: 437ms csvToQuery: 6ms queryToCsv: 29ms (query that was used was generated using csvtoQuery)

I compared the files and the original file's columheader used mixed-case, but the CFC wrote it using all uppercase. Are you able to preserve the save and loop over queryName.getColumnNames() when generating the CSV data?

One of the tab-delimited rows had quotes added to the beginning/end of a single value while internal quotes within the value were escaped. I'm not sure if this is an issue or not, but is the only data difference I've encountered so far.

Id  Description
1   These "quotes" are retained.

was outputted as:

ID  DESCRIPTION
1   "These ""quotes"" are retained."

After you've added this function, I think I'm going to stop using OpenCSV in favor of solely using the LuceeSpreadsheet CFC. (I had been using both, in addition to a QuerytoCSV UDF that I wrote, to give clients an option to download either format.

cfsimplicity commented 3 years ago

Thanks James! Really pleased it seems to be working well for you.

I might have misunderstood, but I couldn't replicate the column case problem withqueryToCsv. I tried this on both Lucee and ACF2016:

q = QueryNew( "MixedCase1,MixedCase2", "VarChar,VarChar", [ [ "a", "b" ], [ "c", "d" ] ] );
WriteDump( spreadsheet.queryToCsv( q, true, "tab" ) );

I got a mixed case header row in each case:

MixedCase1 MixedCase2 a b c d 

In ACF the column names are retrieved using query.getColumnNames(), (QueryColumnArray() in Lucee).

According to the Commons CSV docs the default "QuoteMode" is MINIMAL, which...

Quotes fields which contain special characters such as a the field delimiter, quote character or any of the characters in the line separator string.

I think this approach is still valid and hopefully won't be a problem for you.

cfsimplicity commented 3 years ago

Ah, I can see the case issue now, James, which happens on ACF only when using csvToQuery(). Part of that process involves generating a new query object, but because of an ACF limitation there's a workaround which involves calling the query.setColumnNames() function which unfortunately seems to change the names to UPPER (see the private method _QueryNew()).

Will try and figure a way round that if possible.

cfsimplicity commented 3 years ago

Unfortunately I can't see a way to avoid query.setColumnNames() completely in ACF, but I have changed _QueryNew() so that it only uses it when it detects an invalid variable name in the column/header names. See #222

To be clear, this isn't a bug with queryToCsv() which will use whatever case the columns in the query you pass happen to have. You only saw it James because you created the query using csvToQuery(), which is where the problem lies

I've created an open ticket #223 in the hope that a solution emerges at some point. Meantime case will now at least be preserved in ACF as long as you make sure your column names are all valid variable names.

JamoCA commented 3 years ago

I believe that you could do this with a query of queries... at least that's what I've had to do to safely rename columns in a query without errors. Creating an ordered struct of the original values, map them to the existing query (["COLUMN1"="Column1"])and then perform a QofQ... SELECT COLUMN1 AS Column1, etc FROM OriginalQuery.

For potentially illegally named columns, use a library like str to optionally enable users to choose how to automatically "slugify" column names. This is what we do whenever generating Excel or CSV files from database or user spreadsheets. If a column header starts with a invalid character (ie, 1), perhaps add a "C" or user-defined prefix to the beginning.

cfsimplicity commented 3 years ago

Thanks for the suggestions, James, but if the aim is to preserve the column/header names as supplied in the CSV then adding a slug or whatever won't do.

I tried QoQ aliasing which works in ACF (not Lucee interestingly) with invalid variable names, except for those containing a space - which is going to be quite common, e.g First name.

Fortunately I've come across an ingenious idea which does seem to work as an alternative to query.setColumns(). You serialize the query to a JSON string, replace the column names, then deserialize it back to a query. Result: column names preserved exactly as they came in from the CSV.

Please give it another go.

cfsimplicity commented 3 years ago

PS: I tried csvToQuery() with a 100,000 row, 5 column CSV file with spaces in the column names, and there was no performance hit - round about 5 seconds for all the engines I tested: Lucee, ACF 2016, 2018 and 2021.

JamoCA commented 3 years ago

The serialization/deserialization workaround works perfectly and the query column names matched the CSV column names.

Thanks for the info regarding the Lucee QofQ aliasing issue. I'm going to have to review this as I have some third-party import processes that remap column names and this is going to be a problem if this (and other things) don't work as I've been experiencing with ACF for many years,

cfsimplicity commented 3 years ago

Yes I think Lucee uses a different engine for QoQ.

Thanks for testing the CSV stuff.

@Daemach can you confirm it's doing what you want? If so I'll sort out a new release.

JamoCA commented 3 years ago

What method were you using to perform a QofQ? I just tested this on TryCF.com and it converts upper-case column names to my desired case using Lucee, Railo and CF11-2021. (NOTE: queryExecute only works with CF11+.)

<cfscript>
Q = QueryNew("ID,NAME,TEST", "integer,varchar,varchar", [{id=1,name="John Doe",test="Hello world"}]);
writedump(var=Q.getColumnNames());
writedump(var=Q);

Q1 = queryexecute("SELECT ID AS id, NAME AS Name, TEST as Test FROM Q", {}, {dbtype="query"});
writedump(var=Q1.getColumnNames());
writedump(var=Q1);

Q2 = queryexecute("SELECT ID AS iD, NAME AS NaMe, TEST as TesT FROM Q", {}, {dbtype="query"});
writedump(var=Q2.getColumnNames());
writedump(var=Q2);
</cfscript>
cfsimplicity commented 3 years ago

Yes QoQ is fine for handling case but not column/header names that include spaces. Try using First name as an alias.

JamoCA commented 3 years ago

I tried to write a test to create an invalid column name and continually encountered the error The column name FULL NAME is invalid. in both ACF & Lucee. What's interesting is that original queries from a datasource can contain this column, but both platforms complain when you attempt to manually create a column with spaces.

QueryAddColumn(Q, "FULL NAME", "varchar");

Many years ago, we used a C++ CFX tag to import Excel XLS data. Their default approach to invalid column names was to automatically relabel them to COLUMNx (where x is the column number.) I also have a client that also adds narrative comments as extra headers. Since it's not a column that we expect, it's ignored. We've also ran into problems where duplicate column names exist and you can see both columns when you dump, but you can only access the last column's value.

cfsimplicity commented 3 years ago

Lucee will accept column names with spaces if you use QueryNew():

q = QueryNew( "First name,Last name", "VarChar,VarChar", [ [ "Frumpo", "McNugget" ], [ "Susi", "Sorglos" ] ] );
WriteDump( q.getColumnNames() );

ACF won't, hence the need for a workaround.

I'm pleased that we can now improve on that CFX tag!

cfsimplicity commented 3 years ago

PS: Lucee's QueryNew() also allows you to pass the column names as an array instead of a list, meaning they can include commas:

q = QueryNew( [ "First name", "Last, name" ], "VarChar,VarChar", [ [ "Frumpo", "McNugget" ], [ "Susi", "Sorglos" ] ] );
JamoCA commented 3 years ago

ACF won't manually create query columns with spaces? That seems like another bug... especially after 7 years it was brought up on StackExchange and Lucee/non-QofQ's don't throw errors. (I've reported it... CF-4211146.) I don't think anyone from Adobe really monitors anything outside of paid support or extremely obvious show-stopping bugs that impact all users.

The following SQL generates a column name with a space when using QueryExecute or CFQuery tag with CF2016-2021 and MSSQL Server.

SELECT TOP 1 ID, ID AS 'Column with spaces' FROM My_MSSQLDatabase_Table;

PS: Lucee's QueryNew() also allows you to pass the column names as an array instead of a list, meaning they can include commas:

While I really like that syntax, I don't have the luxury of releasing exclusive Lucee-only cfml. I'm attempting to appeal to the largest audience possible and need all CFML functions to be cross-platform compatible.

cfsimplicity commented 3 years ago

ACF won't manually create query columns with spaces?

I tried ACF2021 and still no joy

The following SQL generates a column name with a space when using QueryExecute or CFQuery...

Yes, regular SQL supports quoted aliases but QoQ doesn't in either engine.

cfsimplicity commented 3 years ago

Thanks for the info regarding the Lucee QofQ aliasing issue

@JamoCA It seems I was mistaken: Lucee QoQ does support aliases that are either invalid variable names or contain spaces. You just need to escape them with square brackets:

QueryExecute( "SELECT firstname AS [1st Name] FROM myOriginalQuery", {}, { dbtype: "query" } );

You can also reference original query names that contain spaces using brackets (which you can't in ACF):

QueryExecute( "SELECT [First name] FROM myOriginalQuery", {}, { dbtype: "query" } );
JamoCA commented 3 years ago

Thanks.... I tried that with ACF and didn't test any further because it wasn't a cross-compatible solution.

I've also submitted these valid SQL samples (for the sake of platform comparison) to the ACF bug report.

cfsimplicity commented 3 years ago

Released as v2.15.0