cjhowald / ynab-csv

Tool for making your CSV files ready to import into YNAB
MIT License

Credit card CSV extra headers not auto stripped #9

Closed: cjhowald closed this issue 7 years ago

cjhowald commented 7 years ago

Example headers:

```
Credit card transactions;;Customer number: xxxx
MasterCard xxxxx
Outstanding/current transactions till: 01/08/2017
Billing date: 01/25/2017
Voucher date;Date of receipt;Reason for payment;Foreign currency;Amount;Exchange rate;Amount;Currency
```

I expect all but the last row (the real header) to be removed automatically on import.

ohthehugemanatee commented 7 years ago

Really? How will you figure out which row is the real header?

I just posted an issue in the parent project (https://github.com/halloffame/ynab-csv/issues/23) asking for a variable where we can select the number of header rows. That allows for files with 0 (just show the first row as example data so the user can choose columns) and files with several. If the user selects n header rows, we know row n-1 (zero-indexed) is the real header.
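For illustration, a minimal sketch of what that could look like (hypothetical function and parameter names, not code from either project):

```javascript
const Papa = require('papaparse');

// Hypothetical sketch: skip a user-selected number of extra header
// rows, then parse with the next remaining row as the real header.
function parseWithHeaderOffset(csvText, extraHeaderRows) {
  const lines = csvText.split(/\r?\n/).slice(extraHeaderRows);
  return Papa.parse(lines.join('\n'), { header: true });
}
```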

Automatic detection would be awesome, but I can't imagine how to do it without making big assumptions.

I wanted to submit a patch, but the project doesn't build on Linux (fsevents reports an unsupported platform, apparently). Thanks for all your work!

cjhowald commented 7 years ago

@ohthehugemanatee I made one assumption: the CSV will not parse properly if an incorrect header row is used. The trick is then defining what "properly" means and extracting that information from the parse result. It turns out Papaparse does a nice job of recording the errors it encounters while parsing. For our purposes, I'm defining parse failure as the first row of data throwing a FieldMismatch error (too many or too few values). This could be too strict for some, but I've found DB always sets the right number of values for data rows. Extraneous header rows, on the other hand, have far fewer "fields", so parsing will definitely fail by this criterion. Then it's just a matter of detecting the failure, stripping off the first row (the bad header), and trying again. This is how I fixed #3.
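In case it helps, here's a minimal sketch of that strip-and-retry loop (an illustration, not my exact code; it relies only on Papaparse's documented result shape, where each entry in result.errors carries a type and a row index):

```javascript
const Papa = require('papaparse');

// Parse repeatedly: if the first data row throws a FieldMismatch
// error, the current first line is presumed to be a bad header,
// so strip it and try again.
function stripExtraHeaders(csvText) {
  let lines = csvText.split(/\r?\n/);
  while (lines.length > 1) {
    const result = Papa.parse(lines.join('\n'), { header: true });
    const firstRowMismatch = result.errors.some(
      (e) => e.type === 'FieldMismatch' && e.row === 0
    );
    if (!firstRowMismatch) return result; // header row parsed cleanly
    lines = lines.slice(1); // drop the bad header row and retry
  }
  return Papa.parse(lines.join('\n'), { header: true });
}
```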

For this issue specifically, it appears my failure criteria are actually not strict enough. If 2 consecutive "extra" header rows have the same number of "fields", the parse registers as a success and continues. Rows 2 and 3 in my sample data each have just one field (no delimiters), which is exactly what triggers this. I've now solved it by also imposing a rule that there must be at least 3 fields (date, amount, payee). There are still cases where a false positive is possible (e.g. 2 "extra" header rows with 4 fields each), but that is sufficiently unlikely for me and doesn't appear in any of my imports.
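Continuing the sketch from my previous comment, the extra rule could look like this (again illustrative; the field names come from Papaparse's result.meta.fields when header: true is set):

```javascript
// A parse now only counts as a success if the candidate header
// yields at least 3 fields (date, amount, payee) and the first
// data row has no FieldMismatch error.
function parsedSuccessfully(result) {
  const enoughFields =
    Array.isArray(result.meta.fields) && result.meta.fields.length >= 3;
  const firstRowMismatch = result.errors.some(
    (e) => e.type === 'FieldMismatch' && e.row === 0
  );
  return enoughFields && !firstRowMismatch;
}
```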

Perhaps this approach feels too fragile to you (or to the parent maintainer), and I agree a user-specified header count is generally more robust. However, it works well for me and saves the user an extra step.

I hope this answers your questions.