egh / ledger-autosync

Synchronize your ledger-cli files with your bank.
GNU General Public License v3.0
273 stars 42 forks source link

Doesn't strip the BOM from paypal csv downloads #78

Open hollyhudson opened 5 years ago

hollyhudson commented 5 years ago

I'm seeing this in both the pip-downloaded version and when I install from a clone of the github repo. Here's what the error looks like:

geode❀ ~/Nextcloud/ledger/personal/unprocessed🐟
ledger-autosync \
        -a 'Assets:Liquid:PayPal' \
        -l main.ledger \
        /Users/me/Nextcloud/ledger/personal/unprocessed/bom-test.csv \
Traceback (most recent call last):
  File "/usr/local/bin/ledger-autosync", line 11, in <module>
    load_entry_point('ledger-autosync==1.0.1', 'console_scripts', 'ledger-autosync')()
  File "/usr/local/lib/python3.7/site-packages/ledger_autosync-1.0.1-py3.7.egg/ledgerautosync/cli.py", line 353, in run
  File "/usr/local/lib/python3.7/site-packages/ledger_autosync-1.0.1-py3.7.egg/ledgerautosync/cli.py", line 187, in import_csv
  File "/usr/local/lib/python3.7/site-packages/ledger_autosync-1.0.1-py3.7.egg/ledgerautosync/sync.py", line 203, in parse_file
  File "/usr/local/lib/python3.7/site-packages/ledger_autosync-1.0.1-py3.7.egg/ledgerautosync/converter.py", line 524, in make_converter
Exception: Cannot determine CSV type
geode❀ ~/Nextcloud/ledger/personal/unprocessed🐟

And here's what a friend added to my Makefile to fix it:

BOM := $(shell bash -c "echo -e '\xEF\xBB\xBF'")
$(INPUTDIR)/%-nobom.csv: $(INPUTDIR)/%.csv
    sed '1s/^$(BOM)//' < $< \
    | grep -v '"Order",".*",".*","","","0.00"' \
    > $@

(Sorry, that's a bit messy, the extra stuff is in there for completion, so there's context if anyone wants to duplicate it.) The sed statement is the important part -- it searches for the BOM on line one of the file and replaces it with nothing. INPUTDIR is the path to my downloaded, unprocessed files, and we're making a cleaned up paypal-nobom.csv file that is dependent on the downloaded paypal.csv file. The grep line is to deal with another problem where ledger-autosync would choke whenever it encountered one of the numerous lines with no value in the amount field (why, paypal??), so we've added that as part of the BOM-removal process.

egh commented 5 years ago

Thanks for this! It's supposed to strip the BOM, see https://github.com/egh/ledger-autosync/blob/master/ledgerautosync/sync.py#L186. Maybe the BOM is appearing somewhere besides the absolute beginning of the file, or perhaps my code is wrong? Can you send me a 2 line paypal.csv with the data anonymized so I can replicate? Thanks.

colindean commented 4 years ago

For a one-off or alternative method, running dos2unix on the PayPal export suffices to remove the BOM.

torfjelde commented 8 months ago

A bit late to the party, but this breaks on Python 3.10 because csv.DictReader expects f to be opened as text while the f.read(3) == codecs.BOM_UTF8 comparison only works if the file is opened as bytes. This means that the BOM is not stripped, while csv.reader (which is used in csv.DictReader) seems to read the f in bytes form (it's implemented in C, so haven't looked at the code of this), which then causes the BOM to be included in the fieldnames :confused: