cdaller / multi_anonymizer

Anonymize connected data in multiple csv or xml files
Apache License 2.0
17 stars 4 forks source link

Readme should mention delimiter #3

Open ismith opened 5 years ago

ismith commented 5 years ago

I assumed a csv tool would default to ',' as the delimiter, but this defaults to ';'. This is mentioned in the terminal help, but not the readme, and the failure mode is an IndexError, which is confusing:

Traceback (most recent call last):
  File "./csv_anonymizer.py", line 152, in <module>
    anonymize(source, target, column_index, int(ARGS.headerLines), ARGS.encoding, delimiter)
  File "./csv_anonymizer.py", line 96, in anonymize
    for row in anonymize_rows(reader, columnIndex):
  File "./csv_anonymizer.py", line 74, in anonymize_rows
    if len(row[columnIndex].strip()) > 0:
IndexError: list index out of range

(Because csv_anonymizer assumes ';', and your csv uses ',', csv_anonymizer is likely to treat a row as a single large field, and thus fail if given a columnIndex > 0.)

In addition to mentioning --delimiter in the readme, perhaps a try/except around that error, with a user-friendly warning if len(row) == 1?

ismith commented 5 years ago

(Once I figured out that I needed to specify --delimiter=,, though, this tool saved me a ton of time, so thank you!)

cdaller commented 5 years ago

Thanks for the feedback - I added some information about default values (and mentioned the default delimiter).

Glad my script was helpful for you!

cdaller commented 5 years ago

Just thought about your idea about adding a user friendly message, if the delimiter is wrong and the index given is therefore out of bounds:

What do you think?

ismith commented 5 years ago
  • Sometimes I only have one column, so this is fine and should not result in an error. Any messages written in this case would be included in my csv file when writing to stdout!

  • A check of the column index to be less than len(row) (line 74) might solve the problem of variable row length (some rows do have x columns, some not), but might be a problem in the case you describe: wrong delimiter, script only sees 1 column, but you want to replace column 2.

print("spam", file=sys.stderr) would avoid that (writing to the file)! But yes, I agree that you don't want to error just on len(row) == 1. Making this part of the IndexError handling limits it to the case where you've specified a column that is out of bounds.

That said - right now, the behavior if you specify an index that the row doesn't have is a crash (due to unhandled error). So maybe the right thing is just to do something like

try:
  [code that raises error, this is line 74 in csv_anonymizer.py]
except IndexError:
  msg = "You wanted to anonymize field {}, but row {} has only {} fields. Possibly your data is missing fields, or '{}' is the wrong delimiter?"
  raise Exception(msg.format(columnIndex, rowIndex, len(row), delimiter))

(My python is a bit rusty, I'm not 100% sure of the syntax, but this should be close.)

Note that this would require passing delimiter as a param to anonymize_rows, and you'd want to replace for row in rows: with something like for rowIndex, row in enumerate(rows): (See: https://treyhunner.com/2016/04/how-to-loop-with-indexes-in-python/#What_if_we_need_indexes?)