chrismattmann / etllib

This is the ETL lib package. It provides an API to munge and prepare JSON, TSV and other data using Apache Tika and JSON parsing/loading for ETL via Apache OODT (or other libs) into Apache Solr.

possible bug - tsvtojson prints each unique record twice?? #25

Closed gsrika closed 10 years ago

gsrika commented 10 years ago

Hi Professor,

We tried running tsvtojson.py with the -u (unique field) option set, and it prints every unique record twice in the output file. This is because each unique record is appended to the list twice (lines 178 and 180 in tsvtojson.py):

    if not jsonStruct[uniqueField] in fieldCache:
        jsonStructs.append(jsonStruct)  # line 178
        fieldCache[jsonStruct[uniqueField]] = "yes"
        jsonStructs.append(jsonStruct)  # line 180
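For illustration, here is a minimal standalone sketch of the behavior the unique-field check presumably intends: each record is appended exactly once, the first time its key is seen. The function name `dedupe_records` and the sample rows are hypothetical and not taken from tsvtojson.py, and a `set` stands in for the `fieldCache` dictionary.

```python
def dedupe_records(records, unique_field):
    """Keep only the first record seen for each value of unique_field."""
    seen = set()      # plays the role of fieldCache (hypothetical stand-in)
    unique = []       # plays the role of jsonStructs
    for record in records:
        key = record[unique_field]
        if key not in seen:
            seen.add(key)
            unique.append(record)  # appended once, not twice
    return unique


# Hypothetical sample data: two rows share the same "id".
rows = [
    {"id": "1", "name": "a"},
    {"id": "2", "name": "b"},
    {"id": "1", "name": "c"},
]
result = dedupe_records(rows, "id")
print(result)
```

In terms of the snippet above, this would correspond to dropping one of the two `jsonStructs.append(jsonStruct)` calls.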

Thanks Srikanth

chrismattmann commented 10 years ago

Great catch! Can you submit a Pull Request?

chrismattmann commented 10 years ago

Fixed!