This is the ETL lib package. It provides an API to munge and prepare JSON, TSV and other data using Apache Tika and JSON parsing/loading for ETL via Apache OODT (or other libs) into Apache Solr.
possible bug - tsvtojson prints each unique record twice?? #25
Hi Professor,
We tried running tsvtojson.py with the -u (unique field) option set, and it prints every unique record twice in the output file. This is because each unique record is appended to the list twice (lines 178 and 180 in tsvtojson.py):

    if not jsonStruct[uniqueField] in fieldCache:
        jsonStructs.append(jsonStruct)  # line 178
        fieldCache[jsonStruct[uniqueField]] = "yes"
        jsonStructs.append(jsonStruct)  # line 180
Thanks,
Srikanth
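For reference, here is a minimal sketch of what the de-duplication presumably intends: append each record once, the first time its unique-field value is seen. This is not the actual code from tsvtojson.py. The function name `dedupe_records`, its parameters, and the sample rows are hypothetical, and a `set` stands in for the `fieldCache` dictionary used in the original.

```python
def dedupe_records(records, unique_field):
    """Keep only the first record seen for each value of unique_field.

    Hypothetical helper for illustration; not part of tsvtojson.py.
    """
    seen = set()      # plays the role of fieldCache in the report above
    unique = []       # plays the role of jsonStructs
    for record in records:
        key = record[unique_field]
        if key not in seen:
            seen.add(key)
            unique.append(record)  # appended exactly once per unique key
    return unique


# Made-up sample rows: two share the same "id" value.
rows = [
    {"id": "a", "title": "first"},
    {"id": "b", "title": "second"},
    {"id": "a", "title": "duplicate of first"},
]
deduped = dedupe_records(rows, "id")
print(len(deduped))
```

If this matches the intent, the likely fix in the script is to drop one of the two `append` calls (line 178 or line 180) so each unique record is written once.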