Wittline / csv-schema-inference

A tool to automatically infer columns data types in .csv files
https://wittline.github.io/csv-schema-inference/
MIT License
33 stars 4 forks

Files w/ quoted values that contain commas throw an exception #38

Open greghall76 opened 11 months ago

greghall76 commented 11 months ago

Describe the bug: The file contains a quoted number, "2,126,000,000". This throws off the index alignment between the types extracted from the headers and from the data....

  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 397, in run_inference
    schemas_result = prl.parallel(records = lines,obj=dtype, d_schema = self.__schema)
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in parallel
    return [p.get() for p in results]
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in <listcomp>
    return [p.get() for p in results]
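To illustrate the suspected misalignment, here is a minimal sketch using one data row from this report. It assumes the failure comes from splitting each line on the separator without honoring quotes; that is a guess and has not been checked against the library source. The names `naive_fields` and `parsed_fields` are only for illustration.

```python
import csv
import io

# One data row from the report; the quoted gdp value contains commas.
row = '0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"'

# Assumed failure mode: a plain split on the separator ignores quoting,
# so the quoted number is broken into several pieces.
naive_fields = row.split(",")

# The standard-library csv reader honors the quotes and keeps the number whole.
parsed_fields = next(csv.reader(io.StringIO(row)))

print(len(naive_fields))   # 15 pieces, but the header has only 12 columns
print(len(parsed_fields))  # 12 fields, matching the header
print(parsed_fields[9])    # 2,156,624,900
```

If the library does split this way, a row yields more values than there are header columns, which would be consistent with the index misalignment described above.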

To Reproduce Steps to reproduce the behavior:

  1. See example below...

    "id","country","year","sex","age","suicides_no","population","country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"
    0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"
    1,"Albania",1987,"male","35-54 years",16,308000,"Albania1987",,"2,156,624,900",796,"Silent"
    2,"Albania",1987,"female","15-24 years",14,289700,"Albania1987",,"2,156,624,900",796,"Generation X"
    3,"Albania",1987,"male","75+ years",1,21800,"Albania1987",,"2,156,624,900",796,"G.I. Generation"
    4,"Albania",1987,"male","25-34 years",9,274300,"Albania1987",,"2,156,624,900",796,"Boomers"
    5,"Albania",1987,"female","75+ years",1,35600,"Albania1987",,"2,156,624,900",796,"G.I. Generation"

  2. See code below...

    from multiprocessing import freeze_support, Process
    from csv_schema_inference import csv_schema_inference

    def main():
        # If the inferred data type is INTEGER and FLOAT is also present in the results, the result will be FLOAT.
        conditions = {"INTEGER":"FLOAT"}
        pathfile = "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/suicide_data.csv"

        csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size=200000, acc=0.8, seed=2, header=True, sep=",", conditions=conditions)
        aprox_schema = csv_infer.run_inference(pathfile)
        csv_infer.pretty(aprox_schema)

    if __name__ == '__main__':
        freeze_support()
        Process(target=main).start()

Expected behavior: It should have produced some kind of schema inference, e.g.

    0
        name Username; Identifier;One-time password;Recovery code;First name;Last name;Department;Location
        type STRING
        nullable False
    ....
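Until quoted fields are handled, one possible workaround is to preprocess the file so that grouped numbers no longer contain the separator. The sketch below uses only the standard library. The helper `strip_thousands_separators` and the `THOUSANDS` pattern are hypothetical names introduced here, not part of csv-schema-inference. It only covers comma-grouped integers; other quoted fields that contain commas are left unchanged and are still written with quotes.

```python
import csv
import io
import re

# Hypothetical pattern: digit groups separated by commas, e.g. 2,156,624,900.
THOUSANDS = re.compile(r"^\d{1,3}(,\d{3})+$")

def strip_thousands_separators(src, dst):
    """Copy CSV rows from src to dst, removing commas inside grouped numbers."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow(
            [field.replace(",", "") if THOUSANDS.match(field) else field for field in row]
        )

# Small in-memory sample shaped like the file in this report.
src = io.StringIO('id,"country"," gdp_for_year"\n0,"Albania","2,156,624,900"\n')
dst = io.StringIO()
strip_thousands_separators(src, dst)
print(dst.getvalue())
```

For a real file, `src` and `dst` would be file objects opened with `newline=""`, and the cleaned file would be the one passed to `run_inference`.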

Desktop (please complete the following information):