Describe the bug
The file contains a quoted number with thousands separators, e.g. "2,126,000,000". This throws off the index alignment between the types extracted from the header and those extracted from the data.
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 397, in run_inference
    schemas_result = prl.parallel(records = lines,obj=dtype, d_schema = self.__schema)
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in parallel
    return [p.get() for p in results]
  File "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/venv/lib/python3.10/site-packages/csv_schema_inference/csv_schema_inference.py", line 165, in <listcomp>
    return [p.get() for p in results]
To Reproduce
Steps to reproduce the behavior:
See example below...
"id","country","year","sex","age","suicides_no","population","country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"
0,"Albania",1987,"male","15-24 years",21,312900,"Albania1987",,"2,156,624,900",796,"Generation X"
1,"Albania",1987,"male","35-54 years",16,308000,"Albania1987",,"2,156,624,900",796,"Silent"
2,"Albania",1987,"female","15-24 years",14,289700,"Albania1987",,"2,156,624,900",796,"Generation X"
3,"Albania",1987,"male","75+ years",1,21800,"Albania1987",,"2,156,624,900",796,"G.I. Generation"
4,"Albania",1987,"male","25-34 years",9,274300,"Albania1987",,"2,156,624,900",796,"Boomers"
5,"Albania",1987,"female","75+ years",1,35600,"Albania1987",,"2,156,624,900",796,"G.I. Generation"
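To illustrate the misalignment, here is a minimal sketch using only the standard library. It assumes the failure comes from splitting each line on the separator without honoring quotes; I have not confirmed that this is what csv_schema_inference does internally. The variable names are illustrative only.

```python
import csv
import io

header = ('"id","country","year","sex","age","suicides_no","population",'
          '"country-year","HDI for year"," gdp_for_year","gdp_per_capita","generation"')
row = ('0,"Albania",1987,"male","15-24 years",21,312900,'
       '"Albania1987",,"2,156,624,900",796,"Generation X"')

# Naive split (assumed failure mode): commas inside the quoted number
# are treated as field separators, so the row gains extra fields.
naive_header = header.split(",")
naive_row = row.split(",")

# Quote-aware parsing keeps "2,156,624,900" as a single field.
parsed_header, parsed_row = csv.reader(io.StringIO(header + "\n" + row))

print(len(naive_header), len(naive_row))    # header and row counts differ
print(len(parsed_header), len(parsed_row))  # header and row counts match
print(parsed_row[9])                        # the gdp_for_year value, intact
```

With the naive split, the row has three more fields than the header, which would explain types being assigned to the wrong column indexes.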
See code below...
    from multiprocessing import freeze_support, Process
    from csv_schema_inference import csv_schema_inference

    def main():
        # if the inferred data type is INTEGER and there is a presence of FLOAT on the results, then the result will be FLOAT
        conditions = {"INTEGER":"FLOAT"}
        pathfile = "/home/greg/prj/sdspop/ingest/workflows/schema-on-read/suicide_data.csv"
        csv_infer = csv_schema_inference.CsvSchemaInference(portion=0.9, max_length=100, batch_size = 200000, acc = 0.8, seed=2, header=True, sep=",", conditions = conditions)
        aprox_schema = csv_infer.run_inference(pathfile)
        csv_infer.pretty(aprox_schema)

    if __name__ == '__main__':
        freeze_support()
        Process(target=main).start()
Expected behavior
Inference should have completed and produced a schema, e.g.
    0
      name      Username; Identifier;One-time password;Recovery code;First name;Last name;Department;Location
      type      STRING
      nullable  False
    ....
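A possible workaround until quoted fields are handled is to normalize the file before running inference. The sketch below is an assumption on my side, not part of the library: the helper name `strip_thousands_separators` and the regular expression are hypothetical, and it only rewrites fields that consist entirely of digits grouped in threes.

```python
import csv
import io
import re

# Matches integers written with thousands separators, e.g. 2,156,624,900.
THOUSANDS = re.compile(r"^\d{1,3}(,\d{3})+$")


def strip_thousands_separators(text: str) -> str:
    """Return CSV text with separators removed from grouped integers.

    Hypothetical helper: parses with the quote-aware csv module, drops the
    commas from fields like "2,156,624,900", and writes the rows back out.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in csv.reader(io.StringIO(text)):
        writer.writerow(
            [field.replace(",", "") if THOUSANDS.match(field) else field
             for field in record]
        )
    return out.getvalue()


sample = '0,"Albania",,"2,156,624,900",796\n'
cleaned = strip_thousands_separators(sample)
print(cleaned)
```

Note that the writer only re-quotes fields that still need it, so quoting in the cleaned file differs from the original, and a text field that merely looks like a grouped number would be rewritten as well.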
Desktop (please complete the following information):