Open · mingjiecn opened 1 month ago
Thx for the report.
Diving into the code, it looks like the sample that is analysed to "guess" the type of a column is hardcoded to 100 rows here. I can reproduce with a CSV file with 1 column: 100 rows of zeros followed by a decimal value.
Can you confirm that your data starts with at least 100 lines of zeros?
Unfortunately I can't think of a workaround right now... Can I ask what your use case is? Is it for validation?
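To make the failure mode concrete, here is a minimal, stdlib-only sketch of sample-window type guessing. This is not frictionless's actual code (the names `guess_type` and `SAMPLE_SIZE` are hypothetical); it only illustrates how a value that sits past the sample window can never influence the guessed type:

```python
# Hypothetical sketch, NOT the frictionless implementation.
SAMPLE_SIZE = 100  # mirrors the hardcoded sample size discussed above


def guess_type(values):
    """Guess a column type ('integer', 'number', or 'string') from string cells."""
    guessed = "integer"
    for v in values:
        try:
            int(v)  # still an integer candidate
        except ValueError:
            try:
                float(v)  # widen to number on the first decimal
                guessed = "number"
            except ValueError:
                return "string"  # neither parses: give up to string
    return guessed


# 100 zeros followed by one decimal: the float sits outside the sample window.
rows = ["0"] * 100 + ["1.2"]
print(guess_type(rows[:SAMPLE_SIZE]))  # prints "integer": only the sample is inspected
print(guess_type(rows))                # prints "number": the full data would widen the type
```

Under this assumption, increasing the sample size (or inspecting the whole file) would fix the guess, which is why the hardcoded 100 looked like the culprit at first.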
Yes, the first several hundred rows are 0s. We use frictionless to validate big TSV files. Right now what I do is skip the type error when no schema is provided. Let me know if there is a better way. Thank you!
Thx for your feedback. The only way I see is correcting the output of describe inside a schema, but of course your answer shows you already thought of that.
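For reference, correcting the guessed type by hand in a Table Schema would look something like the fragment below (the field name `ref_score` is taken from the report; the surrounding fields are omitted):

```json
{
  "fields": [
    {
      "name": "ref_score",
      "type": "number"
    }
  ]
}
```

Saving this and passing it as the schema during validation overrides whatever type describe guessed, at the cost of maintaining the schema by hand.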
Actually, the hardcoded SAMPLE_SIZE does not seem to be the culprit.
The following CSV already fails despite having fewer than 100 rows:
a,b
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
1.2,3.4
I tried the following command, which fails as well :
frictionless describe --sample-size=11 --field-confidence=1 test.csv
So there is something wrong here; I need to investigate further.
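A plausible explanation for the small-file failure is a confidence vote rather than the sample window. Here is a toy, stdlib-only illustration (again hypothetical, not the actual frictionless detector) of how a confidence threshold below 1 lets the integer guess win even when a float is inside the sample:

```python
# Hypothetical sketch, NOT the frictionless implementation.
def vote_type(values, confidence=0.9):
    """Pick the narrowest type whose parse success rate meets the threshold."""
    def success_rate(parse):
        ok = 0
        for v in values:
            try:
                parse(v)
                ok += 1
            except ValueError:
                pass
        return ok / len(values)

    # Narrowest candidate first: integer beats number if it clears the bar.
    for name, parse in [("integer", int), ("number", float)]:
        if success_rate(parse) >= confidence:
            return name
    return "string"


cells = ["0"] * 10 + ["1.2"]
print(vote_type(cells))                  # prints "integer": 10/11 ≈ 0.91 clears 0.9
print(vote_type(cells, confidence=1.0))  # prints "number": integer no longer clears the bar
```

If the real detector works this way, the 11-row example above would misclassify at the default confidence but not at confidence 1, so the reported failure with `--field-confidence=1` does point at a genuine bug.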
Please keep me updated. Thank you so much!
Overview
When a field contains both integers and floats, frictionless describes the field as a number type. This works well when the data file is small, but we have an issue when the data file is big. For example, we have a data file of about 2 GB in which one field can hold either 0 or a float. Most rows in this field have the value 0; only a few have a float value. When frictionless describes the table, it types this field as integer instead of number, failing to see the float values. Can this bug be fixed? Thanks!
This is the output of describe for a small data file (test.tsv) and a big data file (TSTFI46007602.tsv) with the same fields. You can see ref_score identified as a number type in the small file but as an integer type in the big file: