frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Frictionless fails to describe the table with the correct field type when the data file is big #1689

Open mingjiecn opened 1 month ago

mingjiecn commented 1 month ago

Overview

When a field contains both integers and floats, frictionless describes the field as a number type. This works well when the data file is small, but we hit an issue when the file is big. For example, we have a file of about 2 GB in which one field can hold either 0 or a float. Most rows in that field are 0 and only a few hold a float value, and when frictionless describes the table it infers an integer type instead of a number type: it never sees the float values in that column. Can this bug be fixed? Thanks!

This is the output of describe for a small data file (test.tsv) and a big data file (TSTFI46007602.tsv) with the same fields. You can see that ref_score is identified as a number type in the small file but as an integer type in the big one:

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/test.tsv
# --------
# metadata: src/tests/data/test.tsv
# --------

name: test
type: table
path: src/tests/data/test.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: number
    - name: alt_score
      type: number
    - name: relative_binding_affinity
      type: number
    - name: effect_on_binding
      type: string

(checkfiles_venv) (base) mingjie@Mingjies-MacBook-Pro checkfiles % frictionless describe  src/tests/data/TSTFI46007602.tsv
# --------
# metadata: src/tests/data/TSTFI46007602.tsv
# --------

name: tstfi46007602
type: table
path: src/tests/data/TSTFI46007602.tsv
scheme: file
format: tsv
encoding: utf-8
mediatype: text/tsv
dialect:
  csv:
    delimiter: "\t"
schema:
  fields:
    - name: '#chrom'
      type: string
    - name: start
      type: integer
    - name: end
      type: integer
    - name: spdi
      type: string
    - name: ref
      type: string
    - name: alt
      type: string
    - name: kmer_coord
      type: string
    - name: ref_score
      type: integer
    - name: alt_score
      type: integer
    - name: relative_binding_affinity
      type: integer
    - name: effect_on_binding
      type: string
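For anyone who wants to reproduce this without a 2 GB file, here is a small stdlib-only sketch that generates a TSV with the same pattern (a ref_score column that is 0 for many rows, then a float). The column names come from the schema above; the row counts and the 0.73 value are made up for illustration.

```python
import csv
import io

def make_repro_tsv(n_zero_rows=200):
    """Build a TSV where ref_score is 0 for the first n rows and
    then a float: the pattern that trips type detection here."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t")
    writer.writerow(["#chrom", "start", "end", "ref_score"])
    for i in range(n_zero_rows):
        writer.writerow(["chr1", i, i + 1, 0])
    # a single float value, far past the detection sample
    writer.writerow(["chr1", n_zero_rows, n_zero_rows + 1, 0.73])
    return buf.getvalue()

tsv = make_repro_tsv()
```

Writing this string to a file and running frictionless describe on it should show the same integer/number mismatch described above.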
pierrecamilleri commented 3 weeks ago

Thx for the report.

Diving into the code, it looks like the sample that is analysed to "guess" the type of a column is hardcoded to 100 rows here. I can reproduce with a CSV file with 1 column: 100 rows of zeros followed by a decimal value.

Can you confirm that your data starts with at least 100 lines of zeros?

Unfortunately I can't think of a workaround right now... Can I ask what your use case is? Is it for validation?

mingjiecn commented 3 weeks ago

Yes, the first several hundred rows are 0s. We use frictionless to validate big TSV files. Right now what I do is skip the type error when no schema is provided. Let me know if there is a better way. Thank you!
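As an interim alternative to skipping type errors entirely, the described schema can be patched before validation: widen the misdetected fields from integer to number. A minimal sketch operating on the plain-dict (Table Schema) form, so it needs no frictionless import; the field names are taken from this report, and widen_to_number is a hypothetical helper, not part of the frictionless API:

```python
def widen_to_number(schema: dict, field_names: set) -> dict:
    """Change 'integer' to 'number' for the named fields in a
    Table Schema dict, so float values later in the file validate."""
    for field in schema.get("fields", []):
        if field["name"] in field_names and field["type"] == "integer":
            field["type"] = "number"
    return schema

# Abbreviated schema mimicking the describe output above.
schema = {"fields": [
    {"name": "start", "type": "integer"},
    {"name": "ref_score", "type": "integer"},
]}
widen_to_number(schema, {"ref_score", "alt_score", "relative_binding_affinity"})
```

The patched schema can then be passed back to validation instead of letting it be re-inferred from the sample.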

pierrecamilleri commented 3 weeks ago

Thx for your feedback. The only way I see is to correct the output of describe inside a schema, but of course your answer shows you already thought of that.

Actually, the hardcoded SAMPLE_SIZE does not seem to be the culprit.

The following CSV already fails despite having fewer than 100 rows:

a,b
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
0,0
1.2,3.4

I tried the following command, which fails as well :

frictionless describe --sample-size=11 --field-confidence=1 test.csv

So there is something wrong here, I need to investigate further.
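For reference, a strict guess (confidence = 1) over that whole 11-row sample should resolve the columns to number, since the final "1.2" fails the integer cast. A simplified model of that logic in plain Python, not the actual frictionless implementation:

```python
def guess_type(values, confidence=1.0):
    """Pick the narrowest type matched by at least `confidence`
    of the sampled values (simplified model of field detection)."""
    def casts(value, cast):
        try:
            cast(value)
            return True
        except ValueError:
            return False

    for name, cast in (("integer", int), ("number", float)):
        hits = sum(casts(v, cast) for v in values)
        if hits / len(values) >= confidence:
            return name
    return "string"

sample = ["0"] * 10 + ["1.2"]
guess_type(sample)  # -> "number": "1.2" rules out integer at full confidence
```

With a lower confidence like 0.9, the same sample would still pass as integer (10 of 11 values cast cleanly), which is consistent with the observed behaviour; the surprise is that --field-confidence=1 does not change the outcome.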

mingjiecn commented 3 weeks ago

Please keep me updated. Thank you so much!