Wittline / csv-schema-inference

A tool to automatically infer column data types in .csv files
https://wittline.github.io/csv-schema-inference/
MIT License

Inference is overcomplicated and potentially inaccurate #26

Closed: Pansynchro-Admin closed this issue 2 years ago

Pansynchro-Admin commented 2 years ago

The inference system appears a tad overcomplicated. It looks like the workflow for this is:

  1. Open the CSV file
  2. Randomly sample some configurable percentage of the estimated number of rows of the file.
  3. Check these values in parallel, counting type occurrences and inferring types individually for each of them.
  4. Recombine the results, check which type has the heaviest representation, and return that as the answer.

Not only does this iterate the data more than once, it can lead to incorrect results. For example, let's say that we have a float data column where 70% of the values have nothing after the decimal point. This will be reported as INTEGER. If code downstream builds a strongly-typed schema based on this, it will then error out on 30% of the data.

A more accurate detection mechanism would be to have a hierarchical tree of detection precision. At the top is STRING, of course, but for example you could have INTEGER -> FLOAT -> STRING as part of the hierarchy. Keep track of the most precise inference you have so far, and if the current inference doesn't match it, merge the two on whichever type is higher up the hierarchy. This might take a bit of work to set up, but it will yield better results and would most likely run faster as well because you don't need a second iteration for analysis.
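
A minimal sketch of what that could look like, assuming a simplified INTEGER -> FLOAT -> STRING ladder (the names and the single-cell classifier here are illustrative, not the tool's actual code):

```python
# Minimal sketch of the proposed hierarchy, assuming a simplified
# INTEGER -> FLOAT -> STRING ladder; illustrative names only.
from typing import Optional

# Lower index = more precise; merging keeps whichever type is less precise.
HIERARCHY = ["INTEGER", "FLOAT", "STRING"]

def infer_value(value: str) -> str:
    """Classify a single cell on the simplified ladder."""
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        return "STRING"

def merge(current: Optional[str], observed: str) -> str:
    """Keep whichever type sits higher (less precise) on the ladder."""
    if current is None:
        return observed
    return max(current, observed, key=HIERARCHY.index)

def infer_column(values) -> str:
    inferred = None
    for value in values:
        inferred = merge(inferred, infer_value(value))
        if inferred == "STRING":  # cannot get any less precise; stop early
            break
    return inferred or "STRING"

# 70% of the values are integral, yet a single fractional value promotes the
# whole column to FLOAT instead of letting a majority vote report INTEGER.
print(infer_column(["1", "2", "3", "4", "5", "6", "7", "2.5", "3.5", "4.5"]))  # FLOAT
```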

Wittline commented 2 years ago

The inference system appears a tad overcomplicated. It looks like the workflow for this is:

  1. Open the CSV file
  2. Randomly sample some configurable percentage of the estimated number of rows of the file.
  3. Check these chunks in parallel, counting type occurrences and inferring types individually on each chunk.
  4. Merge the results, check which type has the heaviest representation, and return that as the answer.

Please check the original workflow in the README file.

  1. The merge step iterates only over the number of columns; it works on a small, summarized structure, not the original dataset, so it is a very fast process. There is a way to avoid this step entirely by using a data structure shared between the processors, but I have not been able to achieve that behavior yet (rough sketch of the merge below).
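
To illustrate why the merge is cheap, here is a rough sketch assuming each worker process returns a per-column Counter of observed types for its chunk (merge_chunk_results is an illustrative name, not the repository's API):

```python
# Rough sketch: combining per-chunk {column: Counter(type -> count)} summaries.
# The merge loops over columns and type labels only, never over the original
# rows, so its cost is negligible next to the inference pass itself.
from collections import Counter

def merge_chunk_results(chunk_results):
    merged = {}
    for chunk_summary in chunk_results:
        for column, counts in chunk_summary.items():
            merged.setdefault(column, Counter()).update(counts)
    return merged

# Two hypothetical chunk summaries for a single "price" column.
chunk_a = {"price": Counter({"INTEGER": 700, "FLOAT": 300})}
chunk_b = {"price": Counter({"INTEGER": 650, "FLOAT": 350})}
print(merge_chunk_results([chunk_a, chunk_b]))
# {'price': Counter({'INTEGER': 1350, 'FLOAT': 650})}
```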

Not only does this iterate the data more than once, it can lead to incorrect results. For example, let's say that we have a float data column where 70% of the values have nothing after the decimal point. This will be reported as INTEGER. If code downstream builds a strongly-typed schema based on this, it will then error out on 30% of the data.

  1. The process iterates over the dataset only once, and it uses parallelism: there is a single function that infers data types from a list of records, and it is run N times in parallel, where N is the number of processors available at that time.

  2. The process is quite fast, which is a sign that the dataset is not iterated more than once; I am also caching results to avoid re-evaluating values in a column that have already been processed.

  3. I agree with this. There need to be conditions, for example: FLOAT should outweigh INTEGER whenever the FLOAT share is greater than 0%. That way your entire next paragraph is taken into account as well (see the sketch below) <- working on this
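
A rough sketch of that condition, assuming merged per-column type counts as input (resolve_type is an illustrative name, and the fallback to the heaviest representation is an assumption, not the repository's exact logic):

```python
# Sketch of the condition in point 3: FLOAT outweighs INTEGER as soon as any
# FLOAT values are present, regardless of which of the two is more frequent.
from collections import Counter

def resolve_type(counts: Counter) -> str:
    total = sum(counts.values())
    numeric = counts.get("FLOAT", 0) + counts.get("INTEGER", 0)
    if counts.get("FLOAT", 0) > 0 and numeric == total:
        return "FLOAT"
    # Assumed fallback: heaviest representation for everything else.
    return counts.most_common(1)[0][0]

print(resolve_type(Counter({"INTEGER": 700, "FLOAT": 300})))  # FLOAT, not INTEGER
print(resolve_type(Counter({"STRING": 900, "INTEGER": 100})))  # STRING
```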

A more accurate detection mechanism would be to have a hierarchical tree of detection precision. At the top is STRING, of course, but for example you could have INTEGER -> FLOAT -> STRING as part of the hierarchy. Keep track of the most precise inference you have so far, and if the current inference doesn't match it, merge the two on whichever type is higher up the hierarchy. This might take a bit of work to set up, but it will yield better results and would most likely run faster as well because you don't need a second iteration for analysis.

  1. Already explained above.
Pansynchro-Admin commented 2 years ago

To clarify, while the dataset per se is not iterated a second time, it looks like the set of unique values in a column is. In the worst case, where every row has a unique value, the two are one and the same.

Wittline commented 2 years ago

To clarify, while the dataset per se is not iterated a second time, it looks like the set of unique values in a column is. In the worst case, where every row has a unique value, the two are one and the same.

Yes, you are right. Real-world datasets usually come with biases in their values: data describing real phenomena rarely has a uniform distribution of symbols, and it is even less likely to have many columns made up entirely of unique values. In that worst case the parallelism perhaps softens the runtime a bit, even without any advantage from caching.
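
A rough illustration of why caching on distinct values pays off on typical columns, assuming the per-value check is memoized (infer_value here is a simplified stand-in, not the repository's function):

```python
# On columns with many repeated values, memoizing the per-value check means
# the actual type test runs once per distinct value, not once per row.
from functools import lru_cache

@lru_cache(maxsize=None)
def infer_value(value: str) -> str:
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        return "STRING"

# 300,000 rows but only 3 distinct values: the type checks run 3 times.
# In the worst case (every value unique) the cache gives no benefit,
# as noted above, and every row is checked.
column = ["10", "20", "10.5"] * 100_000
types = [infer_value(v) for v in column]
print(infer_value.cache_info())  # hits=299997, misses=3
```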

Wittline commented 2 years ago

The inference system appears a tad overcomplicated. It looks like the workflow for this is:

  1. Open the CSV file
  2. Randomly sample some configurable percentage of the estimated number of rows of the file.
  3. Check these values in parallel, counting type occurrences and inferring types individually for each of them.
  4. Recombine the results, check which type has the heaviest representation, and return that as the answer.

Not only does this iterate the data more than once, it can lead to incorrect results. For example, let's say that we have a float data column where 70% of the values have nothing after the decimal point. This will be reported as INTEGER. If code downstream builds a strongly-typed schema based on this, it will then error out on 30% of the data.

A more accurate detection mechanism would be to have a hierarchical tree of detection precision. At the top is STRING, of course, but for example you could have INTEGER -> FLOAT -> STRING as part of the hierarchy. Keep track of the most precise inference you have so far, and if the current inference doesn't match it, merge the two on whichever type is higher up the hierarchy. This might take a bit of work to set up, but it will yield better results and would most likely run faster as well because you don't need a second iteration for analysis.

I separated the shuffling process from the inference process, and the dataset is now iterated only once, without loading it completely into memory.
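
For reference, a minimal sketch of one way to do that kind of single-pass sampling, assuming a sampling rate in (0, 1] applied row by row while streaming the file; this is an illustration of the idea, not necessarily the exact implementation:

```python
# Single-pass sampling sketch: each row is kept with probability `rate`
# while streaming the file, so the full dataset is never held in memory.
import csv
import random

def sample_rows(path: str, rate: float, seed: int = 0):
    rng = random.Random(seed)
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        sampled = [row for row in reader if rng.random() < rate]
    return header, sampled

# header, rows = sample_rows("data.csv", rate=0.1)  # hypothetical file and rate
```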