igorbrigadir opened this issue 2 years ago
I'm interested in hearing where the need for this optimization arose. Was it a problem generating the CSV, or reading the generated CSV in another application? It sounds like the latter?
Just trying to deduplicate columns and remove mostly empty ones, so more data can fit into memory, and other tools like Great Expectations or pandas-profiling have an easier time.
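Something like this pandas sketch is the kind of cleanup I mean (the file name and the 90% cutoff are placeholders for illustration, not anything twarc-csv actually does):

```python
import pandas as pd

df = pd.read_csv("tweets.csv")  # hypothetical twarc-csv output

# Drop columns that are mostly empty (here: more than 90% NaN).
mostly_empty = df.columns[df.isna().mean() > 0.9]
df = df.drop(columns=mostly_empty)

# Drop columns whose values exactly duplicate another column
# (transpose, deduplicate rows, transpose back; simple but slow on wide data).
df = df.T.drop_duplicates().T
```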
Would being able to write to parquet help in situations like that?
Yep! Definitely, I think #29 goes hand in hand with this - I think all of these things are basically the same task for me to do lol
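For reference, converting the CSV output to Parquet is basically a one-liner with pandas (file names here are placeholders, and it needs pyarrow installed):

```python
import pandas as pd

df = pd.read_csv("tweets.csv")
df.to_parquet("tweets.parquet", engine="pyarrow", index=False)
```

Since Parquet is columnar and compressed, downstream tools can load only the columns they need, which helps with the same memory pressure the optimized CSV is meant to address.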
The current output favors preserving as much information as possible from the original JSON, but there is some duplication, and a bunch of columns can be removed as they're rarely super useful.
The new `--optimized` mode will generate CSVs that drop a bunch of columns to save space (exact list to be revised later).
These are the columns that are most often empty or duplicated: the missing data can either be inferred from the remaining columns, or re-extracted, for example the cashtags, hashtags, and mentions with twitter-text.
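As a rough illustration of that re-extraction (naive regexes standing in for twitter-text here, which implements the real tokenization rules and unicode handling):

```python
import re

def extract_entities(text: str) -> dict:
    """Simplified stand-in for twitter-text entity extraction."""
    return {
        "hashtags": re.findall(r"#(\w+)", text),
        "mentions": re.findall(r"@(\w+)", text),
        "cashtags": re.findall(r"\$([A-Za-z]{1,6})", text),
    }

print(extract_entities("Check $TWTR #api update @TwitterDev"))
# {'hashtags': ['api'], 'mentions': ['TwitterDev'], 'cashtags': ['TWTR']}
```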
Should probably fix #36 and #47 before this.