DocNow / twarc-csv

A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
MIT License

Parquet output format #29

Open igorbrigadir opened 3 years ago

igorbrigadir commented 3 years ago

Instead of writing CSVs, append the parsed DataFrames to a Parquet file: https://stackoverflow.com/a/47839247/11090908

edsu commented 3 years ago

Being able to output as parquet would be nice too--even if it's called twarc-csv :-)

igorbrigadir commented 3 years ago

Yeah, I'm actually considering a different command as an alias, just so the name makes semantic sense and the docs read well. These two would be equivalent:

twarc2 dataframe --output-format parquet input.json output.parquet

twarc2 csv --output-format parquet input.json output.parquet

But I'm not sure how useful that is. It would purely be an alias, for a docs entry and for the command line.

edsu commented 3 years ago

I was going to say that pandas has many output formats. It might not be hard to add parquet, pickle, hdf, sql, excel, json, html, feather, latex, stata, gbq, markdown, ... :-) but like you said, figuring out the api is the hard part.
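One way the format list above could map onto pandas is a small dispatch table over the `DataFrame.to_*` writer methods. This is only a sketch: `WRITERS` and `write_dataframe` are hypothetical names, the set of formats is an arbitrary subset, and the Parquet and Feather writers additionally assume pyarrow is installed.

```python
import pandas as pd

# Hypothetical mapping from an --output-format value to a pandas writer call.
WRITERS = {
    "csv": lambda df, path: df.to_csv(path, index=False),
    "json": lambda df, path: df.to_json(path, orient="records", lines=True),
    "parquet": lambda df, path: df.to_parquet(path, index=False),  # needs pyarrow
    "pickle": lambda df, path: df.to_pickle(path),
    "feather": lambda df, path: df.to_feather(path),  # needs pyarrow
}


def write_dataframe(df, path, output_format="csv"):
    """Write df to path using the writer registered for output_format."""
    try:
        writer = WRITERS[output_format]
    except KeyError:
        raise ValueError(f"unsupported output format: {output_format!r}") from None
    writer(df, path)
```

For example, `write_dataframe(df, "output.parquet", "parquet")` would write a single Parquet file. The harder API question raised in the thread (per-format options, appending in batches) is not addressed by this sketch.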

igorbrigadir commented 3 years ago

Yeah - still figuring out that part!

igorbrigadir commented 3 years ago

Still haven't figured this out, but for now you can use DataFrameConverter to get a pandas DataFrame object, which you can convert to another format yourself. I'll keep this open for implementing the actual command later.

Maybe an alias?

twarc2 dataframe input.jsonl output.parquet

or

twarc2 dataframe --output-format parquet input.jsonl output.parquet

or

twarc2 csv --output-format parquet input.jsonl output.parquet
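The first variant, with no `--output-format` flag, would have to infer the format from the output file name. A rough sketch of that resolution logic, using only the standard library; `EXTENSION_FORMATS` and `resolve_output_format` are hypothetical names, and the precedence and CSV fallback are assumptions rather than anything twarc-csv does:

```python
from pathlib import Path

# Hypothetical extension-to-format table; not twarc-csv behavior.
EXTENSION_FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".json": "json",
    ".jsonl": "json",
}


def resolve_output_format(outfile, output_format=None, default="csv"):
    """Pick the output format: explicit flag first, then file extension, then default."""
    if output_format is not None:
        return output_format
    return EXTENSION_FORMATS.get(Path(outfile).suffix.lower(), default)
```

Under these assumptions, `resolve_output_format("output.parquet")` gives `"parquet"`, while an explicit `--output-format csv` would still win over the extension.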