DocNow / twarc-csv

A plugin for twarc2 for converting tweet JSON into DataFrames and exporting to CSV.
MIT License
31 stars 10 forks

Document working with larger datasets #41

Closed: igorbrigadir closed 1 year ago

igorbrigadir commented 2 years ago

A recurring problem is having to work with "medium data": datasets of several GB that are definitely too big for standard approaches and challenging to work with on a single machine, but may not warrant a distributed system.

For example: https://twittercommunity.com/t/saving-tweet-to-csv/153357/41?u=igorbrigadir and other cases.

Need to add more documentation / examples of working with these dataset sizes effectively.
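As a rough illustration of the kind of example this could cover, here is a minimal sketch that streams a CSV one row at a time using only the Python standard library, so memory use stays flat regardless of file size. The column names (`id`, `lang`, `text`) and the helper `count_by_column` are hypothetical, and the in-memory sample stands in for a real file opened with `open(path, newline="")`.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Count the values of one column while reading rows one at a time."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for a multi-GB CSV file (hypothetical columns).
sample = io.StringIO(
    "id,lang,text\n"
    "1,en,hello\n"
    "2,fr,bonjour\n"
    "3,en,hi again\n"
)

lang_counts = count_by_column(sample, "lang")
print(lang_counts.most_common())
```

Only the running counts are held in memory, never the whole file, which is often enough for simple aggregate questions.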

edsu commented 2 years ago

I think a tutorial on data management practices, and things to avoid, would be useful.

igorbrigadir commented 1 year ago

I gave a tutorial over Zoom on this, and there's a rough notebook here that just deals with pandas dtypes, but I'd like to expand it and include much more: https://github.com/Analytics-for-a-Better-World/ABW-Academy/tree/main/Practitioners%20Course/Cohort1_2022/Specializations/text-mining/session-03-managing-medium-size-dataset-without-making-your-brain-melt
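To give a flavour of the dtypes idea, here is a small sketch (not taken from the notebook) that passes explicit dtypes to `pandas.read_csv` and processes the data in chunks. The column names and the chosen dtypes are assumptions for illustration; the point is that IDs read as strings avoid precision problems, and low-cardinality columns stored as `category` take far less memory than plain objects.

```python
import io

import pandas as pd

# In-memory stand-in for a large CSV file (hypothetical columns and values).
csv_text = (
    "id,author_id,lang,retweet_count\n"
    "1501234567890123456,12,en,3\n"
    "1501234567890123457,12,fr,0\n"
    "1501234567890123458,34,en,7\n"
)

# Assumed dtypes: keep IDs as strings, use category for repeated values,
# and a smaller integer type for counts.
dtypes = {
    "id": "string",
    "author_id": "string",
    "lang": "category",
    "retweet_count": "int32",
}

# Read two rows at a time and keep only a running total per language.
totals = {}
reader = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2)
for chunk in reader:
    per_lang = chunk.groupby("lang", observed=True)["retweet_count"].sum()
    for lang, n in per_lang.items():
        totals[lang] = totals.get(lang, 0) + int(n)

print(totals)
```

With a real file, the `io.StringIO(...)` argument would be replaced by a path and `chunksize` by something much larger, such as 100,000 rows.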

igorbrigadir commented 1 year ago

This will be part of #55