Closed igorbrigadir closed 1 year ago
I think a tutorial on data management practices, and things to avoid, would be useful.
I gave a tutorial on this over Zoom, and there's a rough notebook here that only deals with pandas dtypes, but I'd like to expand it to cover much more: https://github.com/Analytics-for-a-Better-World/ABW-Academy/tree/main/Practitioners%20Course/Cohort1_2022/Specializations/text-mining/session-03-managing-medium-size-dataset-without-making-your-brain-melt
This will be part of #55
A recurring problem is having to work with "medium data": datasets of several GB that are too big for standard in-memory approaches and challenging to handle on a single machine, but that may not warrant a distributed system.
e.g. https://twittercommunity.com/t/saving-tweet-to-csv/153357/41?u=igorbrigadir and other cases.
We need to add more documentation and examples of working effectively with datasets of this size.
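As a starting point for those examples, here is a minimal sketch of one way to handle a file that does not fit comfortably in memory: stream it row by row and keep only a running aggregate. It uses only the Python standard library. The function name `count_by_column`, the column names, and the tiny in-memory sample are hypothetical and stand in for a real multi-GB CSV.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Count the values of one CSV column while reading one row at a time.

    `lines` is any iterable of text lines (an open file, for instance), so
    memory use depends on the number of distinct values, not on file size.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Hypothetical sample data standing in for a large file. With a real file,
# pass open(path, newline="", encoding="utf-8") instead of the StringIO.
sample = io.StringIO("id,lang,text\n1,en,hello\n2,fr,bonjour\n3,en,hi\n")
lang_counts = count_by_column(sample, "lang")
print(lang_counts.most_common())  # [('en', 2), ('fr', 1)]
```

The same idea carries over to pandas, where `read_csv` accepts a `chunksize` argument to iterate over a file in pieces and a `dtype` argument to set column types up front.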