Closed abenton closed 5 years ago
@abenton if moving to all one file, will it be a compressed file? I don't know how big a CSV can get
On Thu, Nov 22, 2018, 7:55 AM Adrian Benton <notifications@github.com> wrote:
Timeline generation takes too long. Speed up by:
- parallelizing across input files with the multiprocessing module
- removing repeated file opens and closes, keeping a single file open
- writing user info + tweet timelines to a single file, all users together
Yes, compressed, faster to write to gzipped file.
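A sketch of the compressed-output approach, assuming a hypothetical filename and columns: open one gzipped TSV handle up front and stream all users' rows through it with the csv module.

```python
import csv
import gzip

# Open a single gzip-compressed TSV handle once, rather than
# opening and closing a file per user. Filename and columns here
# are illustrative only.
with gzip.open("user_tweets.tsv.gz", "wt", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["user_id", "tweet_id", "text"])
    writer.writerow(["42", "1001", "hello world"])
```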
@abenton were you going to look into this? If not I can.
Am working on it now; hopefully it will be done soon.
Running timeline generation script now. Sample usage:
python -m collect_users.user_info_timeline_to_dataframe_par --in_dir /exp/abenton/twitter_brand_data/ --out_dir /exp/abenton/twitter_brand/promoting_users/ --num_procs 10
Expected to finish processing user infos + tweets in 6 hours.
Tables written to:
User information -- /exp/abenton/twitter_brand/promoting_users/info/user_info_dynamic.tsv.gz
User tweets -- /exp/abenton/twitter_brand/promoting_users/timeline/user_tweets.noduplicates.tsv.gz
Some lines in user_tweets.noduplicates.tsv.gz may be difficult to parse. I recommend reading with the csv module and skipping malformed lines.