Closed abenton closed 5 years ago
@abenton if moving to all one file, will it be a compressed file? I don't know how big a CSV can get
On Thu, Nov 22, 2018, 7:55 AM Adrian Benton <notifications@github.com> wrote:
Timeline generation takes too long. Speed up by:
- parallelizing across input files with the multiprocessing module
- removing repeated file opens and closes, keeping a single file open
- writing user info + tweet timelines to a single file, all users together
Yes, compressed, faster to write to gzipped file.
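A sketch of the compressed-output approach, assuming a hypothetical filename and columns: open one gzipped TSV handle up front and stream all users' rows through it with the csv module.

```python
import csv
import gzip

# Open a single gzip-compressed TSV handle once, rather than
# opening and closing a file per user. Filename and columns here
# are illustrative only.
with gzip.open("user_tweets.tsv.gz", "wt", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["user_id", "tweet_id", "text"])
    writer.writerow(["42", "1001", "hello world"])
```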
@abenton were you going to look into this? If not I can.
Am working on it now; hopefully it will be done soon.
Running timeline generation script now. Sample usage:
python -m collect_users.user_info_timeline_to_dataframe_par --in_dir /exp/abenton/twitter_brand_data/ --out_dir /exp/abenton/twitter_brand/promoting_users/ --num_procs 10
Expected to finish processing user infos + tweets in 6 hours.
Tables written to:
User information -- /exp/abenton/twitter_brand/promoting_users/info/user_info_dynamic.tsv.gz
User tweets -- /exp/abenton/twitter_brand/promoting_users/timeline/user_tweets.noduplicates.tsv.gz
Some lines in user_tweets.noduplicates.tsv.gz may be difficult to parse. I recommend reading with the csv module and skipping malformed lines.