Science-for-Nature-and-People / soc-twitter

SNAPP - Soil Organic Carbon Twitter data

Fix Old Retweets (Complete) #38

Closed remyknox closed 5 years ago

remyknox commented 5 years ago

Started

Current is_retweet column has ~72k NAs. Need to look into is_retweet.R to see why this might be.

Old retweets have "RT @XXXX:" at the beginning and are truncated at ~120 characters, ending with "...". Look into changing all retweets to have the same body of text as the original tweet.
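A minimal sketch of how such old-style retweets could be flagged in base R. The data frame, the `text` column name, and the sample tweets are hypothetical, and the regex is only one plausible reading of the pattern described above, not the code in is_retweet.R.

```r
# Hypothetical sample data; the column name `text` is an assumption.
tweets <- data.frame(
  text = c(
    "RT @soil_fan: Healthy soils store carbon and support crops, and this goes on until it is cut...",
    "Healthy soils store carbon.",
    "RT @soil_fan: short retweet"
  ),
  stringsAsFactors = FALSE
)

# Starts with "RT @user:" (assumed username characters: letters, digits, underscore).
starts_with_rt <- grepl("^RT @[A-Za-z0-9_]+:", tweets$text)

# Ends with three literal dots, the truncation marker described above.
ends_truncated <- grepl("\\.\\.\\.$", tweets$text)

# An "old" retweet, in this sketch, is one that matches both conditions.
tweets$is_old_rt <- starts_with_rt & ends_truncated

print(tweets$is_old_rt)
```

Keeping the two conditions separate makes it easy to count retweets that carry the "RT @XXXX:" prefix but were short enough not to be truncated.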

remyknox commented 5 years ago

The current WIP script can replace the body of each retweet with the original tweet text. Still looking into (and struggling with) why there are so many NAs in the is_retweet column.
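A rough sketch of one way the replacement could work, not the actual WIP script. It assumes the original tweet can be found by prefix-matching the truncated fragment; the function name `restore_rt_text` and the sample vectors are hypothetical.

```r
# Hypothetical inputs: full original tweets and truncated old-style retweets.
originals <- c(
  "Healthy soils store carbon and support crops for everyone.",
  "Cover crops protect soil."
)
retweets <- c(
  "RT @soil_fan: Healthy soils store carbon and sup...",
  "RT @other: No matching original he..."
)

# Strip the "RT @user: " prefix and the trailing "...", then look for exactly
# one original tweet that begins with the remaining fragment.
restore_rt_text <- function(rt, originals) {
  fragment <- sub("^RT @[A-Za-z0-9_]+: ", "", rt)
  fragment <- sub("\\.\\.\\.$", "", fragment)
  hits <- originals[startsWith(originals, fragment)]
  # Leave the retweet untouched when the match is missing or ambiguous.
  if (length(hits) == 1) hits else rt
}

fixed <- vapply(retweets, restore_rt_text, character(1),
                originals = originals, USE.NAMES = FALSE)
print(fixed)
```

Falling back to the untouched retweet text when no unique match exists avoids silently assigning the wrong original.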

remyknox commented 5 years ago

The ~72k NAs in the is_retweet column come from lines 61 and 62 of is_retweet.R. Swapping these lines fixed the issue.

Because RTs that start with "RT @xxxx:" go up to 2018-06-30 UTC, it is best if the RTs are flagged by the same code used to replace the RT text. This will be documented in fix_old_retweets.R. Running this code on the master data frame warrants creating a new version of the master df (as suggested by @brunj7).

remyknox commented 5 years ago

@brunj7 requesting review of fix_old_retweets.R to confirm that it should be run on the entire data frame.

remyknox commented 5 years ago

Dates of retweets that start with "RT @xxxx:" and end with "...":

[Image: Screen Shot 2019-09-23 at 1 18 49 PM]

remyknox commented 5 years ago

v3 Data Frames Created

Version 3 data frames have been created using fix_old_retweets.R. Both RT and noRT data frames now have the following modifications:

brunj7 commented 5 years ago

Great job! Let's talk about the best strategy moving forward.