Closed remyknox closed 5 years ago
Current WIP script can replace the body of each retweet with the original tweet text. Still looking into (and struggling with) why there are so many NAs in the is_retweet column.
The issue with ~72k NAs in the is_retweet column comes from lines 61 and 62 of is_retweet.R. Swapping these lines fixed the issue.
Because RTs that start with "RT @xxxx:" go up to 2018-06-30 UTC, it is best if the RTs are flagged by the same code used to replace the RT text. This will be documented in fix_old_retweets.R. Running this code on the master data frame warrants creating a new version of the master df (as suggested by @brunj7).
@brunj7 requesting review of fix_old_retweets.R to confirm it should be run on the entire data frame.
Dates where retweets that started with "RT @xxxx:" and ended with "..."
Version 3 data frames have been created using fix_old_retweets.R. Both RT and noRT data frames now have the following modifications:
Great job! Let's talk about the best strategy moving forward.
Started
Current is_retweet column has ~72k NAs. Need to look into is_retweet.R to see why this might be.
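A quick way to size up the NA problem before digging into is_retweet.R is to count and isolate the affected rows. This is a minimal sketch on toy data; the data frame `tweets` and the `status_id` column are assumptions for illustration, not the project's actual schema.

```r
# Toy stand-in for the master data frame (hypothetical columns and values).
tweets <- data.frame(
  status_id  = c("1", "2", "3", "4"),
  is_retweet = c(TRUE, NA, FALSE, NA),
  stringsAsFactors = FALSE
)

# How many rows have a missing retweet flag?
na_count <- sum(is.na(tweets$is_retweet))

# Full breakdown of TRUE / FALSE / NA in one table.
value_counts <- table(tweets$is_retweet, useNA = "ifany")

# Pull out the NA rows so they can be inspected for a common pattern.
na_rows <- tweets[is.na(tweets$is_retweet), ]

print(na_count)
print(value_counts)
```

Looking at what the rows in `na_rows` have in common (date range, text prefix, source file) may help narrow down which step of the script leaves the flag unset.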
Old retweets have "RT @XXXX:" at the beginning of the retweet and are truncated at ~120 characters, ending with "...". Look into changing all retweets to have the same body of text as the original tweet.
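The old-retweet format described above can be detected with a regular expression, and the same pass can flag the retweet and swap in the original text. The following is only a sketch under stated assumptions, not the contents of fix_old_retweets.R: the columns `text` and `original_text` and the handle pattern are hypothetical, and it assumes the original tweet text has already been looked up for each retweet.

```r
# Toy data; column names and values are hypothetical.
tweets <- data.frame(
  text = c(
    "RT @xxxx: This is a long tweet that was cut off somewhere in the mid...",
    "An ordinary tweet that is not a retweet",
    "RT @yyyy: Short retweet, not truncated"
  ),
  original_text = c(
    "This is a long tweet that was cut off somewhere in the middle of it",
    NA,
    "Short retweet, not truncated"
  ),
  stringsAsFactors = FALSE
)

# Old-style retweets start with "RT @<handle>:" (assumed handle characters).
old_rt_pattern <- "^RT @[A-Za-z0-9_]+:"
tweets$is_retweet <- grepl(old_rt_pattern, tweets$text)

# Truncated retweets additionally end with a literal "...".
tweets$is_truncated <- tweets$is_retweet & grepl("\\.\\.\\.$", tweets$text)

# Replace the retweet body with the original text wherever one is available.
has_original <- tweets$is_retweet & !is.na(tweets$original_text)
tweets$text[has_original] <- tweets$original_text[has_original]
```

Because `grepl()` returns FALSE rather than NA for missing text, flagging retweets in the same step as the text replacement should not introduce new NAs into the flag column.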