Develop script to automatically update master data file

swood-ecology commented 5 years ago

We are regularly adding new data files based on Twitter searches run every ~10 days or so. The master file quickly gets out of date. Could use a script that auto-updates the master file whenever beginning to work with the data.

grantnolasco commented 5 years ago

I have an idea on how to do this. First, I update the code that creates the recent dataset to include the master dataset, which then I can just merge the two datasets together. Second, we can use the package called cronR and then use its add-in to run the script every week or so. I don't think this process would be difficult. However, I am experiencing problems with creating the token as I am getting this error:

"createTcpServer: address already in use Error in httpuv::startServer(use$host, use$port, list(call = listen)) : Failed to create server"

I am thinking it's because I am using the Aurora server as the code seems to work locally. Therefore, I plan on using a similar solution to this issue: https://github.com/mkearney/rtweet/issues/156. However, I would need to put the token created locally to the server, which I do not know how to do.

grantnolasco commented 5 years ago

I finished creating a local token and importing it to the server, which fixed the problem. Once I was done with that, I ran the twitter.R code to acquire the data from Twitter API and began to replicate the preprocessing steps for API dataset from the raw_data_processing.R script. Once preprocessing steps were done, I merged the master dataset with the new data to a practice master file (I didn't want to overwrite the master dataset in case there's any problems with the way I'm merging the datasets together so I created a dummy file). I have the code and dummy files in the Aurora server, but I'm having problems pushing the code to the Github link: https://github.com/Science-for-Nature-and-People/soc-twitter. I plan on talking to Julien about my problem on Wednesday! My next plan is to figure out how to push the code to Github and having the code to automate the script.

Also, the most recent data on Aurora pulled from API was March of last year. I'm curious if there are any more recent files from the API as this would create a pretty significant skip in time if we were to make any time series plots.

swood-ecology commented 5 years ago

The data generated by the twitter.R code actually lives on our team Google Drive: SNAPP-Soil_Carbon > Data > Twitter

What would be awesome is to have a script that automatically runs twitter.R every 7 days to run a new search and appends those new results to a master file. So that would require generating a master .csv in the Google Drive folder (or you could do it all in aurora) and then probably changing the twitter.R script so it imports a master file, appends the new search results, and then re-exports the same master file.

Does cronR run the script automatically?

grantnolasco commented 5 years ago

I was able to finish the automation script and followed the steps as mentioned above. This can be seen on this link: https://github.com/Science-for-Nature-and-People/soc-twitter/blob/automation/automate.R. Let me know if the code isn't clear enough or if there are any problems. For now, I plan on working on adding the csv files from the Google Drive folder to the master dataset.

Also, cronR runs the script automatically.

swood-ecology commented 5 years ago

Could you separate out the automated raw data download from the data processing? I'd like to keep the raw data as unchanged as possible in case we want to go back to it.

In this script I would only run a new search, rbind those results with a master csv, export that, and have the whole script on a 7-day automated cycle.

The other formatting / variable selection stuff I'd have in a separate data formatting script that can be run on the master.csv before doing analyses.

Also, for those two master.csv files (one with RT and one without), do those include both the live API and the original purchased data? I feel like we should keep those data streams separate as well and have a separate script that merges those.

swood-ecology commented 5 years ago

@grantnolasco Is the cron script running right now or should I still be downloading data manually?

brunj7 commented 5 years ago

Hi @swood-ecology , We still need to do some testing on this. It is better if you keep downloading data from the API manually for now.

Regarding the workflow, we are working on a set of functions separating the download from the processing and merging steps. The idea is that since the API csv files will be storing all the information (and should not be deleted), we do not need to save an intermediate data sets and just update the RT noRT final datasets.

grantnolasco commented 5 years ago

I'm currently finishing the function to download the tweets from search_tweets2 function and then writing it to a csv. However, there are some columns that are list, which gives me this error: "Error in write.table(twitterAPI_new, file.name, col.names = NA, sep = ",", : unimplemented type 'list' in 'EncodeElement'" I'm thinking of just converting the columns classified as lists to character.

grantnolasco commented 5 years ago

While I was working on the merge_master function, I noticed that the latest csv's (late 2018/2019) didn't have as much values in the text columns with RT in the beginning compared to the earlier csv. I was seeing a lot of repeat tweets among different users, which should be incredibly rare. Also, there were some tweets that had RT in the beginning of the tweet even though it actually wasn't a retweet. Hence, this makes it unreliable to use RT in the beginning of tweets as a way to filter out retweets.

Therefore, I am trying to find a way to merge the earlier csv with the later csv regarding retweets. I'm thinking of using the retweet column from API csv, but the problem is that there doesn't seem to be a column similar to that in the json file. I'm interested to hear your thoughts about this situation.

Lastly, I need to work on figuring out the problem with my code regarding the column hashtags. The merged dataset has different values compared to the latest csv from the API. Should be an easy fix though.

grantnolasco commented 5 years ago

I finished updating the api_csv function to fix the processing of the hashtag columns. I'm currently working on the RT problem by looking the urls and seeing whether RT in the beginning of the tweets are reliable. Also, I uploaded the code to Github: https://github.com/Science-for-Nature-and-People/soc-twitter/blob/automation/downloadAPI.R

swood-ecology commented 5 years ago

awesome. is this the code I should start using for downloading from the API? or should I wait until you fix the RT problem before switching code?

grantnolasco commented 5 years ago

I think you should wait until I fix the RT problem before switching the code.

For my progress today, it seems like we're going to be using is_retweet as an indicator if the tweet was a retweet or not instead of using RT in the beginning of the tweet. For the archive json, we will be using RT in the beginning as the indicator as seen in this documentation: http://support.gnip.com/articles/identifying-and-understanding-retweets.html. Currently, I'm trying to figure out if there's a way to actually determine whether the tweet is a retweet by comparing the body and object.body column. It seems like if it isn't a retweet, then object.body == NA. Also, I want to experiment with the object.type column. However, I need to do some more testing on that. Once I'm done with that, I want to create a new column for the json dataset called is_retweet so it matches up with the API csv. I will probably create a new function for that.

swood-ecology commented 5 years ago

Sounds good! Thank you!

From: grantnolasco notifications@github.com Reply-To: Science-for-Nature-and-People/soc-twitter reply@reply.github.com Date: Monday, April 22, 2019 at 6:47 PM To: Science-for-Nature-and-People/soc-twitter soc-twitter@noreply.github.com Cc: Stephen Wood stephenawood@gmail.com, Mention mention@noreply.github.com Subject: Re: [Science-for-Nature-and-People/soc-twitter] Develop script to automatically update master data file (#1)

I think you should wait until I fix the RT problem before switching the code.

For my progress today, it seems like we're going to be using is_retweet as an indicator if the tweet was a retweet or not instead of using RT in the beginning of the tweet. For the archive json, we will be using RT in the beginning as the indicator as seen in this documentation: http://support.gnip.com/articles/identifying-and-understanding-retweets.html. Currently, I'm trying to figure out if there's a way to actually determine whether the tweet is a retweet by comparing the body and object.body column. It seems like if it isn't a retweet, then object.body == NA. Also, I want to experiment with the object.type column. However, I need to do some more testing on that. Once I'm done with that, I want to create a new column for the json dataset called is_retweet so it matches up with the API csv. I will probably create a new function for that.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/Science-for-Nature-and-People/soc-twitter/issues/1#issuecomment-485578714, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AD6Y5RAQ3DHNUBIEZ5KWPELPRY6AVANCNFSM4GXFRNAA.

grantnolasco commented 5 years ago

While creating function to add is_retweet column to master file, it seems like there's a problem with the CSV files as there seems to be a problem with the description column for certain tweets. So, some tweets have something in the description column that cuts off and then creates a new row. For example, this person has a twitter bio "Head of Horticulture Soil Association. Eastbrook Farm #Agroforestry. Author. Occasional Folk Singer @Mayfish2016 Even more occasional knitter.", but it creates three rows where the description value is cut off ("Head of Horticulture Soil Association. ") and then the next two rows have user_id of "Eastbrook Farm #Agroforestry. Author. ", "Occasional Folk Singer @Mayfish2016", and "Even more occasional knitter". I am currently unaware on how to fix this since it doesn't seem like there's a pattern.

Also, the fixed csv or the latest test on api_csv function doesn't have this problem.

swood-ecology commented 5 years ago

interesting--and annoying--problem. for each unique screen_name could you concatenate as a string all of the corresponding entries in user_id? if you were worried that other things were duplicated, you could create a separate lookup table for users vs. tweets where the user database had everything on the user, and the tweet database had everything on the tweet. then they'd be linked by the screen_name variable. not sure if that makes any sense since you know the data better than I do.

grantnolasco commented 5 years ago

While looking into the problem, it seems like there are only 30 instances of this throughout the folder of API_csv. Also, this problem only happens with five users (Ben_Raskin, Gfoe_ovg, BPlanques, outPopco, ajrajbjp). For some reason, there's a problem with their bio such that it creates a new row in the middle of their bio. I can't figure out what's so different about their bio. I could either fix this problem manually but it might take a good amount of time to fix or I can just remove the 30 instances.

swood-ecology commented 5 years ago

I just looked up their profiles to see if they’re relevant users. I think we want @Ben_Raskin and @ajrajbjp in the data set but the others could be dropped. Would that cut down on time to do a manual fix?

Sent from my iPad

On May 3, 2019, at 19:19, grantnolasco notifications@github.com wrote:

While looking into the problem, it seems like there are only 30 instances of this throughout the folder of API_csv. Also, this problem only happens with five users (Ben_Raskin, Gfoe_ovg, BPlanques, outPopco, ajrajbjp). For some reason, there's a problem with their bio such that it creates a new row in the middle of their bio. I can't figure out what's so different about their bio. I could either fix this problem manually but it might take a good amount of time to fix or I can just remove the 30 instances.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

grantnolasco commented 5 years ago

Julien and I figured out the problem due to a hidden character in their bios (^M), which caused the shift in rows. I'm currently finishing up is_retweet function and should be done soon! The main problem I'm experiencing is that Aurora keeps on failing on me when attempting to concatenate all of the archive csv files together. It's taking a long time or the server just crashes on me.

grantnolasco commented 5 years ago

I finished creating the following functions:

is_retweet: adds the column is_retweet to the master file for a more accurate indicator of retweets
api_csv: searches tweets using Twitter API, create csv file based on the new tweets, and then send it over to API_csv directory
merge_master: merges a csv file to the RT and no RT master files

Let me know if there are any mistakes!

swood-ecology commented 5 years ago

Great. Should I close this issue? Is this fully automated and downloading data yet?

brunj7 commented 5 years ago

Let me do a code review and then we can decide

brunj7 commented 5 years ago

@grantnolasco I just pushed a new version with my comments cfe3a1d.

Looks good!

My main comment is that the check on duplicates (ie if the tweet is already in the master file) is missing. I put a placeholder in the code where I think it should be done.

Other comments:

write a csv with the new data you get from the API
the hastags column is still a vector when you transform the API into a dataframe (line 35); you should collapse that
line 75: not sure why we are replacing NA with ""
line 89: I think it is good to keep the iso code and country names as it is easier to use for joins;
line 114: you might want to do this before splitting the dataset into 2

grantnolasco commented 5 years ago

Tasks finished

I added the is_retweet function to the automate.R script to add is_retweet column to master. Also, I added the additional data from Google Data in this function. Fixed some additional problems with function too
I wrote a csv with the new data you get from the API
Regarding the question on why we are replacing NA with "", I'm just copying the steps from raw_data_processing.R-
Collapsed the hashtags column
Kept the columns for iso code and country names

Working on

Removing duplicates

swood-ecology commented 5 years ago

Fantastic. I'm still manually downloading the data until you tell me to stop.

swood-ecology commented 5 years ago

Fantastic. I'm still manually downloading the data until you tell me to stop.

grantnolasco commented 5 years ago

I finished the code that should be removing any duplicates. I'll be testing whether the duplicates are actually removed on Friday, but it should work! Let me know if there's anything I should fix or if I didn't do any of the suggested comments.

grantnolasco commented 5 years ago

A summary of what I've done:

There are two files created for this task, which are found in the automation branch:

is_retweet.R: script that creates a function to add retweet column to json file and then to merge it with the csv files found in API_csv folder. Before this, we were using RT in the beginning of each tweet as an indicator that it was a retweet. However, we found out that recent updates to Twitter API don't add RT anymore and people actually would have tweets starting with RT even though they actually didn't. Hence, this function fixes that problem
automate.R: script that will update the master file with new tweets. The new tweets requested by the twitter api package are sent to a csv folder called API_csv, which will be unprocessed. After that, these new tweets are processed to match the master files similar to the steps shown in raw_data_processing.R

kylemonper commented 5 years ago

@brunj7 minor bug: there is a disconnect within the new data frames between the 'query' and 'hits' columns that is leading me to believe that the 'hits' column is no longer being updated

swood-ecology commented 5 years ago

Do you think this is a bug with the recent automation script? Or has it been happening for more than a few weeks?

From: kylemonper notifications@github.com Reply-To: Science-for-Nature-and-People/soc-twitter reply@reply.github.com Date: Monday, June 24, 2019 at 6:08 PM To: Science-for-Nature-and-People/soc-twitter soc-twitter@noreply.github.com Cc: Stephen Wood stephenawood@gmail.com, Mention mention@noreply.github.com Subject: Re: [Science-for-Nature-and-People/soc-twitter] Develop script to automatically update master data file (#1)

@brunj7https://github.com/brunj7 minor bug: there is a disconnect within the new data frames between the 'query' and 'hits' columns that is leading me to believe that the 'hits' column is no longer being updated

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/Science-for-Nature-and-People/soc-twitter/issues/1?email_source=notifications&email_token=AD6Y5RBOZAECB2PGQTWFTPTP4EEQVA5CNFSM4GXFRNAKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYNYAFY#issuecomment-505118743, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AD6Y5RFAIGR2DWTLP4JKZWLP4EEQVANCNFSM4GXFRNAA.

kylemonper commented 5 years ago

it goes back at least a few months. So far I've only noticed it in regards to regenerative agriculture (there are no hits on those terms within the data frame) so maybe the code for populating the hits just hasn't been updated to include regenerative ag terms? I'm going to dig into a bit more and see how it's working with the other query terms

Edit: I'm pretty sure that the problem is confined to regenerative agriculture not being updated in the hits column.

swood-ecology commented 5 years ago

Looks like this is up-and-running

Science-for-Nature-and-People / soc-twitter

Develop script to automatically update master data file #1

Tasks finished

Working on