digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0

import csv downloaded with academictwitteR produces bin with empty usernames #439

Open alejandrofeged opened 3 years ago

alejandrofeged commented 3 years ago

I have downloaded a dataset of tweets using the academictwitteR library in R. When I convert it to a csv file and try to import it into my DMI-TCAT machine, everything imports fine except the usernames, which are empty.

I have posted a detailed question on Stack Overflow.

The code used to download and prepare the data is:

library(academictwitteR)
# set_bearer()  # store your bearer token first if you have not already
require(tibble)

my_query <- build_query(c("Biden", "Trump"))

# Use get_all_tweets to collect the data. Make sure to specify data_path
# and set bind_tweets to FALSE.
tweets <- get_all_tweets(
  query = my_query,
  start_tweets = "2016-01-16T00:00:00Z",
  end_tweets = "2016-06-15T00:00:00Z",
  n = 1000,
  data_path = "data/",
  has_mentions = TRUE,
  bind_tweets = FALSE
)

# Bind the stored JSON files into a tidy data frame and write it out;
# row.names = FALSE avoids an extra unnamed column in the csv.
ttt <- bind_tweets(data_path = "data/", output_format = "tidy")
write.csv(ttt, "data-tweets.csv", row.names = FALSE)

I then run the PHP import script from the terminal on my Multipass machine:

php import-auto-csv.php data-tweets.csv elections2016

where elections2016 is a previously created bin (I noticed that if I don't create it in advance, the bin does not show up in the capture or analysis interfaces).

The dates are also set to 0000-00-00 for all tweets, but many other fields are imported correctly.

Any help is appreciated.

dale-wahl commented 3 years ago

Hello alejandrofeged, sorry for the delayed response.

I am not familiar with the academictwitteR library, but I have a hunch about what your problem may be. For import-auto-csv.php to work correctly, you will need to modify it to map the column names from your csv to the column names expected by TCAT. On this line of import-auto-csv.php, the assumptions function essentially converts the csv columns to the columns TCAT needs.

You will want to go through your csv file and add a line to the assumptions function for each field. E.g.:

'your_csv_author_twitter_user_name_column_name' => 'from_user_name', 
'your_csv_author_twitter_user_id_column_name' => 'from_user_id',                 
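
If you are not sure which columns your csv actually contains, a quick R snippet will list them for you (a minimal sketch; data-tweets.csv is the file name from your example above):

# Read only the header row and print the column names you need to map.
cols <- names(read.csv("data-tweets.csv", nrows = 1, check.names = FALSE))
print(cols)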

Let me know if that helps you any.

alejandrofeged commented 3 years ago

Thank you so much, it did the trick! I was changing the column names in the original file, but substituting them into the lines you suggested worked like a charm.

Any clues as to how to import the dates correctly? The column has the same name.

dale-wahl commented 3 years ago

I'm glad that worked out for you!

For the date, if you map the date column as 'your_csv_date_column_name' => 'created_at:DATEPARSE', it will try to parse the date for you. I believe it is using this date_parse function to do so. If mapping your date column to created_at:DATEPARSE does not work by itself, you can try reformatting your dates to 2021-08-26 11:30:00, which is the format TCAT expects.
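
If you do need to reformat, here is a minimal R sketch on the export side. It assumes the tidy output's created_at column holds ISO-8601 timestamps such as 2016-01-16T12:34:56.000Z; adjust the format string if your data differs:

# Rebuild the csv with created_at rewritten in the layout TCAT expects.
ttt <- bind_tweets(data_path = "data/", output_format = "tidy")
ttt$created_at <- format(
  as.POSIXct(ttt$created_at, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC"),
  "%Y-%m-%d %H:%M:%S"
)
write.csv(ttt, "data-tweets.csv", row.names = FALSE)

Let me know if that's successful.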

alejandrofeged commented 3 years ago

It does help, but I think it will require more tweaking; it is not importing the correct dates yet, but at least it is progress.

One more question: I have been creating the bins manually. If I import into a bin that hasn't been created, the process runs, but the bin does not display in the interface.

dale-wahl commented 3 years ago

You’ll have to find the error message in the logs for me to help with that. It should create the bin using the second argument you provide (elections2016), though it will not create it if it already exists.

laurieresearch commented 3 years ago

Thanks for raising this issue and @dale-wahl for your answers.

With the exception of urls, I've found I can import most fields from academictwitteR's Tidy output format with the following mappings:

'tweet_id' => array( 'id', 'id_str' ),
'user_username' => 'from_user_name',
'text' => 'text',
'possibly_sensitive' => 'possibly_sensitive',
'conversation_id' => 'in_reply_to_status_id',
'created_at' => 'created_at:DATEPARSE',
'source' => 'source',
'author_id' => 'from_user_id',
'lang' => 'lang',
'in_reply_to_user_id' => 'in_reply_to_user_id',
'user_profile_image_url' => 'from_user_profile_image_url',
'user_description' => 'from_user_description',
'user_name' => 'from_user_realname',
'user_created_at' => 'from_user_created_at',
'user_url' => 'from_user_url',
'user_verified' => 'from_user_verified',
'user_location' => 'location',
'retweet_count' => 'retweet_count',
'like_count' => 'favorite_count',
'user_tweet_count' => 'from_user_tweetcount',
'user_list_count' => 'from_user_listed',
'user_followers_count' => 'from_user_followercount',
'user_following_count' => 'from_user_friendcount',
'sourcetweet_lang' => 'from_user_lang',
'sourcetweet_id' => 'quoted_status_id',
'sourcetweet_author_id' => 'retweet_id'

Although a full csv export of the data imported into TCAT returns a populated urls column, running the Tweets Stats module on "url frequency" returns nothing. The populated urls in the full export are all in the native Twitter url format e.g. https://t.co/... and the "urls_expanded" and "urls_followed" columns are empty.

I'd be grateful if anyone has insights into how to address this.

Thanks Laurie

dale-wahl commented 3 years ago

Hello Laurie,

The import script looks for columns named 'urls' or 'expanded_urls', which you can see here, and then attempts to parse them. The script essentially expects a comma-delimited list (e.g., http://someplace.com, https://someplace-else.com). If no 'urls' or 'expanded_urls' column is provided in your csv, the script will actually attempt to read the tweet text itself and parse out any urls there (which is what I think it may be doing in your case).

Right here is where the url parsing starts.

I think you will want to rename your url list column to 'urls' or 'expanded_urls' and see what results you get after that. You could also add the column name you used to this array here.
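
If you prefer to do the renaming on the export side, a minimal R sketch (assuming a hypothetical expanded_url column with one url per tweet; adjust the name to your data):

# Copy the url column under the name the import script looks for.
ttt$urls <- ttt$expanded_url
# If a column holds several urls per tweet in a list-column, collapse it first:
# ttt$urls <- vapply(ttt$expanded_url, paste, character(1), collapse = ", ")
write.csv(ttt, "data-tweets.csv", row.names = FALSE)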

I am not entirely sure if that will resolve your issue with the frequency module, but from the information provided, I want to make sure you are importing that data correctly. Let me know if that helps.

laurieresearch commented 3 years ago

Hi Dale

My attempts to follow your suggestions have so far been unsuccessful; I detail them here in case it is useful for others. Mapping both the 'url' and 'expanded_url' columns from academictwitteR to the 'urls' and 'urls_expanded' fields in TCAT results in double entries in the database and confuses the frequency module. Mapping one or the other R column to only the 'urls' field in TCAT results in identical entries for urls_expanded and urls_followed (e.g. urls = bit.ly/…, urls_expanded = bit.ly/…, urls_followed = bit.ly/…). As you point out, if no urls column is provided, the script parses the t.co/ url out of the tweet text into 'urls' but leaves NULL values in 'urls_expanded' and 'urls_followed'.

The only solution I can find that gives a meaningful frequency analysis (i.e. frequencies of the urls actually followed, rather than of t.co/ or bit.ly links) is to decode them in R (using the longurl library) and map the followed url to the TCAT field 'urls', which is convoluted and creates the same problem of identical entries for urls_expanded and urls_followed.
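
A minimal sketch of that longurl step, assuming a urls column of shortened links (expand_urls() returns orig_url/expanded_url pairs, though column names may vary by package version):

library(longurl)

# Follow each shortened link to its destination (this can be slow).
expanded <- expand_urls(ttt$urls)
# Replace the short urls with their expanded counterparts.
ttt$urls <- expanded$expanded_url[match(ttt$urls, expanded$orig_url)]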

Urls aside, it's helpful that most fields can be imported simply with the above mapping.

Thanks Laurie

dale-wahl commented 3 years ago

Hey Laurie,

Could you point me to the dataset you are trying to import? I would like to test it myself.

The identical entries for urls, urls_expanded, and urls_followed should not be a problem. urls_expanded is an old, deprecated Twitter field that, I think, actually held the long forms of urls like bit.ly links. urls_followed is supposed to end up populated with the urls_expanded/urls values. urls_followed is usually created by TCAT via the URL Expander, if it is enabled: a script that attempts to follow URLs in order to determine their final destination, essentially what you are attempting to do with the bit.ly/ urls. The url frequency should not "double count" urls that appear in multiple columns (at least it is not supposed to!).

When you set up TCAT, there is a setting to turn on the URL Expander; it is set to y or yes by default. You can check your setup file (docker/setup.sh for Docker, or your tcat-install-linux.sh file). If it is enabled, a script runs periodically and attempts to expand URLs.

The import-auto-csv.php file imports data as if URLs were already followed to their final destination, which essentially stops the URL Expander from looking at them. I have not tested this, but I believe that if you stop the import from updating the domain, error_code, and url_followed fields, the URL Expander should attempt to expand those for you. I believe this would involve commenting out or deleting Lines 338-342 and Line 345. Again, I have not yet tested this to be certain.

laurieresearch commented 3 years ago

Thank you Dale

Could you point me to the dataset you are trying to import?

The dataset was constructed using the same procedure outlined by @alejandrofeged. The academictwitteR github is here.

The import-auto-csv.php file… commenting out/deleting Lines 338-342 and Line 345.

I tried commenting out these lines and tested on two TCAT installations (both with the URL Expander turned on), but it doesn't seem to have worked. Looking through the urlexpand.log file, I can't see anything relating to the bins of imported data. The urls column still contains t.co links, and the expanded/followed urls are NULL.