This might sound like a non-answer, but first can I ask how you are planning on comparing the different datasets?
I have different approaches to compare the two datasets. I've been collecting Twitter data for years with a tool that interacts with the streaming API (t-hoarder https://github.com/congosto/t-hoarder), and now I am using twarc to search a particular period (March-April 2018).
I am studying how the tweets I got back then (2018: around 600,000+ tweets for a particular day) differ from a new search of those same days done with the Twitter API v2.0 and twarc2 (2021: around 400,000+ tweets for that day).
In my initial collection there are some "holes" in the data from when t-hoarder stopped working (in blue, tweets that are also in the twarc database; in pink, the number of tweets that are not):
The amount of tweets per day that are lost after 3 years seems stable: around 30-40%.
Search with Twarc (opposite analysis):
Twarc gets around 5% of tweets per day that t-hoarder was not able to obtain.
The percentages by hour are similar to the daily ones during the hours of most activity; at night and early in the morning the loss of tweets is much higher.
Pending questions are to analyze which tweets have disappeared (removed, protected): are they from a particular type of user? A particular kind of content?
Analyze with Gephi the RT support and attack networks (I am analyzing political scandals in Spain) in both datasets. Are the same networks detected? For this I need to trim the downloaded twarc tweets.
I have already analyzed the RT networks in the t-hoarder database. My problem is applying the same time limits to the twarc database and filtering so that only RT relationships are analyzed (they are the most important ones for showing the creation of support-attack networks), not all relationships. I've already produced different Gephi network analyses with the twarc tweets.
So far I am also testing network analysis with igraph in R, but I have to trim the graphs to be able to process the files in R, otherwise R crashes. My ultimate goal is to automate the process of network analysis. For now it is a very manual process with Gephi (see the step-by-step guide http://periodisme-dades.recursos.uoc.edu/es/6-1-4-preguntas-a-resolver/ in Spanish).
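For reference, extracting only the RT relationships from the twarc2 output could look something like this rough Python sketch. It assumes raw twarc2 search output where each line is an API response and the retweeted tweets come back under includes.tweets; the file names tweets.jsonl and rt_edges.csv are just placeholders:

import json
from collections import Counter

# edge counter: (retweeter author_id, retweeted author_id) -> number of RTs
edges = Counter()

for line in open("tweets.jsonl"):
    page = json.loads(line)

    # map referenced tweet id -> its author, using the expansions that come
    # back in includes.tweets (assumption: they are present in the response)
    ref_authors = {
        t["id"]: t.get("author_id")
        for t in page.get("includes", {}).get("tweets", [])
    }

    for tweet in page.get("data", []):
        author = tweet.get("author_id")
        for ref in tweet.get("referenced_tweets", []):
            if ref["type"] == "retweeted" and author and ref_authors.get(ref["id"]):
                edges[(author, ref_authors[ref["id"]])] += 1

# write a simple weighted edge list that Gephi or igraph can import
with open("rt_edges.csv", "w") as out:
    out.write("Source,Target,Weight\n")
    for (source, target), weight in edges.items():
        out.write(f"{source},{target},{weight}\n")

The idea is to keep only the retweet edges, so the support-attack networks can be built without loading everything into R first.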
Wow, this is great @numeroteca. I especially like how you are comparing these different tools, and am really interested to see what you find.
Since you seem to be comfortable working in R, does it make sense to subset the data by date using R? I was going to suggest a small Python program, but if you are more familiar with R that should be the way to go. Each tweet has a created_at property that should be usable to partition the data.
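For example, created_at comes back as an ISO 8601 string, so it can be parsed directly (a small Python illustration with a made-up timestamp; the same idea applies in R):

from dateutil.parser import parse as parse_date

# created_at in the v2 payload is an ISO 8601 string, for example:
created_at = parse_date("2021-07-14T09:15:02.000Z")
print(created_at.date())  # 2021-07-14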
To work with the .jsonl files in R, what I do is convert them to .csv, read them into R and process them. I can easily filter by datetime in R. The problem is that to create the .gexf with twarc2 I need the .jsonl. That is the question: **can I create the .gexf from the processed .csv (or something I output from R)?** That's why I was looking for a way to do this without going through R.
My goal was to streamline the process.
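One option I have been wondering about: if the filtered edge list comes out of R as a .csv, maybe the .gexf could be written directly with networkx instead of twarc2. A sketch, where rt_edges.csv and the source/target column names are only hypothetical:

import csv
import networkx as nx

g = nx.DiGraph()

# hypothetical CSV exported from R with "source" and "target" columns
with open("rt_edges.csv") as f:
    for row in csv.DictReader(f):
        g.add_edge(row["source"], row["target"])

nx.write_gexf(g, "network.gexf")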
On a side note, here is, for example, a script I made in bash to convert all my .jsonl files to .gexf so I can open them in Gephi:
#!/bin/bash -

timestamp() {
  date +"%T" # current time
}

for f in original/name-of-files*.jsonl; do
  # keep a log of the processed files
  printf "%s\n" "$f" >> mycifuentes.txt
  echo "$(timestamp): here we go"
  echo "$f"
  # create the gexf file (network/<name>.gexf, dropping the directory and .jsonl extension)
  twarc2 network "$f" --format gexf "network/$(basename "$f" .jsonl).gexf"
done
In case it's helpful here is how I would do it in Python. This program will read in data collected with twarc2 and will filter out any tweets that weren't sent between 2021-07-14 00:00:00 and 2021-07-15 00:00:00 UTC.
import json
from datetime import datetime, timezone
from dateutil.parser import parse as parse_date

start = datetime(2021, 7, 14, tzinfo=timezone.utc)
end = datetime(2021, 7, 15, tzinfo=timezone.utc)

for line in open('tweets.jsonl'):
    data = json.loads(line)

    # get the last tweet in the response
    tweet = data['data'][-1]

    # get the time the tweet was created
    created_at = parse_date(tweet['created_at'])

    # print out the original data if it falls within the start/end range
    if created_at >= start and created_at <= end:
        print(line, end='')
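If it helps, one way to use it (assuming the script is saved as filter_dates.py, just a placeholder name) would be `python filter_dates.py > filtered.jsonl` and then `twarc2 network filtered.jsonl --format gexf network.gexf`, so the date filtering happens before the graph is built.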
Maybe this kind of functionality would be a useful plugin? Something like:
twarc2 filter-dates --start 2021-07-14 --end 2021-07-15 tweets.jsonl > filtered-tweets.jsonl
It will definitely be useful for me! That would allow me to download tweets into one big file and then split them into the chunks I need. I'd need the extra feature of selecting the time of day as well, not only the date. I don't know if that complicates it too much.
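Looking at the script above, I guess selecting the time of day would just mean passing start and end values that include hours and minutes, something like this untested tweak:

from datetime import datetime, timezone

# e.g. only keep tweets between 08:00 and 14:30 UTC on 2021-07-14
start = datetime(2021, 7, 14, 8, 0, tzinfo=timezone.utc)
end = datetime(2021, 7, 14, 14, 30, tzinfo=timezone.utc)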
I wonder if we can do both. I had a similar idea for splitting the output: https://github.com/DocNow/twarc/issues/496#issuecomment-867935406. Maybe instead of a plugin it should be in core twarc, since it won't require any extra dependencies and will be useful to have.
@igorbrigadir was your idea here to have a twarc command for binning the data by time? It sounds like @numeroteca is interested in a very specific time range, so that might not help here. That being said, a way to take collected data and bin it by day/month/year would be super useful.
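For example, binning by day might look something like this rough sketch (same assumption as the filtering script above, namely that the last tweet in each response line can stand in for the whole line; tweets.jsonl is a placeholder):

import json
from dateutil.parser import parse as parse_date

# keep one output file handle per day, e.g. tweets-2021-07-14.jsonl
files = {}

for line in open("tweets.jsonl"):
    data = json.loads(line)

    # use the last tweet in the response to decide which day this line belongs to
    created_at = parse_date(data["data"][-1]["created_at"])
    day = created_at.strftime("%Y-%m-%d")

    if day not in files:
        files[day] = open(f"tweets-{day}.jsonl", "w")
    files[day].write(line)

for f in files.values():
    f.close()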
@numeroteca since it seems like you have some facility with R, is there a reason why you don't read in the JSON and filter it yourself before sending it off to `twarc2 network`?
Oh, I meant we should have both: the `twarc2 filter-dates` command to filter on dates (but not as a plugin, in main twarc), and also a command or output option to split output files. I like this idea: https://github.com/DocNow/twarc/issues/496#issuecomment-905448755
> @numeroteca since it seems like you have some facility with R, is there a reason why you don't read in the JSON and filter it yourself before sending it off to `twarc2 network`?
Certainly not, I could filter it with R. I haven't tried, but I guess it should work; I'll try and report back. I was just trying to do as much as possible with twarc before moving to the analysis in R.
Closing - I think we can use #496 to track any movement on this one.
> In case it's helpful here is how I would do it in Python. This program will read in data collected with twarc2 and will filter out any tweets that weren't sent between 2021-07-14 00:00:00 and 2021-07-15 00:00:00 UTC.
I am trying to use the Python script @edsu provided, but I don't know how to modify it to output and save the filtered results in a JSON file. Could you help with this?
Oh, sure, something like this maybe:
import json
from datetime import datetime, timezone
from dateutil.parser import parse as parse_date

start = datetime(2021, 7, 14, tzinfo=timezone.utc)
end = datetime(2021, 7, 15, tzinfo=timezone.utc)

# open the output file once, in write mode
with open('output.jsonl', 'w') as out:
    for line in open('tweets.jsonl'):
        data = json.loads(line)

        # get the last tweet in the response
        tweet = data['data'][-1]

        # get the time the tweet was created
        created_at = parse_date(tweet['created_at'])

        # write the original line (it already ends with a newline)
        # if it falls within the start/end range
        if created_at >= start and created_at <= end:
            out.write(line)
I've split my twarc searches into consecutive date periods. I can easily join the files with `cat file1.jsonl file2.jsonl`. Now what I need is to select specific tweets from the downloaded data: from one date and time to another. Is there an easy, straightforward way to do it?
Why? I am comparing the dataset downloaded with twarc to datasets I previously downloaded (with the streaming API), and I need them to cover the same period.