This might sound like a non-answer, but first can I ask how you are planning on comparing the different datasets?
I have different approaches to compare the two datasets. I've been collecting Twitter data for years with a tool that interacts with the streaming API (t-hoarder https://github.com/congosto/t-hoarder), and now I am using twarc to search a particular period (March-April 2018).
I am studying how the tweets I got back then (2018: around 600,000+ tweets for a particular day) differ from a new search of those same days done with the Twitter API v2.0 and twarc2 (2021: around 400,000+ tweets for that day).
In my initial collection there are some "holes" in the data from when t-hoarder stopped working (in blue, tweets that are also in the twarc database; in pink, the number of tweets that are not):
The amount of tweets per day that are lost after 3 years seems stable: around 30-40%.
Search with Twarc (opposite analysis):
Twarc gets around 5% of tweets per day that t-hoarder was not able to obtain.
The percentages by hour are similar to the daily ones during the hours of most activity; at night and early in the morning the loss of tweets is much higher.
Pending questions are to analyze which tweets have disappeared (removed, protected): are they from a particular type of user? A particular kind of content?
Analyze with Gephi the RT support and attack networks (I am analyzing political scandals in Spain) in both datasets. Are the same networks detected? For this I need to trim the downloaded twarc tweets.
I have already analyzed the RT networks in the t-hoarder database. My problem is applying the same time limits to the twarc database and filtering so that only RT relationships are analyzed (they are the most important ones for showing the creation of support-attack networks), not all relationships. I've already produced different Gephi network analyses with the twarc tweets.
So far I am also testing network analysis with igraph in R, but I have to trim the graphs to be able to process the files in R, otherwise R crashes. My ultimate goal is to automate the process of network analysis. For now it is a very manual process with Gephi (see the step-by-step guide http://periodisme-dades.recursos.uoc.edu/es/6-1-4-preguntas-a-resolver/ in Spanish).
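For reference, extracting only the RT relationships from the twarc2 output could look something like this rough Python sketch. It assumes raw twarc2 search output where each line is an API response and the retweeted tweets come back under includes.tweets; the file names tweets.jsonl and rt_edges.csv are just placeholders:

import json
from collections import Counter

# edge counter: (retweeter author_id, retweeted author_id) -> number of RTs
edges = Counter()

for line in open("tweets.jsonl"):
    page = json.loads(line)

    # map referenced tweet id -> its author, using the expansions that come
    # back in includes.tweets (assumption: they are present in the response)
    ref_authors = {
        t["id"]: t.get("author_id")
        for t in page.get("includes", {}).get("tweets", [])
    }

    for tweet in page.get("data", []):
        author = tweet.get("author_id")
        for ref in tweet.get("referenced_tweets", []):
            if ref["type"] == "retweeted" and author and ref_authors.get(ref["id"]):
                edges[(author, ref_authors[ref["id"]])] += 1

# write a simple weighted edge list that Gephi or igraph can import
with open("rt_edges.csv", "w") as out:
    out.write("Source,Target,Weight\n")
    for (source, target), weight in edges.items():
        out.write(f"{source},{target},{weight}\n")

The idea is to keep only the retweet edges, so the support-attack networks can be built without loading everything into R first.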
Wow, this is great @numeroteca. I especially like how you are comparing these different tools, and am really interested to see what you find.
Since you seem to be comfortable working in R, does it make sense to subset the data by date using R? I was going to suggest a small Python program, but if you are more familiar with R that should be the way to go. Each tweet has a created_at property that should be usable to partition the data.
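For example, created_at comes back as an ISO 8601 string, so it can be parsed directly (a small Python illustration with a made-up timestamp; the same idea applies in R):

from dateutil.parser import parse as parse_date

# created_at in the v2 payload is an ISO 8601 string, for example:
created_at = parse_date("2021-07-14T09:15:02.000Z")
print(created_at.date())  # 2021-07-14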
To work with the .jsonl files in R, what I do is convert them to .csv, read them into R and process them. I can easily filter by datetime in R. The problem is that to create the .gexf with twarc2 I need the .jsonl. That is the question: **can I create the .gexf from the processed .csv (or something I output from R)?** That's why I was looking for a way to do this without going through R.
My goal was to streamline the process.
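One option I have been wondering about: if the filtered edge list comes out of R as a .csv, maybe the .gexf could be written directly with networkx instead of twarc2. A sketch, where rt_edges.csv and the source/target column names are only hypothetical:

import csv
import networkx as nx

g = nx.DiGraph()

# hypothetical CSV exported from R with "source" and "target" columns
with open("rt_edges.csv") as f:
    for row in csv.DictReader(f):
        g.add_edge(row["source"], row["target"])

nx.write_gexf(g, "network.gexf")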
On a side note, here is, for example, a script I made in bash to convert all my .jsonl files to .gexf so I can open them in Gephi:
#!/bin/bash -

timestamp() {
  date +"%T" # current time
}

for f in original/name-of-files*.jsonl; do
  # keep a log of the processed files
  printf "%s\n" "$f" >> mycifuentes.txt
  echo "$(timestamp): here we go"
  echo "$f"
  # create the gexf file (network/<name>.gexf, dropping the directory and .jsonl extension)
  twarc2 network "$f" --format gexf "network/$(basename "$f" .jsonl).gexf"
done
In case it's helpful here is how I would do it in Python. This program will read in data collected with twarc2 and will filter out any tweets that weren't sent between 2021-07-14 00:00:00 and 2021-07-15 00:00:00 UTC.
import json
from datetime import datetime, timezone
from dateutil.parser import parse as parse_date

start = datetime(2021, 7, 14, tzinfo=timezone.utc)
end = datetime(2021, 7, 15, tzinfo=timezone.utc)

for line in open('tweets.jsonl'):
    data = json.loads(line)

    # get the last tweet in the response
    tweet = data['data'][-1]

    # get the time the tweet was created
    created_at = parse_date(tweet['created_at'])

    # print out the original data if it falls within the start/end range
    if created_at >= start and created_at <= end:
        print(line, end='')
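If it helps, one way to use it (assuming the script is saved as filter_dates.py, just a placeholder name) would be `python filter_dates.py > filtered.jsonl` and then `twarc2 network filtered.jsonl --format gexf network.gexf`, so the date filtering happens before the graph is built.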
Maybe this kind of functionality would be a useful plugin? Something like:
twarc2 filter-dates --start 2021-07-14 --end 2021-07-15 tweets.jsonl > filtered-tweets.jsonl
It will definitely be useful for me! That would allow me to download tweets into one big file and then split them into the chunks I need. I'd need the extra feature of selecting the time of day as well, not only the date. I don't know if that complicates it too much.
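Looking at the script above, I guess selecting the time of day would just mean passing start and end values that include hours and minutes, something like this untested tweak:

from datetime import datetime, timezone

# e.g. only keep tweets between 08:00 and 14:30 UTC on 2021-07-14
start = datetime(2021, 7, 14, 8, 0, tzinfo=timezone.utc)
end = datetime(2021, 7, 14, 14, 30, tzinfo=timezone.utc)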
I wonder if we can do both. I had a similar idea for splitting the output: https://github.com/DocNow/twarc/issues/496#issuecomment-867935406. Maybe instead of a plugin it should be in core twarc, since it won't require any extra dependencies and will be useful to have.
@igorbrigadir was your idea here to have a twarc command for binning the data by time? It sounds like @numeroteca is interested in a very specific time range, so that might not help here. That being said, a way to take collected data and bin it by day/month/year would be super useful.
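For example, binning by day might look something like this rough sketch (same assumption as the filtering script above, namely that the last tweet in each response line can stand in for the whole line; tweets.jsonl is a placeholder):

import json
from dateutil.parser import parse as parse_date

# keep one output file handle per day, e.g. tweets-2021-07-14.jsonl
files = {}

for line in open("tweets.jsonl"):
    data = json.loads(line)

    # use the last tweet in the response to decide which day this line belongs to
    created_at = parse_date(data["data"][-1]["created_at"])
    day = created_at.strftime("%Y-%m-%d")

    if day not in files:
        files[day] = open(f"tweets-{day}.jsonl", "w")
    files[day].write(line)

for f in files.values():
    f.close()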
@numeroteca since it seems like you have some facility with R, is there a reason why you don't read in the JSON and filter it yourself before sending it off to `twarc2 network`?
Oh, I meant we should have both: the `twarc2 filter-dates` command to filter on dates (but not as a plugin, in main twarc), and also a command or output option to split output files. I like this idea: https://github.com/DocNow/twarc/issues/496#issuecomment-905448755
> @numeroteca since it seems like you have some facility with R, is there a reason why you don't read in the JSON and filter it yourself before sending it off to `twarc2 network`?
Certainly not, I could filter it with R. I haven't tried, but I guess it should work; I'll try and report back. I was just trying to do as much as possible with twarc before moving to the analysis in R.
Closing - I think we can use #496 to track any movement on this one.
> In case it's helpful here is how I would do it in Python. This program will read in data collected with twarc2 and will filter out any tweets that weren't sent between 2021-07-14 00:00:00 and 2021-07-15 00:00:00 UTC.
I am trying to use the Python script @edsu provided, but I don't know how to modify it to output and save the filtered results in a JSON file. Could you help with this?
Oh, sure, something like this maybe:
import json
from datetime import datetime, timezone
from dateutil.parser import parse as parse_date

start = datetime(2021, 7, 14, tzinfo=timezone.utc)
end = datetime(2021, 7, 15, tzinfo=timezone.utc)

# open the output file once, in write mode
with open('output.jsonl', 'w') as out:
    for line in open('tweets.jsonl'):
        data = json.loads(line)

        # get the last tweet in the response
        tweet = data['data'][-1]

        # get the time the tweet was created
        created_at = parse_date(tweet['created_at'])

        # write the original line (it already ends with a newline)
        # if it falls within the start/end range
        if created_at >= start and created_at <= end:
            out.write(line)
I've split my twarc searches into consecutive date periods. I can easily join the files with `cat file1.jsonl file2.jsonl`. Now what I need is to select specific tweets from the downloaded data: from one date and time to another. Is there an easy, straightforward way to do it?
Why? I am comparing the dataset downloaded with twarc to datasets I previously downloaded (with the streaming API), and I need them to cover the same period.