Open luisignaciomenendez opened 2 years ago
Do you have a sample of what your dataframe contains? How is it generated in the first place? It's hard to say or compare it to the code otherwise.
Sure, I tried with a random sample using:
twarc2 sample sample.jsonl
(I have also done some extra trials, but this is the most immediate one.) I know this is hardly reproducible, as it's using a live stream of tweets, but I will try to attach/send you the original file that I have.
Here are the results: (for twarc only those that appear with a count=2)
twarc2 hashtags sample.jsonl
from my code:
@luisignaciomenendez I think @igorbrigadir means where df comes from in:
# Using the generator to create a new dataframe.
a = pd.DataFrame(list(hash_retrieve(df)),
                 columns=['hashtag', 'id'])
Is df loaded from a CSV generated with twarc2 csv?
Yes, exactly. I converted it using twarc2 csv and then loaded it with pandas.
I'm a little bit confused by your code, but I do think you've found a difference between how twarc-hashtags works and what is in the entities.hashtags column that twarc-csv generates.
It looks like twarc-csv includes not only the tweets that were collected but also tweets that those tweets reference (replies and quotes) or so called "includes".
Personally I would expect to only get hashtags for the tweets that were collected, not the tweets that were referenced. But I guess having an --all flag to get all might be appropriate?
I wonder if users of twarc-csv understand this behavior when using the data though ...
I think I found what the problem is: it's retweets. twarc-csv processes retweets so that they match what you would expect to find, using the full text of the tweet, not what the JSON actually contains. So, for a retweet in the JSON like this:
{
  "entities": {
    "hashtags": [
      {
        "start": 107,
        "end": 115,
        "tag": "EndSARS"
      }
    ]
  },
  "id": "1388203310327508995",
  "referenced_tweets": [
    {
      "type": "retweeted",
      "id": "1388174000472432650"
    }
  ],
  "text": "RT @abjghost: @imoleayomichael was abducted by DSS at 2.30am in his residence and detained for 41days over #EndSARS protest. They still wan…"
}
The retweet is truncated, so only 1 hashtag is counted by twarc-hashtags: EndSARS.
The twarc-csv code, on the other hand, will dig into the referenced tweet, 1388174000472432650, which is:
{
  "entities": {
    "urls": [
      {
        "start": 280,
        "end": 303,
        "url": "https://t.co/fDgTVvbQBZ",
        "expanded_url": "https://twitter.com/abjghost/status/1388174000472432650/photo/1",
        "display_url": "pic.twitter.com/fDgTVvbQBZ"
      }
    ],
    "mentions": [
      {
        "start": 0,
        "end": 16,
        "username": "imoleayomichael"
      }
    ],
    "hashtags": [
      {
        "start": 93,
        "end": 101,
        "tag": "EndSARS"
      },
      {
        "start": 224,
        "end": 237,
        "tag": "FreeImoleAyo"
      }
    ]
  },
  "id": "1388174000472432650",
  "in_reply_to_user_id": "927129038933626880",
  "text": "@imoleayomichael was abducted by DSS at 2.30am in his residence and detained for 41days over #EndSARS protest. They still want to convict him.\n\nImoleayo is a Programmer NOT A CRIMINAL!\n\nPls lend your voice in solidarity to \n#FreeImoleAyo\nIt could be you or me.\nPls tweet, RT, Tag https://t.co/fDgTVvbQBZ"
}
So it will count 2 hashtags.
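To make the two counts concrete, here is a rough sketch using trimmed copies of the two tweets above. It is not the actual twarc-hashtags or twarc-csv code: tags_of and tags_with_retweets_resolved are hypothetical helpers written only for this illustration, and the lookup dict stands in for however the referenced tweet is found in the real data.

```python
from collections import Counter

# Trimmed copies of the retweet and the original tweet shown above.
retweet = {
    "id": "1388203310327508995",
    "entities": {"hashtags": [{"start": 107, "end": 115, "tag": "EndSARS"}]},
    "referenced_tweets": [{"type": "retweeted", "id": "1388174000472432650"}],
}
original = {
    "id": "1388174000472432650",
    "entities": {
        "hashtags": [
            {"start": 93, "end": 101, "tag": "EndSARS"},
            {"start": 224, "end": 237, "tag": "FreeImoleAyo"},
        ]
    },
}


def tags_of(tweet):
    """Return the hashtag strings stored directly on one tweet object."""
    return [h["tag"] for h in tweet.get("entities", {}).get("hashtags", [])]


def tags_with_retweets_resolved(tweet, lookup):
    """Hypothetical helper: for a retweet, use the retweeted tweet's hashtags."""
    for ref in tweet.get("referenced_tweets", []):
        if ref["type"] == "retweeted" and ref["id"] in lookup:
            return tags_of(lookup[ref["id"]])
    return tags_of(tweet)


# Assumption: referenced tweets are available by id (here, a plain dict).
lookup = {original["id"]: original}

as_stored = Counter(tags_of(retweet))
resolved = Counter(tags_with_retweets_resolved(retweet, lookup))

print(as_stored)
print(resolved)
```

Counting from the retweet as stored sees only EndSARS, while resolving the retweet to the original also picks up FreeImoleAyo, which is the gap between the two reports.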
A second source of variation is that twarc-hashtags ignores case, while your code is case sensitive, so EndSARS and endsars will be counted separately, for example. Also, ensure_flattened(data) is meant more for handling entire responses, not small JSON objects within tweets, but since the function is robust enough to handle that, it's OK to keep using it like that. It simply does not do anything to the data, so you can leave it out and have for hashtag in data:
These aren't mistakes or bugs as such, they're just different things that we should be aware of and decide to count one way or another.
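As a small illustration of the case difference, the sketch below counts the same made-up list of tags both ways. Lowercasing with str.lower() is only an assumption about how case could be ignored; it is not taken from the twarc-hashtags source.

```python
from collections import Counter

# Made-up sample of hashtag strings, as they might appear across several tweets.
tags = ["EndSARS", "endsars", "ENDSARS", "FreeImoleAyo"]

# Case-sensitive: every spelling is its own key.
case_sensitive = Counter(tags)

# Case-insensitive: normalise before counting (assumed approach).
case_insensitive = Counter(tag.lower() for tag in tags)

print(case_sensitive)
print(case_insensitive)
```

With case-sensitive counting the three spellings of EndSARS are three separate entries with a count of 1 each; after normalising they collapse into a single entry with a count of 3.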
Personally, I'm inclined to edit twarc-hashtags to count the retweeted hashtags the same way as twarc-csv, and keep it ignoring case, same as the Twitter UI. This does mean adding a bit more code, but I think it's less surprising to users, because if someone were to manually verify a count, the numbers should match.
I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies? That was the source of one discrepancy at least. I thought that twarc-hashtags was counting retweets. If that's not the case, it definitely feels like a bug in twarc-hashtags. I'm not sure it makes sense to count hashtags in tweets that are being replied to, quoted etc. though -- unless asked to? I might need to think about this. I guess as a user of a hashtag report I'd want to see counts for tweets that I collected, not tweets related to the tweets I collected, but where one tweet begins and ends is a fuzzy area.
I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies?
It used to, but by default in the latest version, no. Just the original tweets merged into the retweets.
Also agree with not counting them from all referenced tweets like replies. Quotes are different though - the quote tweet itself yes, but the quoted tweet? I'm not sure. Right now it will count the quote itself but not the quoted tweet. Still on the fence here too. I guess making command line switches for this will work.
Some of this overlaps with what I was planning with https://github.com/DocNow/twarc-statistics/issues/2 and with https://github.com/DocNow/twarc/issues/562
@igorbrigadir OK, thanks! I'll have to double check. I just got a new computer and am using the latest twarc-csv. I thought I noticed it pulling in hashtags from the included conversation_id after flattening.
I have been experimenting with the plugin on some datasets and there appears to be an inconsistency in the counting. I am not sure if tweets that contain multiple hashtags are also taken into account.
Here is the code I use (extracting them from the entities metadata):
I get different counts when I apply twarc2 hashtags sample.jsonl (I just got a random sample of tweets). I usually get hashtags with higher counts compared to the twarc command.