Miss-matching counts - Githubissues

luisignaciomenendez commented 2 years ago

I have been experimenting with the plugin on some datasets and there appears to be an inconsistency in the counting. I am not sure whether tweets that contain multiple hashtags are also taken into account.

Here is the code I use (extracting them from the entities metadata):

import json

import pandas as pd
from twarc.expansions import ensure_flattened


def hash_retrieve(df):
    """
    df : dataframe of tweets
    Description:
        Takes a df of tweets obtained via twarc and returns a generator
        that yields one [hashtag, tweet id] pair per hashtag.
    """
    for line, tweet_id in zip(df['entities.hashtags'], df['id']):
        # Tweets without hashtags have an empty (NaN) cell.
        if pd.isna(line):
            continue
        data = json.loads(line.strip())
        for hashtag in ensure_flattened(data):
            yield [hashtag['tag'], tweet_id]

# Using the generator to create a new dataframe.
a = pd.DataFrame(list(hash_retrieve(df)),
                 columns=['hashtag', 'id'])

a.hashtag.value_counts()

I get different counts when I run twarc2 hashtags sample.jsonl (sample.jsonl is just a random sample of tweets). My code usually gives hashtags with higher counts than the twarc command.

igorbrigadir commented 2 years ago

Do you have a sample of what your dataframe contains? How is it generated in the first place? It's hard to say or compare it to the code otherwise.

luisignaciomenendez commented 2 years ago

Sure, I tried with a random sample using: twarc2 sample sample.jsonl (I have also done some extra trials, but this is the most immediate one). I know this is hardly replicable, as it's using a live stream of tweets, but I will try to attach/send you the original file that I have.

Here are the results: (for twarc only those that appear with a count=2)

twarc2 hashtags sample.jsonl

Screenshot 2022-01-25 at 12 55 10

from my code: Screenshot 2022-01-25 at 12 55 37

sample.jsonl.zip

edsu commented 2 years ago

@luisignaciomenendez I think @igorbrigadir means where df comes from in:

# Using the generator to create a new dataframe.
a = pd.DataFrame(list(hash_retrieve(df)),
                 columns=['hashtag', 'id'])

Is df loaded from a CSV generated with twarc2 csv?

luisignaciomenendez commented 2 years ago

Yes, exactly. I converted it using twarc2 and then loaded it with pandas.

edsu commented 2 years ago

I'm a little bit confused by your code, but I do think you've found a difference between how twarc-hashtags works and what is in the entities.hashtags column that twarc-csv generates.

It looks like twarc-csv includes not only the tweets that were collected but also the tweets that those tweets reference (replies and quotes), the so-called "includes".

Personally I would expect to only get hashtags for the tweets that were collected, not the tweets that were referenced. But I guess having an --all flag to get all might be appropriate?

I wonder if users of twarc-csv understand this behavior when using the data though ...

igorbrigadir commented 2 years ago

I think I found what the problem is: it's retweets. twarc-csv processes retweets so that they match what you would expect to find, using the full text of the original tweet rather than what the retweet's JSON actually contains. So, for a retweet in the JSON like this:

{
  "entities": {
    "hashtags": [
      {
        "start": 107,
        "end": 115,
        "tag": "EndSARS"
      }
    ]
  },
  "id": "1388203310327508995",
  "referenced_tweets": [
    {
      "type": "retweeted",
      "id": "1388174000472432650"
    }
  ],
  "text": "RT @abjghost: @imoleayomichael was abducted by DSS at 2.30am in his residence and detained for 41days over #EndSARS protest. They still wan…"
}

The retweet is truncated, so only 1 hashtag is counted by twarc-hashtags: EndSARS

The twarc-csv code, by contrast, will dig into the referenced tweet, 1388174000472432650, which is:

{
  "entities": {
    "urls": [
      {
        "start": 280,
        "end": 303,
        "url": "https://t.co/fDgTVvbQBZ",
        "expanded_url": "https://twitter.com/abjghost/status/1388174000472432650/photo/1",
        "display_url": "pic.twitter.com/fDgTVvbQBZ"
      }
    ],
    "mentions": [
      {
        "start": 0,
        "end": 16,
        "username": "imoleayomichael"
      }
    ],
    "hashtags": [
      {
        "start": 93,
        "end": 101,
        "tag": "EndSARS"
      },
      {
        "start": 224,
        "end": 237,
        "tag": "FreeImoleAyo"
      }
    ]
  },
  "id": "1388174000472432650",
  "in_reply_to_user_id": "927129038933626880",
  "text": "@imoleayomichael was abducted by DSS at 2.30am in his residence and detained for 41days over #EndSARS protest. They still want to convict him.\n\nImoleayo is a Programmer NOT A CRIMINAL!\n\nPls lend your voice in solidarity to \n#FreeImoleAyo\nIt could be you or me.\nPls tweet, RT, Tag https://t.co/fDgTVvbQBZ"
}

So it will count 2 hashtags.
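The difference can be reproduced on the two objects above. This is only a minimal sketch, not the actual twarc-hashtags or twarc-csv code: the helper names own_hashtags and resolved_hashtags and the originals_by_id lookup are made up for illustration, and the tweets are trimmed to the fields that matter here.

```python
def own_hashtags(tweet):
    """Hashtags listed in the tweet's own entities (truncated for retweets)."""
    return [h["tag"] for h in tweet.get("entities", {}).get("hashtags", [])]


def resolved_hashtags(tweet, originals_by_id):
    """For a retweet, use the hashtags of the retweeted original if we have it."""
    for ref in tweet.get("referenced_tweets", []):
        if ref["type"] == "retweeted" and ref["id"] in originals_by_id:
            return own_hashtags(originals_by_id[ref["id"]])
    return own_hashtags(tweet)


# The retweet and the original from the example, reduced to the relevant fields.
retweet = {
    "id": "1388203310327508995",
    "entities": {"hashtags": [{"start": 107, "end": 115, "tag": "EndSARS"}]},
    "referenced_tweets": [{"type": "retweeted", "id": "1388174000472432650"}],
}
original = {
    "id": "1388174000472432650",
    "entities": {
        "hashtags": [
            {"start": 93, "end": 101, "tag": "EndSARS"},
            {"start": 224, "end": 237, "tag": "FreeImoleAyo"},
        ]
    },
}
originals_by_id = {original["id"]: original}

print(own_hashtags(retweet))                        # ['EndSARS']
print(resolved_hashtags(retweet, originals_by_id))  # ['EndSARS', 'FreeImoleAyo']
```

Counting with the first function gives one hashtag for this retweet; counting with the second gives two, which is the kind of gap reported in this issue.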

A second source of variation is that twarc-hashtags ignores case, while your code is case sensitive, so EndSARS and endsars will be counted separately, for example.

Also, ensure_flattened(data) is meant for handling entire responses, not small JSON objects within tweets, but since the function is robust enough to handle that, it's OK to keep using it like that. Here it simply does not do anything to the data, so you can leave it out and write for hashtag in data: instead.
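The case difference mentioned above can be illustrated with a small sketch. The tag list is made up, and folding to lowercase is just one assumed way of ignoring case, not necessarily how twarc-hashtags does it internally.

```python
from collections import Counter

# Made-up sample of hashtags as they might appear in different tweets.
tags = ["EndSARS", "endsars", "FreeImoleAyo", "ENDSARS"]

# Case-sensitive: every spelling is its own key.
case_sensitive = Counter(tags)

# Case-insensitive: fold to lowercase before counting.
case_insensitive = Counter(tag.lower() for tag in tags)

print(case_sensitive["EndSARS"])    # 1
print(case_insensitive["endsars"])  # 3
```

With the dataframe from the original post, a.hashtag.str.lower().value_counts() would apply the same folding before counting.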

These aren't mistakes or bugs as such, they're just different things that we should be aware of and decide to count one way or another.

Personally, I'm inclined to edit twarc-hashtags to count the retweeted hashtags the same way as twarc-csv, and keep it ignoring case, same as the Twitter UI. This does mean adding a bit more code, but I think it's less surprising to users, because if someone were to manually verify a count, the numbers should match.

edsu commented 2 years ago

I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies? That was the source of one discrepancy at least. I thought that twarc-hashtags was counting retweets. If that's not the case, it definitely feels like a bug in twarc-hashtags. I'm not sure it makes sense to count hashtags in tweets that are being replied to, quoted etc. though -- unless asked to? I might need to think about this. I guess as a user of a hashtag report I'd want to see counts for tweets that I collected, not tweets related to the tweets I collected, but where one tweet begins and ends is a fuzzy area.

igorbrigadir commented 2 years ago

> I think that twarc-csv is including hashtags from tweets that are referenced using conversation_id and also replies?

It used to, but by default in the latest version, no. Just the original tweets merged into the retweets.

Also agree with not counting them from all referenced tweets like replies. Quotes are different though - the quote tweet itself yes, but the quoted tweet? I'm not sure. Right now it will count the quote itself but not the quoted tweet. Still on the fence here too. I guess making command line switches for this will work.
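To make the idea of switches concrete, here is a rough sketch of how such options could behave. Everything in it is hypothetical: count_hashtags, tags_of, include_retweeted and include_quoted are illustrative names, not existing twarc-hashtags or twarc-csv options, and the sample tweets are made up.

```python
from collections import Counter


def tags_of(tweet):
    """Lowercased hashtags from a tweet's own entities."""
    return [h["tag"].lower() for h in tweet.get("entities", {}).get("hashtags", [])]


def count_hashtags(tweets, referenced_by_id,
                   include_retweeted=True, include_quoted=False):
    """Count hashtags, optionally following references (hypothetical switches)."""
    counts = Counter()
    for tweet in tweets:
        tags = tags_of(tweet)
        for ref in tweet.get("referenced_tweets", []):
            target = referenced_by_id.get(ref["id"])
            if target is None:
                continue
            if ref["type"] == "retweeted" and include_retweeted:
                # A retweet's own text is truncated, so replace its tags.
                tags = tags_of(target)
            elif ref["type"] == "quoted" and include_quoted:
                # A quote has its own text, so add the quoted tweet's tags.
                tags = tags + tags_of(target)
        counts.update(tags)
    return counts


# Made-up data: one quote tweet and one truncated retweet.
quoted = {"id": "1", "entities": {"hashtags": [{"tag": "B"}]}}
original = {"id": "2", "entities": {"hashtags": [{"tag": "C"}, {"tag": "D"}]}}
quote = {"id": "3", "entities": {"hashtags": [{"tag": "A"}]},
         "referenced_tweets": [{"type": "quoted", "id": "1"}]}
retweet = {"id": "4", "entities": {"hashtags": [{"tag": "C"}]},
           "referenced_tweets": [{"type": "retweeted", "id": "2"}]}

collected = [quote, retweet]
referenced_by_id = {"1": quoted, "2": original}

print(count_hashtags(collected, referenced_by_id))
print(count_hashtags(collected, referenced_by_id, include_quoted=True))
```

The defaults mirror the behaviour described above: retweets take the hashtags of the original tweet, and quoted tweets are left out unless asked for.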

Some of this overlaps with what I was planning with https://github.com/DocNow/twarc-statistics/issues/2 and with https://github.com/DocNow/twarc/issues/562

edsu commented 2 years ago

@igorbrigadir ok, thanks! I'll have to double check. I just got a new computer and am using the latest twarc-csv. I thought I noticed it pulling in hashtags from the included conversation_id after flattening.

DocNow / twarc-hashtags

Miss-matching counts #1