digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0
367 stars, 114 forks

no_mentions, no_tweets discrepancy #449

Closed eeftychiou closed 2 years ago

eeftychiou commented 2 years ago

Describe the bug: The mention graph's no_tweets value seems to be wrong.

To Reproduce Steps to reproduce the behavior:

  1. The attached CSV exported from DMI-TCAT contains 4 tweets containing @garvtoilets: garvtoilets----------no--fullExport--481dfcf681.csv

  2. These tweets result in the following mentions table:

     693   1391641638368432131  2021-05-10 06:29:56  SaathiPads   255110060   GarvToilets  73147209
     3188  1389886062114209797  2021-05-05 10:13:54  GarvToilets  73147209    GarvToilets  73147209
     3189  1389886062114209797  2021-05-05 10:13:54  GarvToilets  73147209    GarvToilets  73147209
     3190  1389886062114209797  2021-05-05 10:13:54  GarvToilets  73147209    SaathiPads   255110060
     3191  1389886062114209797  2021-05-05 10:13:54  GarvToilets  73147209    BempuHealth  2561150652
     3378  1389694893530943491  2021-05-04 21:34:16  ThomasGass   2370597494  GarvToilets  73147209
     3697  1389542622415499265  2021-05-04 11:29:12  UNDPGeneva   971765917   GarvToilets  73147209

  3. The SQL join producing the mention graph results in the following:

     From         To
     SaathiPads   GarvToilets
     GarvToilets  GarvToilets
     GarvToilets  GarvToilets
     GarvToilets  SaathiPads
     GarvToilets  BempuHealth
     ThomasGass   GarvToilets
     UNDPGeneva   GarvToilets

  4. Mention graph export results in no_tweets=4, no_mentions=5

Expected behavior: The user garvtoilets retweeted the original tweet from UNDPGeneva just once, so its tweet count should be 1, not 4. Additionally, the mention count should be 4 instead of 5.

dale-wahl commented 2 years ago

Hello eeftychiou, would you mind telling me how you collected your data? Did you use the capture function or did you import your own data via another method?

Correct me if I am wrong. It looks like you have filtered your data to include only tweets containing the user @garvtoilets and created the csv via "Export all tweets from selection". That does provide you with 4 tweets: the original tweet by UNDPGeneva and three retweets (from ThomasGass, SaathiPads, and GarvToilets themself). So I believe 4 tweets is the expected response here.

The mentions appear off to me. UNDPGeneva mentions GarvToilets and each retweeter also mentions GarvToilets. It looks like GarvToilets is counted twice when he retweets a tweet that already mentions him. I would thus conclude that GarvToilets is mentioned 4 times (not 5). I think the count of 5 is erroneous, which is why I ask how you collected your data.

Let me know if that evaluation is consistent with your experience.

eeftychiou commented 2 years ago

Hi Dale,

The data was gathered using 4cat, then exported and imported in tcat using the import function. I then used tcat to filter and export all of the tweets containing the text garvtoilets resulting in the attached CSV.

The rest of the data presented above is taken from the TCAT database using SQL queries. For point 2, I selected the rows of the mentions table where the from or to user is garvtoilets. For point 3, I used the join statement that the mention graph module uses to build the network.

The no_tweets is also wrong given the fact that garvtoilets only tweeted once. Indeed the root of the problem must be in the way the tweets are imported in tcat from the 4cat json export. I will try later today to re-import the json file with only the garvtoilets tweets to see if the problem can be reproduced.

dale-wahl commented 2 years ago

If you would like to filter specifically for tweets from garvtoilets, you should use the "From user" criterion. There are 4 tweets that contain the text "garvtoilets" because the other three tweets list that username.

I believe I have identified the bug causing one more mention than expected. I will look at a patch for that soon. Thank you for identifying it. Re-importing the data will still produce two user mentions where garvtoilets mentions themselves.

eeftychiou commented 2 years ago

Hi Dale,

I am attaching the json containing all the tweets relevant to the issue. test.json.csv

I imported it in TCAT and exported the graph network, replicating the issue. The resulting GDF is pasted verbatim below. Besides the mention count, which we agree should be 4, I believe the tweet count should be 1, as garvtoilets only retweeted the original tweet by undpgeneva.

Additionally, the edge weights in the graph export also seem to be wrong. The tweets involved are:

@undpgeneva tweeted mentioning @GarvToilets @SaathiPads and @BempuHealth
@GarvToilets and @SaathiPads retweeted the original tweet
@thomasgass retweeted the tweet

Here is the resulting GDF:

nodedef>name VARCHAR,label VARCHAR,no_tweets INT,no_mentions INT
0,undpgeneva,3,0
1,garvtoilets,4,5
2,saathipads,4,5
3,bempuhealth,0,4
4,thomasgass,4,1
edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE,directed BOOLEAN
0,1,1,true
0,2,1,true
0,3,1,true
4,4,1,true
4,1,1,true
4,2,1,true
4,3,1,true
1,1,2,true
1,2,1,true
1,3,1,true
2,2,2,true
2,1,1,true
2,3,1,true

Why does @thomasgass have a self mention on the graph export? @Garvtoilets and @Saathipads correctly have self mentions but with a weight of 2? Why is that?

Thanks, Efty

dale-wahl commented 2 years ago

@eeftychiou when you are referring to no_tweets, what exactly are you referring to? As reported by the "Tweet stats" analysis? Retweets are tweets as far as TCAT is concerned. You can filter them out if you wish. Or you can filter using "From user:" by the user you are interested in; filtering that way should return 1 tweet in your example of 4 tweets.

@ErikBorra per a note in the code here, it appears that we are attempting to add the "retweeting user" to the mentions table. I think the note is wrong and we actually want to add the retweetED user to the mentions table. Can you confirm the desired effect? My guess is that this is only occurring with the 4CAT imports and is a matter of misinterpreting that note when designing the JSON from 4CAT.

eeftychiou commented 2 years ago

@dale-wahl no_tweets is a field in the mention graph export file (GDF), which can be imported into Gephi. It is the file you get when you go to Analysis -> Networks -> Social graph by mentions.

The content of the file is the one I pasted in the previous post. See the nodedef line below, which contains the no_tweets field:

nodedef>name VARCHAR,label VARCHAR,no_tweets INT,no_mentions INT

dale-wahl commented 2 years ago

Thank you, @eeftychiou! That was very helpful. Yes, I believe you are correct. It is not the number of tweets, but the number of times they mentioned other users.

no_tweets -> counts how many times they mention other users
no_mentions -> counts how many times they are mentioned by other users
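To make the two definitions concrete, here is a minimal Python sketch (not TCAT's actual PHP code) that tallies both counts from the From/To pairs listed in step 3 of the original report. The function name `count_mentions` is hypothetical. Because those pairs were filtered to rows involving garvtoilets, only the counts for garvtoilets are complete.

```python
from collections import Counter

# (from_user, to_user) pairs copied from the join output in the bug report.
mention_pairs = [
    ("SaathiPads", "GarvToilets"),
    ("GarvToilets", "GarvToilets"),
    ("GarvToilets", "GarvToilets"),
    ("GarvToilets", "SaathiPads"),
    ("GarvToilets", "BempuHealth"),
    ("ThomasGass", "GarvToilets"),
    ("UNDPGeneva", "GarvToilets"),
]


def count_mentions(pairs):
    """Tally rows per source user and per target user (hypothetical helper).

    Returns two Counters keyed by lowercase user name:
    how often a user mentions others, and how often a user is mentioned.
    """
    mentions_made = Counter()
    mentions_received = Counter()
    for from_user, to_user in pairs:
        mentions_made[from_user.lower()] += 1
        mentions_received[to_user.lower()] += 1
    return mentions_made, mentions_received


mentions_made, mentions_received = count_mentions(mention_pairs)
print(mentions_made["garvtoilets"], mentions_received["garvtoilets"])  # prints: 4 5
```

The output matches the reported no_tweets=4 and no_mentions=5 for garvtoilets, which is consistent with both fields counting rows of the mentions table rather than tweets.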

@ErikBorra do you know what the intended behavior would be? Does it make more sense to clarify by renaming no_tweets (which would ensure the analysis is comparable with prior analyses using this method) or to fix it so that it actually counts the number of tweets by a user in a dataset?

ErikBorra commented 2 years ago

no_tweets is the number of tweets in the data set; in this case a subset of the query bin that holds only tweets that mention somebody.

A mention table should always have two users: a source node (who is (re-)tweeting) and a target node (who is mentioned).

dale-wahl commented 2 years ago

This commit clarifies and renames no_tweets to mentioned_others and no_mentions to mentioned_by_others.

Changes in progress to 4CAT export to remove self mentions.

eeftychiou commented 2 years ago

It does not make sense for no_tweets to be "the number of tweets in the data set; in this case a subset of the query bin that holds only tweets that mention somebody", because that is a feature that can be determined from the network. There is no value in having that field in the GDF. From my understanding, what you are describing is the in-degree and out-degree of a network node.

On the other hand, the actual number of tweets by the user within the specified subset of the query bin is more appropriate, as it is information that cannot be determined from the network topology. For example, a Twitter user may tweet 100 times without mentioning anyone (hence no edges from their node to other nodes) and tweet once mentioning one other user. Their total tweets would be 101, their in-degree should be 0 and their out-degree should be 1.
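The example above can be checked with a short Python sketch. The data, the user names `alice` and `bob`, and the helper `user_stats` are all hypothetical; degree is taken here as the number of distinct neighbours, which is one common convention.

```python
# Hypothetical data: 100 tweets without mentions plus 1 tweet mentioning one user.
tweets = [{"from_user": "alice", "mentions": []} for _ in range(100)]
tweets.append({"from_user": "alice", "mentions": ["bob"]})


def user_stats(tweets, user):
    """Return (total_tweets, out_degree, in_degree) for one user."""
    own = [t for t in tweets if t["from_user"] == user]
    total_tweets = len(own)
    # Out-degree: distinct users this user mentions.
    out_degree = len({m for t in own for m in t["mentions"]})
    # In-degree: distinct users who mention this user.
    in_degree = len({t["from_user"] for t in tweets if user in t["mentions"]})
    return total_tweets, out_degree, in_degree


print(user_stats(tweets, "alice"))  # prints: (101, 1, 0)
```

The tweet total of 101 cannot be recovered from the single edge alice -> bob, which is the point being made: a per-user tweet count adds information that the graph topology does not carry.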

dale-wahl commented 2 years ago

Fixed 4CAT export issue causing original author to "mention" themselves in retweets in TCAT export.

@eeftychiou You will need to update 4CAT and re-export your data. Or otherwise remove the author from the retweet mentions object in the ndjson file. You can also get the number of tweets per user in the "Tweet statistics and activity metrics" -> "User stats (individual)" analysis module if you are still interested in that.

So far as I can tell, the no_tweets field has existed and presented the same data (the number of times a user mentioned another user) since at least 2013 (assuming the database was structured the same way), so good catch on the mislabeled field. If adding the number of tweets to the mention_graph is something users are interested in, you could add a request for that feature. Changing what an existing field actually does may be undesirable for those who wish to reproduce prior results.

eeftychiou commented 2 years ago

@ErikBorra can you please confirm the changes made by @dale-wahl ?

In your post you say that no_tweets is the number of tweets in the data set mentioning others. If this is the case, then the expected value for garvtoilets would be 1, since he (re)tweeted only once, while the GDF contains a value of 4, which is wrong.

The changes made by @dale-wahl essentially change this to number of mentions from a specific user. So a single tweet containing 10 mentions would imply no_tweets=10. Is this the expected/desired behavior?

ErikBorra commented 2 years ago

I was mistaken. I looked into the code together with Dale. no_tweets should indeed be called mentioned_others

Intuitively it makes sense if you consider the network graph, where the user with a tweet containing 10 mentions will be a node with 10 directed edges towards the 10 other users mentioned in that tweet.

eeftychiou commented 2 years ago

Thanks Erik for the clarification. If that is correct, then there is no use in including that information, as it is already part of the network topology and can be determined from the network (in-degree, out-degree, etc.).

Additionally, what is the use of the join in the code, given that the information is already available in the mentions table?

    $sql = "SELECT m.from_user_name COLLATE $collation as from_user_name, m.to_user COLLATE $collation as to_user FROM " . $esc['mysql']['dataset'] . "_mentions m, " . $esc['mysql']['dataset'] . "_tweets t ";
    $where = "m.tweet_id = t.id AND ";
    $sql .= sqlSubset($where);
    $sql .= " LIMIT " . $cur . "," . $numresults;

On the other hand, if no_tweets were in fact the number of tweets with mentions, it would make more sense, since that is information that cannot be determined from the network.

Anyway I will leave it at that.

eeftychiou commented 2 years ago

Sorry to come back to this but I think we have not completely resolved this. Ignoring the issue of mentioned and no_tweets there are other issues concerning the final mention graph exported by TCAT. I did a complete reinstall of 4cat and updated tcat with the latest commits.

As stated before, the following tweets are taking place:

@undpgeneva tweeted mentioning @GarvToilets @SaathiPads and @BempuHealth --> Red Arrows
@GarvToilets retweeted the original tweet --> Blue Arrows
@SaathiPads retweeted the original tweet --> Green Arrows
@thomasgass retweeted the tweet --> Yellow Arrows

Below you can see the resulting mention graph: [image: Blank diagram]

And here is the resulting mention graph export from TCAT:

nodedef>name VARCHAR,label VARCHAR,mentioned_others INT,mentioned_by_others INT
0,undpgeneva,3,0
1,garvtoilets,4,5
2,saathipads,4,5
3,bempuhealth,0,4
4,thomasgass,4,1
edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE,directed BOOLEAN
0,1,1,true
0,2,1,true
0,3,1,true
4,4,1,true
4,1,1,true
4,2,1,true
4,3,1,true
1,1,2,true
1,2,1,true
1,3,1,true
2,2,2,true
2,1,1,true
2,3,1,true

The tcat export has the following discrepancies

  1. @undpgeneva node definition has 0 mentioned by others even though he was retweeted three times (garvtoilets,saathipads,thomasgass)
  2. @garvtoilets was mentioned by others only 4 times, not 5 as in the node definition. Additionally, it has a self-mention edge with a weight of 2, which is evidently wrong.
  3. @saathipads has exactly the same problem as @garvtoilets.
  4. @thomasgass retweeted the undpgeneva tweet and was not mentioned by the other users. So the mentioned_by_others is wrong.
  5. @thomasgass has a self mention edge 4,4,1 which is incorrect.

Here is the json 4cat export imported into TCAT

garvV3.json.txt

Here is the resulting mention table in the database which is causing the problem:

1389542622415499265  UNDPGeneva   GarvToilets
1389542622415499265  UNDPGeneva   SaathiPads
1389542622415499265  UNDPGeneva   BempuHealth
1389694893530943491  ThomasGass   ThomasGass
1389694893530943491  ThomasGass   GarvToilets
1389694893530943491  ThomasGass   SaathiPads
1389694893530943491  ThomasGass   BempuHealth
1389886062114209797  GarvToilets  GarvToilets
1389886062114209797  GarvToilets  GarvToilets
1389886062114209797  GarvToilets  SaathiPads
1389886062114209797  GarvToilets  BempuHealth
1391641638368432131  SaathiPads   SaathiPads
1391641638368432131  SaathiPads   GarvToilets
1391641638368432131  SaathiPads   SaathiPads
1391641638368432131  SaathiPads   BempuHealth
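As a cross-check, the mention table above can be collapsed into weighted edges with a few lines of Python. This is only a sketch under the assumption that edge weight is the number of rows sharing the same (from, to) pair; it is not TCAT's PHP implementation, and `build_edges` is a hypothetical name.

```python
from collections import Counter

# (tweet_id, from_user, to_user) rows copied from the mention table above.
mention_rows = [
    ("1389542622415499265", "UNDPGeneva", "GarvToilets"),
    ("1389542622415499265", "UNDPGeneva", "SaathiPads"),
    ("1389542622415499265", "UNDPGeneva", "BempuHealth"),
    ("1389694893530943491", "ThomasGass", "ThomasGass"),
    ("1389694893530943491", "ThomasGass", "GarvToilets"),
    ("1389694893530943491", "ThomasGass", "SaathiPads"),
    ("1389694893530943491", "ThomasGass", "BempuHealth"),
    ("1389886062114209797", "GarvToilets", "GarvToilets"),
    ("1389886062114209797", "GarvToilets", "GarvToilets"),
    ("1389886062114209797", "GarvToilets", "SaathiPads"),
    ("1389886062114209797", "GarvToilets", "BempuHealth"),
    ("1391641638368432131", "SaathiPads", "SaathiPads"),
    ("1391641638368432131", "SaathiPads", "GarvToilets"),
    ("1391641638368432131", "SaathiPads", "SaathiPads"),
    ("1391641638368432131", "SaathiPads", "BempuHealth"),
]


def build_edges(rows):
    """Count rows per directed (from, to) pair; the count is the edge weight."""
    return Counter((src.lower(), dst.lower()) for _, src, dst in rows)


edges = build_edges(mention_rows)
# Self-loops are edges whose source and target are the same user.
self_loops = {src: w for (src, dst), w in edges.items() if src == dst}
print(len(edges))   # prints: 13
print(self_loops)   # prints: {'thomasgass': 1, 'garvtoilets': 2, 'saathipads': 2}
```

The 13 distinct edges and the three self-loops (weights 1, 2 and 2) agree with the exported GDF, which suggests the export faithfully reflects the table and that the unexpected values come from the rows stored at import time.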

To conclude it seems the changes in 4cat did not have the desired effect.

dale-wahl commented 2 years ago

I took your file (renamed it to end with .json) and used the import/import-jsondump.php to upload it to a TCAT instance. Here are my results:

nodedef>name VARCHAR,label VARCHAR,mentioned_others INT,mentioned_by_others INT
0,saathipads,4,4
1,undpgeneva,3,3
2,garvtoilets,4,4
3,bempuhealth,0,4
4,thomasgass,4,0
edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE,directed BOOLEAN
0,1,1,true
0,2,1,true
0,0,1,true
0,3,1,true
2,1,1,true
2,2,1,true
2,0,1,true
2,3,1,true
4,1,1,true
4,2,1,true
4,0,1,true
4,3,1,true
1,2,1,true
1,0,1,true
1,3,1,true

My guess is that you need to remove the old bin or import your JSON as a new bin in TCAT.

It is also worth noting that the Twitter API does not consider mentions in a retweet as mentioned by the retweeting user. E.g., if I retweet something like "RT @eeftychiou: thank you for helping me @dale-wahl", Twitter only lists you (@eeftychiou) as a mention by me (@dale-wahl). That is how 4CAT handles mentions if you were to use the "Custom network" in 4CAT and choose the author and mention columns. TCAT does include all mentions (both @eeftychiou and @dale-wahl in the previously mentioned example), which I think is what you are looking for when you say "desired effect".
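The difference between the two conventions can be illustrated with a small Python sketch. It parses the tweet text with regular expressions purely for illustration; this is an assumption, not how the Twitter API, 4CAT or TCAT derive mentions. Both function names are hypothetical, and the pattern allows hyphens only so that the GitHub-style name in the example matches.

```python
import re

# Illustrative handle pattern; the hyphen is allowed only for the example name.
HANDLE = re.compile(r"@([A-Za-z0-9_-]+)")
RETWEET_PREFIX = re.compile(r"RT @([A-Za-z0-9_-]+):")


def mentions_retweeted_user_only(text):
    """For a retweet, attribute only the retweeted user to the retweeter."""
    match = RETWEET_PREFIX.match(text)
    if match:
        return [match.group(1)]
    return HANDLE.findall(text)


def mentions_all(text):
    """Attribute every handle in the text to the (re)tweeting user."""
    return HANDLE.findall(text)


text = "RT @eeftychiou: thank you for helping me @dale-wahl"
print(mentions_retweeted_user_only(text))  # prints: ['eeftychiou']
print(mentions_all(text))                  # prints: ['eeftychiou', 'dale-wahl']
```

Under the second convention a retweeter who is named in the original tweet ends up with a self-mention, which is why self-loops can legitimately appear in the TCAT mention graph.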

eeftychiou commented 2 years ago

Thank you very much @dale-wahl . You are absolutely right, I in fact loaded an old json file into a new bin, hence the incorrect results.