Compute community type - Githubissues

amindadgar commented 1 year ago

We have to compute the community types based on 4 metrics. The metrics are

[ ] Network Density
[ ] Network Modularity
[ ] Network Centralization
[ ] Isolation Fraction

Some definitions are

Density: For an undirected network (an edge from A to B, from B to A, or both counts as a single edge): 2 * num_edges / (num_nodes * (num_nodes - 1))
Modularity: We first detect communities and then calculate the strength of the division into those communities (modularity)
Centralization: “The degree to which the centrality of the most central point exceeds the centrality of all other points” (Freeman, 1979, p. 227).
Isolates: The number of nodes with degree 0 in the degree centrality output
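
As a rough illustration of the four definitions, here is a minimal pure-Python sketch. It is not the project's implementation. The function names are hypothetical, density uses the standard undirected form 2 * num_edges / (num_nodes * (num_nodes - 1)), centralization is taken to be Freeman's degree centralization, the isolation fraction is assumed to be isolates divided by total nodes, and the community partition for modularity is assumed to come from a separate detection step.

```python
def build_adjacency(nodes, edges):
    """Undirected adjacency sets: A->B and B->A collapse into one edge, self-loops are dropped."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def density(adj):
    """Share of possible undirected edges that are present."""
    n = len(adj)
    if n < 2:
        return 0.0
    num_edges = sum(len(nb) for nb in adj.values()) / 2
    return 2 * num_edges / (n * (n - 1))


def degree_centralization(adj):
    """Freeman degree centralization: sum(max_deg - deg) / ((n - 1) * (n - 2))."""
    n = len(adj)
    if n < 3:
        return 0.0
    degrees = [len(nb) for nb in adj.values()]
    max_deg = max(degrees)
    return sum(max_deg - d for d in degrees) / ((n - 1) * (n - 2))


def isolation_fraction(adj):
    """Assumption: number of degree-0 nodes divided by the number of nodes."""
    if not adj:
        return 0.0
    return sum(1 for nb in adj.values() if not nb) / len(adj)


def modularity(adj, communities):
    """Newman modularity of a given partition (community detection happens elsewhere)."""
    m = sum(len(nb) for nb in adj.values()) / 2
    if m == 0:
        return 0.0
    q = 0.0
    for community in communities:
        members = set(community)
        # Each internal edge is seen from both endpoints, hence the division by 2.
        internal = sum(1 for a in members for b in adj[a] if b in members) / 2
        degree_sum = sum(len(adj[a]) for a in members)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q
```

A star network (one hub connected to every other node) gives the maximum centralization of 1.0 under this definition, which matches the intuition behind the "Audience" type.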

How we could compute community type is described in the pseudo-code below

If Centralization is High:
    Then the community type is "Audience"
else:
    If Density is High:
        If Modularity is High:
            Then the community type is "Polarized Community"
        else:
            Then the community type is "Tight Community"
    else:
        If Isolation Fraction is Low:
            Then the community type is "Healthy Community"
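
The pseudo-code above could be sketched in Python roughly as follows. The function name and the single `threshold` that separates "High" from "Low" are assumptions made only for illustration; the thread does not define any cut-off values. The branch the pseudo-code leaves open (low density with a high isolation fraction) returns `None` here.

```python
from typing import Optional


def classify_community(
    centralization: float,
    density: float,
    modularity: float,
    isolation_fraction: float,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the thread
) -> Optional[str]:
    """Map the four metrics to a community type, following the pseudo-code."""

    def is_high(value: float) -> bool:
        return value >= threshold

    if is_high(centralization):
        return "Audience"
    if is_high(density):
        if is_high(modularity):
            return "Polarized Community"
        return "Tight Community"
    if not is_high(isolation_fraction):
        return "Healthy Community"
    # Low density and high isolation fraction: not covered by the pseudo-code.
    return None
```

In practice each metric would probably need its own threshold, since the four values are not on comparable scales.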

Question 1: What's the community type in the case of Low Network Density and High Isolation Fraction?
Question 2: For each user, the community is defined by the people the user has interacted with. Is this right?

katerinabc commented 1 year ago

Q1: You could call communities that are low on network density and high/low on isolation fraction a Fragmented Community.

Source of classification: Check figure 1 https://journals.sagepub.com/doi/full/10.1177/2056305117691545

amindadgar commented 1 year ago

Thanks for the answer, I'm guessing the article has defined different terms from our document. I will implement it for now and assign a value for each type. Then we could figure out what we want to show to users.

The terms we defined were Audience, Polarized Community, Tight Community, and Healthy Community.

amindadgar commented 1 year ago

More details for the question of the definition of community on Twitter:

1. The community center is the user (a TwitterAccount node) itself
2. The tweets that the user made are included in the community (Tweet nodes)
3. The tweets that the user commented on are included in the community
4. The retweets of the user's tweets
5. The quotes of the user's tweets
6. The replies to the user's tweets
7. The comments made on the user's tweets
8. The users that did the actions of 3 to 7 (posting the tweet, doing the retweet, quoting, replying, and commenting)
9. The users who liked the user's tweets (not feasible to count for now as we don't have likes in our test data)

Please double-check whether these are right or whether we need to include more, @TjitsevdM

Update: Reading the original article gives some insight into the network of information flow, but it doesn't fully define a community the way a guild does for us in Discord. So, as in item 1, I'm guessing we should take the user as the definition of the community, with items 3 to 7 defining the information flow of that community. (Link to the part of the original article that defines information flow: https://journals.sagepub.com/doi/full/10.1177/2056305117691545#:~:text=connections%20among%20twitter%20users%20define%20the%20boundaries%20of%20information%20flow.%20as%20individuals%20and%20organizations%20mention%2C%20retweet%2C%20and%20reply%2C%20they%20create%20networks%20of%20information%20flow.)

TjitsevdM commented 1 year ago

We define community in a different way compared to the paper. We only use the community type definitions from the paper but construct the network in a different way.

Originally, the community consisted of all the tweets from the community center as defined in 1, plus:
A. All the tweets from accounts that replied to, quoted, retweeted, liked or mentioned (tweets from) the main account (main account as defined in 1).
B. All replies, quotes, retweets, likes and mentions between any of the tweets from the accounts in A.
C. We would then remove the main node from the network before computing the network metrics, since otherwise every node would be connected to the main node by definition, which affects the modularity, centralization and isolates.

It seems like this original approach would result in too many accounts in the selection in A. Therefore, we considered using more stringent selection criteria for A, for example counting only active interactions like quoting or replying, and/or requiring more than 1 interaction. The right threshold was something that we intended to test with the test dataset, but Tanusree never completed this.
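
The stricter selection for A could be sketched like this. It is only an illustration: the function name, the `(account, interaction_type)` record shape, and the interaction type labels are hypothetical, and "more than 1 interaction" is read as a minimum of 2.

```python
from collections import Counter

# Assumption: only quoting and replying count as "active" interactions.
ACTIVE_TYPES = {"quote", "reply"}


def select_accounts(interactions, main_account, min_interactions=2, active_only=True):
    """Return the accounts that interacted with the main account often enough.

    interactions: iterable of (source_account, interaction_type) pairs, each
    describing one interaction aimed at the main account.
    """
    counts = Counter(
        source
        for source, kind in interactions
        if source != main_account and (not active_only or kind in ACTIVE_TYPES)
    )
    return {account for account, count in counts.items() if count >= min_interactions}


sample = [
    ("a", "reply"), ("a", "quote"),      # two active interactions
    ("b", "retweet"), ("b", "retweet"),  # two passive interactions
    ("c", "reply"),                      # a single active interaction
]
selected = select_accounts(sample, "main")
```

Keeping `min_interactions` and `active_only` as parameters makes it straightforward to try different thresholds against a test dataset.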

We have also discussed constructing the network from followers only, since this would require significantly fewer API calls. In that case, the network would consist of:
X. All accounts that follow the main account
Y. All follower connections among the accounts in X
Z. We then remove the main account from the network before computing the metrics, for the same reason as in C
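
A minimal sketch of the follower-based construction (X, Y, Z), assuming the follow data is already available as a dictionary that maps each account to the set of accounts it follows. The function name and the data shape are hypothetical.

```python
def follower_network(main_account, follows):
    """Build the follower network around main_account.

    follows: dict mapping an account to the set of accounts it follows.
    Returns (nodes, edges), where an edge (a, b) means "a follows b".
    """
    # X: all accounts that follow the main account.
    nodes = {
        account
        for account, followed in follows.items()
        if main_account in followed
    }
    # Z: the main account itself is left out of the network.
    nodes.discard(main_account)
    # Y: follower connections among the accounts in X only.
    edges = {
        (a, b)
        for a in nodes
        for b in follows.get(a, ())
        if b in nodes and a != b
    }
    return nodes, edges


follows = {
    "a": {"main", "b"},
    "b": {"main"},
    "c": {"a"},      # does not follow main, so it is not part of X
    "main": {"a"},
}
nodes, edges = follower_network("main", follows)
```

Because the main account never enters `nodes`, every edge that touches it is excluded automatically, which is the effect described in Z.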

I think that for now, we can work with the network as defined in the second approach (based on followers). Even though this is not as insightful as looking at actual interactions, it is easier to achieve with the API constraints that we are currently dealing with.

amindadgar commented 1 year ago

For the case of A, B, and C, I think it's achievable, as we would use the tweets whose author is the main account (retweets, replies, and quotes all have a tweet pointing to another tweet). Also, we could easily remove the main account by simply not considering the main Twitter account. Generally, as we're considering a time window for the community, we could say everything has to be active in that time window (e.g. interactions in the past 7 days). So basically, I found that in A, B, and C we're considering just the tweets.

For the case of X, Y, and Z, again we could do it, but in my opinion it does not mean much, as it would be just Twitter accounts and followers (no tweets included).

Combining the two is not easily achievable or efficient, as it requires additional graph projections. I think it would be better to use the interactions (reply, quote, mention) and tweets rather than just the account and follower edges described in X, Y, and Z.

To demonstrate further, here's the query for A, B, and C (mentions and likes not included):

OPTIONAL MATCH (source:Tweet {authorId: 1361342551})<-[r:QUOTED|REPLIED|RETWEETED]-(target:Tweet)
WHERE
    target.authorId <> 1361342551  // don't include the main account's own activities
    AND
    r.createdAt >= $last_7_days_timestamp // query parameter: a float timestamp for the start of the last 7 days
RETURN source, target // the source and target nodes of the community
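
The query compares `r.createdAt` against a float value for the last 7 days. A small sketch of how that value could be produced is below. It assumes `createdAt` is stored as a Unix timestamp in seconds, which the thread does not confirm, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def window_start_timestamp(days=7, now=None):
    """Return the Unix timestamp (float, in seconds) marking the start of the window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - timedelta(days=days)).timestamp()


# Value to supply for the last-7-days placeholder in the query.
params = {"last_7_days_timestamp": window_start_timestamp(days=7)}
```

If `createdAt` turns out to be stored in milliseconds, the returned value would need to be multiplied by 1000.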

If I want to include mentions, I have to create a temporary Tweet node that stands in for the mentioned TwitterAccount node so that it is included (the same goes for likes, once the test data is updated). For example, the query would be something like this, with a new edge named ALIAS_MENTIONED attached to the newly created Tweet node:

OPTIONAL MATCH (source:Tweet {authorId: 1361342551})<-[r:QUOTED|REPLIED|RETWEETED|ALIAS_MENTIONED]-(target:Tweet)
WHERE
    target.authorId <> 1361342551  // don't include the main account's own activities
    AND
    r.createdAt >= $last_7_days_timestamp // query parameter: a float timestamp for the start of the last 7 days
RETURN source, target // the source and target nodes of the community

Let me know if this solution is right @TjitsevdM

TjitsevdM commented 1 year ago

Yes, I'm sure it will be fine on the Neo4j side, but I meant the rate limit on the Twitter API. There is a limit on how many tweets we can extract per month, and the first method requires us to extract tweets from the main account and all accounts that interact with it. That might work for a few communities but is harder to scale. The followers solution is indeed less informative but requires fewer Twitter API calls.

amindadgar commented 1 year ago

Yes, sure. So just to double-check, we would just treat the user and its followers as a community. More specifically, we would consider only the TwitterAccount nodes and FOLLOWS relations. Is this right?

Note: I have loaded part of that 2.1 GB of data onto my local computer to do the analysis on, and in this part I'm not seeing any FOLLOWS relation. Maybe it is not included in Sepehr's script, or maybe we didn't have it at all in the JSON file that was given to him. I'll reach out to him to check the status.

TjitsevdM commented 1 year ago

I don't think followers were part of the last script from Tanusree (so not in the JSON file). If we agree to go this way, rather than the original way, we'll have to update the extraction script to extract the followers of all the followers of an account.

TjitsevdM commented 1 year ago

And yes, you are correct in your interpretation

amindadgar commented 1 year ago

It is easier to implement the community as a TwitterAccount and its followers, but in my opinion this is not right, because the other metrics are computed from other interactions (retweet, mention, quote, etc.). Other than that, I don't have the data to write the computations for now. As for API calls, I'm not sure that defining the community based on followers would reduce them, since we still need the other metrics, which are computed from the tweets and the other interactions. So my opinion is to use the other interactions (plan A, B, C in your comment above).

I'll share a summary of our discussion here with other team members to get their ideas, but for now it seems I'm blocked on this metric (also, if you have more ideas, feel free to share them here).

TjitsevdM commented 1 year ago

I agree that this does not optimally align with the other metrics that are based on interactions, but we have to work with the constraints of the Twitter API. You can get the followers per account with a single API call per account. You can get the interactions with 2 API calls per tweet per account. A network of all followers will have more nodes than a network of all interacting accounts. The options I see are (from most API calls to least):

1. Original method: making the network from interactions between all interacting accounts
2. Making the network from follows between all followers (whether this is more or fewer calls per main account is tbd)
3. Making the network from follows between all interacting accounts
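
To compare the three options, a back-of-the-envelope estimate can use the figures quoted in this comment: 1 API call per account for followers and 2 API calls per tweet per account for interactions. The sketch below is a rough model only. The function name, the inputs, and the way calls are attributed to each option are assumptions; pagination and rate limits are ignored.

```python
def estimate_api_calls(n_followers, n_interacting, tweets_per_account):
    """Rough API call counts per main account for the three network options."""
    # 2 calls per tweet to fetch the interactions of one account.
    interaction_calls_per_account = 2 * tweets_per_account
    return {
        # Option 1: interactions of the main account and of every interacting account.
        "interactions_between_interacting_accounts":
            interaction_calls_per_account * (1 + n_interacting),
        # Option 2: followers of the main account, then followers of each follower.
        "follows_between_followers": 1 + n_followers,
        # Option 3: interactions of the main account to find the interacting
        # accounts, then one follower call for each of them.
        "follows_between_interacting_accounts":
            interaction_calls_per_account + n_interacting,
    }


# Purely illustrative numbers, not measurements.
estimate = estimate_api_calls(n_followers=1000, n_interacting=50, tweets_per_account=20)
```

Under this model, whether option 2 needs more or fewer calls than option 1 depends on the ratio of followers to tweets, which is consistent with that comparison being left as tbd.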

amindadgar commented 1 year ago

For now, we have no access to follower data in the Twitter API, so we removed the features related to this and are now closing the issue.

TogetherCrew / twitter-analytics

Compute community type #4