Closed amindadgar closed 1 year ago
Q1: You could call communities that are low on network density, with either a high or low isolation fraction, a Fragmented Community.
Source of classification: see Figure 1 in https://journals.sagepub.com/doi/full/10.1177/2056305117691545
Thanks for the answer, I'm guessing the article has defined different terms from our document. I will implement it for now and assign a value for each type. Then we could figure out what we want to show to users.
The terms we defined were Audience, Polarized Community, Tight Community, and Healthy Community.
More details for the question of the definition of community on Twitter:
Please double check if they are right or we need to include more @TjitsevdM
Update: Reading the original article gives some insight into the network of information flow, but it doesn't fully define a community the way a guild does in Discord. So, as in item 1, I'm guessing we should take the user as the definition of the community, with items 3 to 7 defining the information flow of that community. (Link to the part of the original article defining information flow: https://journals.sagepub.com/doi/full/10.1177/2056305117691545#:~:text=connections%20among%20twitter%20users%20define%20the%20boundaries%20of%20information%20flow.%20as%20individuals%20and%20organizations%20mention%2C%20retweet%2C%20and%20reply%2C%20they%20create%20networks%20of%20information%20flow.)
We define community in a different way compared to the paper. We only use the community type definitions from the paper but construct the network in a different way.
Originally, the community consisted of all the tweets from the community center as defined in 1, plus:
A. All the tweets from accounts that replied to, quoted, retweeted, liked or mentioned (tweets from) the main account (main account as defined in 1).
B. All replies, quotes, retweets, likes and mentions between any of the tweets from the accounts in A.
C. We would then remove the main node from the network before computing the network metrics, since otherwise every node would be connected to the main node by definition, which affects the modularity, centralization and isolates.
It seems like this original approach would result in too many accounts in the selection in A. Therefore, we considered using more stringent selection criteria for A. For example, only active interactions like quoting or replying and/or more than 1 interaction. The right threshold was something that we intended to test with the test dataset but this never got completed by Tanusree.
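The stricter selection for A could be sketched roughly as below. This is only an illustration: the record shape `(source_account, interaction_type, target_account)`, the function name `select_accounts`, and the default threshold of 2 are assumptions, since the right threshold was never tested.

```python
from collections import Counter

# Assumption: "active" interactions are quotes and replies only.
ACTIVE_TYPES = {"quote", "reply"}


def select_accounts(interactions, main_account, min_interactions=2,
                    allowed_types=ACTIVE_TYPES):
    """Return accounts with at least `min_interactions` allowed interactions
    aimed at `main_account`.

    `interactions` is an iterable of hypothetical
    (source_account, interaction_type, target_account) tuples.
    """
    counts = Counter(
        source
        for source, kind, target in interactions
        if target == main_account
        and source != main_account  # ignore the main account's self activity
        and kind in allowed_types
    )
    return {account for account, n in counts.items() if n >= min_interactions}
```

Lowering `min_interactions` to 1 or widening `allowed_types` moves the selection back towards the original, broader definition of A.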
We have also discussed constructing the network from followers only, since this would require significantly fewer API calls. In that case, the network would consist of:
X. All accounts that follow the main account.
Y. All follower connections among the accounts in X.
Z. We then remove the main account from the network before computing the metrics, for the same reason as in C.
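A minimal sketch of the X, Y, Z construction, assuming follower data arrives as `(follower, followed)` pairs. The function name and the metric formulas (directed density, fraction of nodes with no connection) are assumptions for illustration; the thread does not fix them.

```python
def follower_network_metrics(follows, main_account):
    """Build the follower-only network and compute two simple metrics.

    `follows` is an iterable of hypothetical (follower, followed) pairs.
    """
    follows = list(follows)
    # X: all accounts that follow the main account.
    members = {
        follower for follower, followed in follows
        if followed == main_account and follower != main_account
    }
    # Y: follower connections among the accounts in X.
    # Z: the main account is never in `members`, so it is excluded here.
    edges = {
        (follower, followed) for follower, followed in follows
        if follower in members and followed in members and follower != followed
    }
    n = len(members)
    connected = {account for edge in edges for account in edge}
    # Assumption: density of a directed graph, edges / (n * (n - 1)).
    density = len(edges) / (n * (n - 1)) if n > 1 else 0.0
    # Assumption: isolates are members with no edge to any other member.
    isolate_fraction = (n - len(connected)) / n if n else 0.0
    return {
        "nodes": n,
        "edges": len(edges),
        "density": density,
        "isolate_fraction": isolate_fraction,
    }
```

Because the main account is dropped, a follower who is connected to nobody else in X shows up as an isolate rather than as a node with a single edge.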
I think that for now, we can work with the network as defined in the second approach (based on followers). Even though this is not as insightful as looking at actual interactions, it is easier to achieve with the API constraints that we are currently dealing with.
For the case of A, B, and C, I think it's achievable, since we would use the tweets whose author is the main account (a retweet, reply, or quote is always a tweet pointing to another tweet). We could also easily remove the main account by not considering the main Twitter account. In general, since we're considering a time window for the community, we could require everything to be active in that window (e.g. interactions in the past 7 days). So basically, in A, B, and C we're considering just the tweets.
For the case of X, Y, and Z, again we could have it, but in my opinion it does not mean much, as it would be just Twitter accounts and followers (no tweets included).
Combining the two is neither easy nor efficient, as it requires additional graph projections. I think it would be better to use the interactions (reply, quote, mention) and tweets rather than just the account and follower edges described in X, Y, and Z.
To demonstrate, here's the query for A, B, and C (mentions and likes not included):
OPTIONAL MATCH (source:Tweet {authorId: 1361342551})<-[r:QUOTED|REPLIED|RETWEETED]-(target:Tweet)
WHERE
target.authorId <> 1361342551 // don't include the self activities
AND
r.createdAt >= last_7_days_timestamp // a float value for last 7 days
RETURN source, target // Here we're getting the source and target nodes of the community
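The `last_7_days_timestamp` value in the query could be computed along these lines. This is a sketch under one assumption: that `createdAt` is stored as epoch seconds. The thread only says it is a float, so the unit should be checked against the data. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def window_start_timestamp(days=7, now=None):
    """Return the start of the activity window as a float epoch timestamp.

    Assumption: `createdAt` on the relationships is in epoch seconds.
    `now` can be passed in to make the result reproducible.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - timedelta(days=days)).timestamp()
```

Passing `now` explicitly keeps the window stable while one batch of communities is being processed.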
If I want to include mentions, I have to create a temporary Tweet node that acts as the mentioned TwitterAccount node so they would be included (same for likes, if we had the test data updated). For example, the query would be something like this (after creating an edge named ALIAS_MENTIONED for the newly created Tweet):
OPTIONAL MATCH (source:Tweet {authorId: 1361342551})<-[r:QUOTED|REPLIED|RETWEETED|ALIAS_MENTIONED]-(target:Tweet)
WHERE
target.authorId <> 1361342551 // don't include the self activities
AND
r.createdAt >= last_7_days_timestamp // a float value for last 7 days
RETURN source, target // Here we're getting the source and target nodes of the community
Let me know if this solution is right @TjitsevdM
Yes, I'm sure it will be fine on the Neo4j side, but I meant the rate limit on the Twitter API. There is a limit on how many tweets we can extract per month, and the first method requires us to extract tweets from the main account and all accounts that interact with it. That might work for a few communities but is harder to scale. The followers solution is indeed less informative but requires fewer Twitter API calls.
Yes, sure. So just to double check, we would just assume the user and its followers as a community. More specifically, we would just consider the TwitterAccount nodes and FOLLOWS relations. Is this right?
Note: I have loaded part of that 2.1 GB dataset onto my local machine to do the analysis on, and in this part I'm not seeing any FOLLOWS relation. Maybe it is not included in Sepehr's script, or maybe we didn't have it at all in the JSON file that was given to him. I'll reach out to him to see the status.
I don't think followers were part of the last script from Tanusree (so not in the JSON file). If we agree to go this way, rather than the original way, we'll have to update the extraction script to extract the followers of all the followers of an account.
And yes, you are correct in your interpretation
It is easier to implement the definition of community as TwitterAccount and followers, but in my opinion this is not right, because the other metrics are computed from other interactions (retweet, mention, quote, etc.). Other than that, I don't have the data to write down the computations for now. As for API calls, I'm not sure defining the community based on followers would reduce them, since we still need the other metrics, which are computed from the tweets and the other interactions. So my opinion is to use the other interactions (plan A, B, C in your comment above).
I'll share a summary of our discussion here with other team members to see their ideas, but for now, it seems I'm blocked on this metric (also if you had more ideas feel free to share them here).
I agree that this does not optimally align with the other metrics that are based on interactions, but we have to work within the constraints of the Twitter API. You can get the followers per account with a single API call per account. You can get the interactions with 2 API calls per tweet per account. A network of all followers will have more nodes than a network of all interacting accounts. The options I see are (from most API calls to least):
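As rough arithmetic on the figures quoted in this comment (1 call per account for followers, 2 calls per tweet per account for interactions), the two approaches can be compared as sketched here. The function names are hypothetical, and the estimate deliberately ignores pagination and rate-limit windows, which are assumptions that would raise the real numbers.

```python
def follower_approach_calls(num_followers):
    """Estimated calls for the follower-only network.

    One call for the main account's followers, plus one call per follower
    to fetch that follower's own followers.
    """
    return 1 + num_followers


def interaction_approach_calls(tweets_per_account):
    """Estimated calls for the interaction network.

    `tweets_per_account` maps each account (main account plus interacting
    accounts) to its number of tweets; each tweet costs 2 calls.
    """
    return 2 * sum(tweets_per_account.values())
```

For a main account with 500 followers the follower approach comes to 501 calls, while three accounts with 100, 20 and 30 tweets already cost 300 calls under the interaction approach.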
For now, we have no access to follower data in the Twitter API, so we removed the features related to this and are now closing the issue.
We have to compute the community types based on 4 metrics. The metrics are
Some definitions are
How we could compute community type is described in the pseudo-code below
Question 1: What's the community type in the case of low network density and high isolation fraction?
Question 2: For each user, the community is defined by the people the user has interacted with. Is this right?