anvaka / sayit

Visualization of related subreddits
https://anvaka.github.io/sayit
MIT License
1.25k stars 106 forks

Jaccard Similarity Bias on Bipartite Graphs #8

Open AlexMRuch opened 5 years ago

AlexMRuch commented 5 years ago

You should consider using the Standardized Co-incidence Ratio (SCR) as a measure of similarity instead of the Jaccard similarity, as the latter is biased in large bipartite graphs whereas the former is not: https://yongrenshi.weebly.com/uploads/5/7/8/6/57861243/ssr_structuralsimilarity.pdf. Should be a simple switch.

anvaka commented 5 years ago

Thank you Alex. I'll check this out in more detail.

I briefly skimmed the article and got puzzled by their definition of the Jaccard index. They say

Jij = d / (b + c + d)

Where b, c, and d are as defined in their paper.

I'd expect the denominator to be (b + c - d), not (b + c + d). Is it just a typo, or did they use an incorrect definition of Jaccard similarity?

anvaka commented 5 years ago

Also, a quick follow-up question in case I'm missing it: how do you compute E(d), the number expected in an otherwise identical randomized network?

AlexMRuch commented 5 years ago

If you’re using the affiliation matrix, then it is d/(b+c+d), because b and c only count nodes that are in exactly one of the two sets; the overlap d is already removed from them, so the union is b + c + d. See the “Similarity of asymmetric binary attributes” figure in https://en.wikipedia.org/wiki/Jaccard_index.
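A quick sanity check in Python (toy neighbor sets, not from the thread) confirming that d/(b+c+d) with exclusive counts b and c equals the usual set formulation of the Jaccard index:

```python
# Two nodes' neighbor sets in a bipartite graph (hypothetical toy data).
A = {1, 2, 3, 4}   # items node i links to
B = {3, 4, 5}      # items node j links to

b = len(A - B)  # neighbors of i only -> 2
c = len(B - A)  # neighbors of j only -> 1
d = len(A & B)  # shared neighbors   -> 2

# Because b and c already exclude the overlap, the union is b + c + d,
# so d/(b+c+d) equals the familiar |A intersect B| / |A union B|.
assert d / (b + c + d) == len(A & B) / len(A | B)  # 2/5 = 0.4
```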

For E(d), I’m guessing they either used a random graph generator (e.g., https://networkx.github.io/documentation/networkx-1.10/reference/generators.html) with parameters equal to the empirical graph, or they used one of the methods listed in the following:

“The difference is the denominator. Instead of comparing the number of co-followers with the number of followers (measured either as the union or the geometric mean), SCR controls for the expected increase in co-following among high-degree followers by taking as a baseline the number of co-followers expected by chance in an otherwise identical network. The randomization procedure preserves the original in-degree and outdegree of every node (Zweig and Kauffman, 2011; Gionis et al., 2007; Neal, 2014).”

Hope that helps!


anvaka commented 5 years ago

The difference is the denominator

It is very interesting. Any chance you can ELI5 this to me? Given the reddit network of posts and comments, what should be the E(d) function? How do I compute it efficiently?

AlexMRuch commented 5 years ago

Sure thing!

So social networks often follow what’s called a power law distribution, which basically means that lots of nodes in the network have a small or moderate degree of connectivity to other nodes, but a small number of nodes have a very high degree. If you think of Reddit, for example, a lot of people only post 1-2 comments a month; however, some people comment on hundreds or even thousands of subreddits. Because the distribution of these degrees of connectivity is highly skewed with a heavy right tail, it throws off the estimate of the network’s average degree (i.e., the estimated average level of connectivity in the graph will be higher than it really should be). This is the same reason why we often report the median of such distributions, as it’s not affected by the skew.
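For instance, with a hypothetical sample of per-user activity counts, the heavy tail drags the mean far above the median:

```python
# Hypothetical monthly comment counts: most users post a little,
# a couple of power users post a lot.
degrees = [1, 1, 2, 2, 2, 3, 3, 5, 200, 1500]

mean = sum(degrees) / len(degrees)           # dragged up by the heavy tail
median = sorted(degrees)[len(degrees) // 2]  # robust to the tail
# (For an even-length list this picks the upper middle, fine for a sketch.)

# mean is 171.9 while median is 3: the two power users dominate the mean.
```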

The paper shows that the heavily skewed degree of connectivity in large networks biases many measures of similarity in networks (e.g., Euclidean, Pearson, Cosine, Jaccard, etc.). The bias happens because the out degree of the very highly connected nodes overemphasizes some of the weight in the denominators of these measures. The standardized co-incidence ratio, however, excludes this bias as it compares the degrees to what is expected at random.
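A tiny made-up example of that deflation: a hub that genuinely shares half of a small node's neighbors still scores a near-zero Jaccard similarity, because the hub's huge degree dominates the union in the denominator:

```python
# A niche node and a hub (toy sets, purely illustrative).
niche = set(range(10))                       # 10 neighbors
hub = set(range(5)) | set(range(100, 1000))  # shares 5 of them, plus 900 others

# Real overlap (5 of the niche node's 10 neighbors), but the union is 910,
# so the similarity collapses to about 0.0055.
jaccard = len(niche & hub) / len(niche | hub)  # 5 / 910
```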

Fortunately, if you want to get similarity measures for your graph, you’ll only need to generate one random graph. You can use this tutorial and adjust the number of nodes (n) and edges (m) to match your network: https://networkx.github.io/documentation/stable/auto_examples/graph/plot_erdos_renyi.html. After you’ve generated your random network, you’ll only have to calculate E(d) once, which will return a number equal to the average number of common neighbors shared by any pair of nodes (i.e., how many nodes, on average, do any two nodes in the network share as common neighbors). This will give you a number, like 10.23 or 5.81 or 27.22, depending on how sparse or dense the network is. Again, this number just represents the average number of common neighbors that any two nodes share.
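A minimal pure-Python sketch of that baseline (no networkx; the function name is made up, and it assumes E(d) means the average common-neighbor count over all node pairs in one G(n, m) random graph, the Erdős–Rényi variant the tutorial above covers):

```python
import random
from itertools import combinations

def expected_common_neighbors(n, m, seed=42):
    """Sketch of E(d): build one random graph with n nodes and m edges,
    then average the number of common neighbors over all node pairs."""
    rng = random.Random(seed)
    edges = rng.sample(list(combinations(range(n), 2)), m)
    neighbors = {v: set() for v in range(n)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    pairs = list(combinations(range(n), 2))
    return sum(len(neighbors[u] & neighbors[v]) for u, v in pairs) / len(pairs)
```

Note the quoted paper calls for randomization that preserves every node's degree, so for a faithful SCR you would swap this generator for a degree-preserving one (e.g., edge swaps); the averaging step stays the same.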

Once you have this number, you can update your similarity estimate script to set E(d) to the number you estimated and then re-estimate the similarities between each of your nodes using the d/E(d) SCR measure. It should give you a matrix just like your Jaccard method gives you.
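For instance, a hypothetical helper (not from the paper) that rescales a matrix of pairwise shared-neighbor counts by the single E(d) baseline:

```python
def scr_matrix(shared_counts, expected_d):
    """Turn pairwise shared-neighbor counts d_ij into SCR scores d_ij / E(d).
    shared_counts: nested list of counts; expected_d: the one baseline
    number computed from the randomized network."""
    return [[d / expected_d for d in row] for row in shared_counts]

# Scores above 1 mean more co-incidence than expected by chance.
scr = scr_matrix([[0, 4], [4, 0]], 2.0)  # [[0.0, 2.0], [2.0, 0.0]]
```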


no-identd commented 4 years ago

Any update on this? Would be interesting to see how this differs.