Aghasemian / CommunityFitNet

This page is a companion for our paper on overfitting and underfitting of community detection methods on real networks, written by Amir Ghasemian, Homa Hosseinmardi, and Aaron Clauset. (arXiv:1802.10582)

Some bipartite networks have problematic node indexing #3

Open count0 opened 4 years ago

count0 commented 4 years ago

I noticed that most of the 2-mode "Norwegian Board of Directors" networks, which are supposed to be bipartite, actually contain odd-length cycles. For example, in the data set

Norwegian_Board_of_Directors_net2mode_2009-10-01

The following triangle exists:

[108 284 289]

Other data sets from this series do not contain triangles, but longer odd-length cycles do exist. Only a minority of them are actually bipartite.

The primary data, downloaded from the original website, seems to have the same problem... It seems the node indexes in the two modes (director and board) can repeat.
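For anyone who wants to reproduce the check, here is a minimal sketch of a bipartiteness test by breadth-first two-coloring. It is not the code used to find the triangle above; the function name `is_bipartite` and the toy edge lists are assumptions made for illustration, and only the triangle [108 284 289] comes from this thread.

```python
from collections import defaultdict, deque


def is_bipartite(edges):
    """Return True if the undirected graph given by `edges` has no odd-length cycle."""
    adj = defaultdict(set)
    for u, v in edges:
        if u == v:
            return False  # a self-loop is an odd cycle of length 1
        adj[u].add(v)
        adj[v].add(u)

    color = {}
    for start in list(adj):
        if start in color:
            continue  # this component was already colored
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]  # neighbors get the opposite color
                    queue.append(v)
                elif color[v] == color[u]:
                    return False  # same color on both ends: odd cycle
    return True


# Toy inputs (hypothetical, apart from the triangle reported above).
triangle = [(108, 284), (284, 289), (289, 108)]
path = [(1, 2), (2, 3), (3, 4)]

print(is_bipartite(triangle))  # the triangle is an odd cycle
print(is_bipartite(path))      # a simple path can always be two-colored
```

A graph passes this test exactly when it has no odd-length cycle, so a single triangle in the edge list is enough to make it fail.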

Aghasemian commented 4 years ago

Thanks Tiago for letting us know. I also checked the original data at "http://www.boardsandgender.com/data.php". We changed the node indices, since in the data we have provided here we only considered the largest component. In the original dataset, the nodes with the indices (73, 146, 5134) form the same triangle!

count0 commented 4 years ago

Indeed, it's a problem with the original dataset! It seems that the two kinds of nodes (director and board) can have the same index, which is error prone.

Thanks for fixing.

(Note that fixing the indexes changes the largest component of some of these networks quite a bit, since they are very sparse.)

Aghasemian commented 4 years ago

Interesting! I didn't know about that. Can you give me a reference? How can changing the indices change the largest component? Is it an algorithmic issue?

count0 commented 4 years ago

The repeated indexes cause the number of nodes to be smaller, since they merge different nodes together (thus destroying the bipartite property). Fixing this problem therefore increases the number of nodes, while keeping the number of edges constant.

For example, for the data Norwegian_Board_of_Directors_net2mode_2009-10-01, we have N=1332 nodes before fixing the indexes, and N=1729 afterwards, while E=1465 remains the same.

As a result of the sparsification, the number of components jumps from 1 to 278!
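The effect can be seen on a toy example. The sketch below assumes each edge is stored as a (director index, board index) pair, and that the two modes are numbered independently so the same integer can appear in both; that layout, the helper `count_nodes_and_components`, and the three-edge input are illustrative assumptions, not the actual data or the fix that was applied.

```python
def count_nodes_and_components(edges):
    """Return (number of nodes, number of connected components) using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merge the two components
    roots = {find(x) for x in list(parent)}
    return len(parent), len(roots)


# Hypothetical two-mode edge list: (director index, board index).
# The indexes 1, 2, 3 are reused in both modes.
pairs = [(1, 2), (2, 3), (3, 1)]

# Reading the indexes as one shared namespace merges director i with board i.
merged = pairs

# Tagging each index with its mode keeps the two kinds of nodes distinct.
separated = [(("director", d), ("board", b)) for d, b in pairs]

print(count_nodes_and_components(merged))     # 3 nodes in 1 component (a triangle)
print(count_nodes_and_components(separated))  # 6 nodes in 3 components
```

The number of edges is 3 in both cases; only the node count and the connectivity change, which is the same pattern as the N=1332 versus N=1729 figures above.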

MateJozsaPhys commented 4 years ago

72% of the graphs labelled as bipartite are not bipartite according to the given edge lists.

Aghasemian commented 4 years ago

Yes, as Tiago pointed out before, all “Norwegian_Board_of_Directors_net2mode…” networks (111 networks, network ids = 254–364) and one network called “Aishihik_Lake_host-parasite_web_Aishihik_Lake_host-parasite_web” (network id = 0) have this indexing issue. So in total, the source data had this issue for 112 of the 572 networks. We fixed that in our new publication on optimal link prediction. I will attach the corrected version of the networks soon.

MateJozsaPhys commented 4 years ago

Thank you! I am waiting for the corrected version!

MateJozsaPhys commented 4 years ago

The “Norwegian_Board_of_Directors_net2mode…” entries after the "graphProperties" column are the projected networks. What I pointed out is that after the "graphProperties" column, where the "Bipartite" property appears, 113 of the 157 cases are not bipartite after the indexing. I didn't check the projected graphs, but if this is true then 112+113 = 225 networks out of 572 have a problem.

Aghasemian commented 4 years ago

No, the projected graphs were not projected by us; they come from the original source. The issue is related to the indexing in some of the bipartite graphs that I mentioned. The "Norwegian_Board_of_Directors_net2mode" family of networks and one network called “Aishihik_Lake_host-parasite_web_Aishihik_Lake_host-parasite_web” had this issue, which is now fixed, and I will attach the corrected version very soon. Thanks!

MateJozsaPhys commented 4 years ago

Indeed. There are also the projected and the bipartite versions in the dataframe. Thank you!

Aghasemian commented 4 years ago

Sorry about the delay. I updated the dataframe.