ChenWeiyan / LandSCENT

Landscape Single Cell Entropy

net13Jun12.m and net17Jan16.m contain duplicate rows and columns #6

Closed elmbeech closed 4 years ago

elmbeech commented 4 years ago

Dear Chen Weiyan,

I deeply admire your group's research and really appreciate that you wrote this awesome R package! (I would have preferred a Python package, but that's a matter of taste.)

I don't know whether it matters for the calculation, but I noticed that your PPI matrices contain duplicate rows and columns.

I attached tab-separated value files with the duplicated rows dropped. Maybe they are useful: net13Jun2012.entrez.m.tsv.gz, net17Jan2016.entrez.m.tsv.gz

Best, Elmar Bucher

ChenWeiyan commented 4 years ago

Dear Elmar Bucher,

Thank you for your appreciation and for the feedback on my package!

I checked the genes in the PPI network matrices in several ways but could not find such duplications:

  1. I checked the Entrez IDs in both matrices with the R function duplicated, for example which(duplicated(rownames(net13Jun12.m))). The result was integer(0), which means that no row name occurs more than once.
  2. I also mapped the Entrez IDs to gene symbols and checked for duplicates again. There were still none in net13Jun12.m and only three in net17Jan16.m, far fewer than you observed.
  3. I also checked the matrices you provided. It seems that the excluded genes do not show up in the reshaped matrices at all.
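
The row-name check in item 1 has a direct pandas counterpart. As a rough sketch on a made-up 3x3 matrix (the IDs "101" to "103" and all variable names are placeholders, not content of net13Jun12.m), the snippet below separates two different questions: whether an identifier occurs more than once, and whether two differently named rows carry identical values.

```python
import pandas as pd

# Toy adjacency matrix; the IDs are invented placeholders, not real network content.
toy = pd.DataFrame(
    [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]],
    index=["101", "102", "103"],
    columns=["101", "102", "103"],
)

# Question 1: does any row label occur more than once?
# Index.duplicated() looks at the labels only, like duplicated(rownames(m)) in R.
dup_labels = toy.index[toy.index.duplicated()].tolist()

# Question 2: do any rows have the same values as an earlier row?
# DataFrame.duplicated() compares the column values and does not look at the index.
dup_rows = toy[toy.duplicated()].index.tolist()

print("duplicated labels:", dup_labels)
print("rows with repeated values:", dup_rows)
```

In this toy case the labels are all unique, yet row "103" repeats the values of row "102", so the two checks give different answers.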

So I am wondering how you identified these duplications. Could you provide more details so that I can help you further?

Best, Weiyan

elmbeech commented 4 years ago

Dear Weiyan,

I see what went wrong. I am sorry for that.

I am not really an R programmer, so I exported the network to a tab-separated file like this:

library(LandSCENT)
data(net13Jun12.m)
write.table(net13Jun12.m, "net13Jun2012.original.entrez.m.tsv", sep="\t")

Then I loaded it into Python 3 to map the Entrez gene identifiers to other gene identifiers:

import pandas as pd
df_net13 = pd.read_csv("net13Jun2012.original.entrez.m.tsv", sep="\t")

Because gene identifiers do not always map one to one, I used the pandas command:

df_net13 = df_net13.drop_duplicates()

Now, drop_duplicates removes all duplicate rows, but it ignores the index. So it removed every gene that is connected in the network in exactly the same way as another gene.
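
For anyone who runs into the same thing, here is a minimal sketch of the effect on a toy matrix, together with two ways to deduplicate by identifier instead. The gene names geneA to geneC and the variable names are hypothetical, and the toy data only stands in for the real network.

```python
import pandas as pd

# Toy stand-in for the network: geneB and geneC have identical connections.
adj = pd.DataFrame(
    [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]],
    index=["geneA", "geneB", "geneC"],
    columns=["geneA", "geneB", "geneC"],
)

# drop_duplicates() compares row values only, so geneC is dropped
# even though its identifier is unique.
by_values = adj.drop_duplicates()

# Option 1: deduplicate on the identifier itself and keep the first occurrence.
by_label = adj[~adj.index.duplicated(keep="first")]

# Option 2: move the index into a regular column first, so that a row only
# counts as a duplicate when both the identifier and the values match.
by_both = adj.reset_index().drop_duplicates().set_index("index")

print(by_values.index.tolist())
print(by_label.index.tolist())
print(by_both.index.tolist())
```

Only the first variant loses a gene here. Both alternatives keep all three rows because every identifier is unique.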

I am sorry about that! I think I can close this issue. Elmar