GangLiLab / genekitr

🧬 Gene analysis toolkit based on R
https://www.genekitr.fun
GNU General Public License v3.0

transId keeping unique ids issue #20

Closed MEladawi closed 1 year ago

MEladawi commented 1 year ago

Hello,

Some genes are not converted to their new symbols (the new symbol is BABAM2):

image

Also, the information for the gene is incomplete:

image
MEladawi commented 1 year ago

Also, the 7666 genes are returned as 6100 with unique = T and keepNA = T. Why is that?

reedliu commented 1 year ago

Hi, thanks for your feedback. This happens because we import protein ID data from UniProt, where gene symbols are mixed. For example, the name BRE alone maps to two UniProt IDs: L8E9D4 and Q96P08. That said, there was indeed a bug in how one-to-many mappings were handled. The previous logic was: if the same symbol appears, keep the record with the same symbol.
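To illustrate the one-to-many situation described above, here is a minimal base R sketch. It is not genekitr's actual implementation: the table layout, the `collapse_one_to_many` helper, and the `GENE_A` / `ID_A` row are hypothetical and only serve as an example. It shows one possible way to keep a single row per input symbol when a symbol such as BRE maps to several UniProt IDs.

```r
# Toy mapping table (hypothetical layout, not genekitr's internal data).
# BRE maps to two UniProt IDs, as mentioned in this thread;
# GENE_A / ID_A are made-up placeholders.
mapping <- data.frame(
  input_id = c("GENE_A", "BRE", "BRE"),
  uniprot  = c("ID_A", "L8E9D4", "Q96P08"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one row per input ID and join all of its
# UniProt IDs into a single "; "-separated string.
collapse_one_to_many <- function(df) {
  ids <- unique(df$input_id)  # preserves first-appearance order
  merged <- vapply(
    ids,
    function(id) paste(unique(df$uniprot[df$input_id == id]), collapse = "; "),
    character(1)
  )
  data.frame(input_id = ids, uniprot = unname(merged), stringsAsFactors = FALSE)
}

collapsed <- collapse_one_to_many(mapping)
print(collapsed)
```

With this approach the number of output rows equals the number of unique input symbols, whatever the number of UniProt IDs per symbol.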

The bug is fixed in version 1.2.4. Please try again:

transId(c('BEX1', 'BRE', 'BTG2', 'C14orf169'), 'sym', unique = TRUE, keepNA = TRUE)
image

For your second question, could you please provide example data (the 7666 genes you mentioned) for testing? When we use all human symbols from the HGNC website, the output contains the same set of symbols as the input:

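library(genekitr)

# Download the complete HGNC gene set and collect the unique symbols
hgnc = vroom::vroom('https://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt')
all_sym = unique(hgnc$symbol)
length(all_sym)

# Check that every output ID is an input symbol, and vice versa
x = genInfo(all_sym, unique = TRUE, keepNA = TRUE)
table(x$input_id %in% all_sym)
table(all_sym %in% x$input_id)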
image
MEladawi commented 1 year ago

Thanks, all fixed!

For the second question, the difference was caused by duplicates in my list.
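For anyone who sees a similar drop in row count, a quick base R check for repeated entries in the input may help before calling transId or genInfo. The sketch below uses a made-up `genes` vector (an assumption for illustration, not the list from this issue) and only standard base R functions.

```r
# Hypothetical input list containing repeated symbols
genes <- c("BEX1", "BRE", "BTG2", "BRE", "C14orf169", "BEX1")

n_total  <- length(genes)          # number of entries, duplicates included
n_unique <- length(unique(genes))  # number of distinct symbols

# Symbols that occur more than once in the input
dup_syms <- unique(genes[duplicated(genes)])

cat("total:", n_total, "unique:", n_unique, "\n")
cat("duplicated symbols:", paste(dup_syms, collapse = ", "), "\n")
```

If `n_total` and `n_unique` differ, a shorter result with unique = T is expected rather than a sign of lost genes.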