Bioconductor / AnnotationForge

Tools for building SQLite-based annotation data packages
https://bioconductor.org/packages/AnnotationForge

How to deal with KEGG Pathway annotation with replicates #9

Closed: guokai8 closed 4 years ago

guokai8 commented 4 years ago

Hi, I am wondering how to handle a KEGG pathway annotation file in which multiple genes map to one pathway and one gene can map to multiple pathways, since makeOrgPackage won't allow duplicate rows for a GID. I am also wondering how the official annotation databases are built, given that one gene can have multiple pathways and one pathway can contain different genes. Thanks! Kai

Kayla-Morrell commented 4 years ago

Hello @guokai8,

Sorry for the delayed response. Could you give more detail on what you are trying to achieve, ideally with a reproducible example? I looked at the Making Organism Packages vignette in AnnotationForge, and although it uses GO information, it seems possible to have duplicate GID values in the input files. Consider this code:

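# Locate the example GO annotation file shipped with AnnotationForge
finchGOFile <- system.file("extdata", "GO_finch.txt",
                           package = "AnnotationForge")
# Read the tab-separated file
fGO <- read.table(finchGOFile, sep = "\t")
# Drop rows with an empty second column (GO ID) or third column (evidence code)
fGO <- fGO[fGO[, 2] != "", ]
fGO <- fGO[fGO[, 3] != "", ]
# Name the columns; the gene ID column is called GID
colnames(fGO) <- c("GID", "GO", "EVIDENCE")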

You can see that many GID values are duplicated:

> head(fGO)
        GID         GO EVIDENCE
1 100190152 GO:0008152      IEA
2 100190152 GO:0016310      IEA
3 100190152 GO:0006222      IEA
4 100190152 GO:0015937      IEA
5 100190152 GO:0000166      IEA
6 100190152 GO:0005524      IEA
> table(duplicated(fGO$GID))

FALSE  TRUE
  346  2814
>
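
The same long-format layout may work for KEGG pathways: one row per gene-pathway pair, so a GID can repeat as long as no whole row repeats. Below is a minimal sketch using base R only. The data frame name fKEGG, the column name PATH, and all the IDs in it are made up for illustration. It also assumes that what makeOrgPackage objects to is fully duplicated rows rather than repeated GID values, which is worth checking against your installed version.

```r
# Hypothetical long-format KEGG table: one row per gene-pathway pair.
# The gene IDs and pathway IDs below are invented for illustration only.
fKEGG <- data.frame(
  GID  = c("100190152", "100190152", "100190153", "100190153", "100190153"),
  PATH = c("00010", "00020", "00010", "00010", "00030"),
  stringsAsFactors = FALSE
)

# GID values repeat across rows, which is expected for a many-to-many mapping.
table(duplicated(fKEGG$GID))

# Drop rows that are identical in every column, keeping one copy of each pair.
fKEGG <- unique(fKEGG)

# No whole row is duplicated now, even though GID and PATH values still repeat.
stopifnot(!any(duplicated(fKEGG)))
```

If that assumption holds, a table shaped like this could presumably be passed to makeOrgPackage as another named data frame, the same way the vignette passes fGO.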

I'm hoping with a bit more detail from you I will be able to help solve the issue.