VCCRI / Ularcirc

An R-shiny app that provides backsplice and canonical splicing analysis for both circular RNA (circRNA) and parental transcripts
GNU General Public License v3.0

Bug report: using an Ensembl ID-based de novo assembly transcriptome for species other than human #20

Closed wong-ziyi closed 2 years ago

wong-ziyi commented 2 years ago

Hi, thanks so much for making such amazing software.

I ran into some problems when using my de novo assembled transcriptome with an Ensembl reference for a species other than human.

In the file Ularcirc/inst/shiny-app/circRNA/Server.R, on lines 269 and 1338, the regular expression ^ENSG([-0-9]+) is only valid for human IDs.

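The Ensembl ID format is ENS<species prefix><feature type prefix><a unique eleven-digit number>. For example, ENSMUSG00000017167 is an Ensembl stable ID of a mouse gene (species prefix MUS, feature type G). Human IDs have an empty species prefix, which is why ^ENSG only matches human genes.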
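To make the ID layout above concrete, here is a minimal base-R sketch that splits a stable ID into its three parts. The helper name `parse_ensembl_id` is hypothetical and not part of Ularcirc, and the pattern assumes a single-letter feature type (E, G, P, R or T), so two-letter feature types and version suffixes such as `.1` are not handled.

```r
# Hypothetical helper (not part of Ularcirc): split an Ensembl stable ID into
# species prefix, feature type and the eleven-digit number.
# Assumption: the feature type is a single letter (E, G, P, R or T).
parse_ensembl_id <- function(id) {
  pattern <- "^ENS([A-Z]*)([EGPRT])([0-9]{11})$"
  m <- regmatches(id, regexec(pattern, id))[[1]]
  if (length(m) == 0) return(NULL)  # not an Ensembl-style ID
  list(species_prefix = m[2], feature_type = m[3], number = m[4])
}

mouse  <- parse_ensembl_id("ENSMUSG00000017167")  # mouse gene from the text
human  <- parse_ensembl_id("ENSG00000139618")     # human gene, empty species prefix
symbol <- parse_ensembl_id("Slc12a1")             # gene symbol, returns NULL
```

The species prefix comes back as an empty string for human IDs, which is the case the original ^ENSG pattern was written for.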

Therefore, I changed the original Ensembl ID test as follows.

For line 269:

  test0 <- gsubfn::strapplyc(as.character(GeneName),pattern="^(ENS[[:alpha:]]*).*")
  test <- gsubfn::strapplyc(as.character(GeneName),pattern=paste0("^",test0[[1]],"([-0-9]+)"))
  if (length(test[[1]]) > 0)  # Ensembl ID

For line 1338:

 # Allow any species/feature-type letters between "ENS" and the digits
 ensembl_IDs <- gsubfn::strapplyc(as.character(ensembl_IDs),"^ENS[[:alpha:]]*[0-9]+")
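As a quick sanity check of the idea, the following base-R sketch compares a human-only pattern with a species-agnostic one. It uses `grepl` rather than `gsubfn::strapplyc` so that it runs without extra packages; the species-agnostic pattern is an assumption derived from the ID format described above, and the example IDs are only illustrative.

```r
# Illustrative inputs: human, mouse and zebrafish-style gene IDs plus a gene symbol.
ids <- c("ENSG00000139618", "ENSMUSG00000017167", "ENSDARG00000000001", "Slc12a1")

# Human-only test: requires "ENSG" to be followed directly by digits.
human_only <- grepl("^ENSG[0-9]+", ids)

# Species-agnostic test (assumed pattern): any letters may sit between
# "ENS" and the numeric part.
any_species <- grepl("^ENS[[:alpha:]]*[0-9]+", ids)

result <- data.frame(id = ids, human_only, any_species)
print(result)
```

Only the human ID passes the first test, while all three Ensembl-style IDs pass the second and the plain gene symbol passes neither.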

Also, when I tried to use my custom BSgenome and TxDb packages based on the de novo assembled transcriptome with an Ensembl reference, I got similar errors, because the original function “Gene_Transcript_Features” is based on ENTREZID but my circexplorer2 output is based on Ensembl IDs.

Therefore, I added an Ensembl ID test to the “Gene_Transcript_Features” function, as below.

For line 450:

  # test0 captures the letter prefix of an Ensembl-style ID (e.g. "ENSMUSG");
  # test captures the numeric part that follows that prefix.
  test0 <- gsubfn::strapplyc(as.character(Gene_Symbol),pattern="^(ENS[[:alpha:]]*).*")
  test <- gsubfn::strapplyc(as.character(Gene_Symbol),pattern=paste0("^",test0[[1]],"([-0-9]+)"))
  if (length(test[[1]]) > 0)  # Ensembl ID
  {
    ensembl_gene <- paste(test0[[1]],test[[1]],sep="")
    # Look up the annotation by Ensembl ID, then query the transcript reference with the Ensembl ID
    a <- select(GeneList$Annotation_Library, keys = Gene_Symbol, columns=c("ENTREZID", "SYMBOL", "ENSEMBL"),keytype="ENSEMBL")
    if("EXONRANK"%in%keytypes(GeneList$transcript_reference)){
      b <- select(GeneList$transcript_reference, keys = a$ENSEMBL, columns=c('GENEID', 'TXNAME'),keytype="GENEID")
    }else{
      b <- select(GeneList$transcript_reference, keys = a$ENSEMBL, columns=c('GENEID', 'TXNAME', 'EXONRANK'),keytype="GENEID")
    }
  }
  else
  {
    # Gene symbol: look up the ENTREZID first, then query the transcript reference with it
    a <- select(GeneList$Annotation_Library, keys = Gene_Symbol, columns=c("ENTREZID", "SYMBOL"),keytype="SYMBOL")
    if("EXONRANK"%in%keytypes(GeneList$transcript_reference)){
      b <- select(GeneList$transcript_reference, keys = a$ENTREZID, columns=c('GENEID', 'TXNAME'),keytype="GENEID")
    }else{
      b <- select(GeneList$transcript_reference, keys = a$ENTREZID, columns=c('GENEID', 'TXNAME', 'EXONRANK'),keytype="GENEID")
    }
  }

I forked this project and made the above changes in my fork. Could you please check my modifications?

https://github.com/wong-ziyi/Ularcirc

Thanks.

davhum commented 2 years ago

That makes a lot of sense. Thanks for identifying the issue. Will double check and merge. Cheers, D

davhum commented 2 years ago

Thanks again for your suggestions. I have incorporated the changes in the latest update.