f-hamidlab / factR2

Functional Annotation of Custom Transcriptomes v2
https://f-hamidlab.github.io/factR2/
Apache License 2.0
3 stars 1 forks source link

Error in createfactRObject on example data #8

Open ChrissiKalk97 opened 1 week ago

ChrissiKalk97 commented 1 week ago

Hello,

I would like to run factR2 on a custom assembly where I ran in the issue described in https://github.com/f-hamidlab/factR2/issues/7. So I thought I try it on the example data you provided pb_custom.gtf.gz. I run the following code in R4.3.2:

library(factR2) library(rtracklayer) gtf.file <- system.file("pb_custom.gtf.gz", package = "factR2") fobj <- createfactRObject(gtf.file, reference = "vM33") 🡆 Checking inputs Warning: package ‘Biostrings’ was built under R version 4.3.3 🡆 Checking factRObject 🡆 Adding custom transcriptome Error in .smartimport(): ! Input not recognized or file does not exists Backtrace:

  1. factR2::createfactRObject(gtf.file, reference = "vM33")
  2. factR2:::.smartimport(gtf, ".gtf", "Custom transcriptome", verbose)

Is this an issue with the R version? Does it work if I update it, or is it another problem?

Thanks in advance for the help!

Best regards,

Christina Kalk

fursham-h commented 1 week ago

Hi @ChrissiKalk97, please rerun the system.file() function as such:

system.file("extdata/pb_custom.gtf.gz", package = "factR2")
ChrissiKalk97 commented 1 week ago

Thank you @fursham-h for your quick reply. I now ran:

gtf.file <- system.file("extdata/pb_custom.gtf.gz", package = "factR2") fobj <- createfactRObject(gtf.file, reference = "vM33") 🡆 Checking inputs 🡆 Checking factRObject 🡆 Adding custom transcriptome ℹ Importing from local directory Registered S3 method overwritten by 'data.table': method from print.data.table
🡆 Adding annotation ℹ Importing from URL 🡆 Adding genome sequence ℹ Using BSgenome object Warning: 4 seqlevel(s) in your custom GTF are not found in reference annotation 🡆 Matching chromosome names 🡆 Matching gene information Number of mismatched gene_ids found: 83520 --> Attempting to match ensembl gene_ids... --> No ensembl gene ids found in query ---> Attempting to match gene_ids by finding overlapping coordinates... ---> 65203 gene_id matched Total gene_ids corrected: 65203 Remaining number of mismatched gene_ids: 18317 🡆 Creating factRset objects ℹ Adding gene information ℹ Adding transcript information ℹ Adding alternative splicing information Warning: Expected 3 pieces. Additional pieces discarded in 9 rows [1, 2, 3, 4, 5, 6, 7, 8, 9]. Warning in .merge_two_Seqinfo_objects(x, y) : Each of the 2 combined objects has sequence levels not in the other:

  • in 'x': MU069434.1, MU069435.1, GL456233.2, GL456370.1, JH584304.1
  • in 'y': JH584304 Make sure to always combine/compare objects based on the same reference genome (use suppressWarnings() to suppress this warning). 🡆 Annotating novel transcripts 🡆 Annotating novel AS events Error in dplyr::mutate(): ℹ In argument: coord = paste0(seqnames, ":", start, "-", end). Caused by error in as.vector(): ! cannot coerce type 'closure' to vector of type 'character' Backtrace:
    1. factR2::createfactRObject(gtf.file, reference = "vM33")
    2. base::paste0(seqnames, ":", start, "-", end)
    3. base::as.character.default(<stndrdGn>)

I think this is now the same problem as described in https://github.com/f-hamidlab/factR2/issues/7. You did not come to find a fix for this yet?

Best,

Christina

fursham-h commented 6 days ago

Hi @ChrissiKalk97

Thank you for pointing that out. We have fixed that issue in the commit https://github.com/f-hamidlab/factR2/commit/1c6ad2e321d5dda8512f65cd845285d131904643, as well as other bugs that we found along the way. Do test it out and let us know if it still crops up.

ChrissiKalk97 commented 5 days ago

Thank you @fursham-h It is now working for the example data and also for my custom transcriptome. I ran: seqlevels(ref.gtf) <-unique(c(seqlevels(custom.gtf), seqlevels(ref.gtf))) extended_Ens_110.gtf <-c(ref.gtf, custom.gtf) custfactRobj <- createfactRObject(extended_Ens_110.gtf, "v44") I get a warning to match chromosome names, but I guess this should be fine as the reference and the custom file have the same chromosome naming.

I still have some more questions on the usage and output of factR2:

I can create a factRobject and build the CDS as well as determine the NMD status. When I try to run the example of the vignette for getting the number of NMD sensitive and insensitive transcripts, I obtain the following error: `features(custfactRobj) %>%

This is not an issue as I can simply count the occurrence of "yes" and "no" in the custfactRobj@sets$transcript@rowData$nmd column.

What I would be interested in is to obtain the coordinates of the defined CDS. Is this somehow possible? The custfactRobj@sets$transcript@rowData$cds column simply states "yes" or "no". Does "yes" mean that a CDS could be determined and "no" that it was not possible? I tried to export a GTF of the factRobject for which the CDS was determined. But this only contained the CDS information of the reference assembly and not for the transcripts of the custom assembly. As the 3' UTR length is reported I assume, that I could calculate the transcript length and substract the 3'UTR length to obtain the CDS end coordinate. Is it also possible to obtain the CDS start coordinate?

Sorry to bother you with so many questions...Thanks in advance :)

fursham-h commented 5 days ago

Hey. So there are several things here.

  1. features(custfactRobj) %>% group_by(nmd) %>% tally() will only work if you set activeSet() to "transcript". You can use txs(custfactRobj) instead.
  2. This must be a bug, the CDSs of the new transcripts should be in custfactRobj@transcriptome. Let me take a look and fix this.