GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
171 stars 22 forks source link

Transcripts identified by Bambu have a CPM or count value of 0 in all samples #422

Closed yuyun-zhang closed 1 month ago

yuyun-zhang commented 2 months ago

Hi,

Thanks for developing bambu, it's very helpful in my project. I have a question about bambu: why transcripts identified by Bambu have a CPM or count value of 0 in all samples? I have 9 samples, and run all samples together. Bambu output total 210,028 transcripts, but I found there are 125,254 transcripts with CPM of 0 in all sample in the output file CPM_transcript.txt. And 125,003 transcripts with count value of 0 in counts_transcript.txt.

I think if a transcript can be detected and output to the extended_annotations.gtf, it should has support reads and the expression value may not be 0. So I am wondering if there are any rules in quantifying transcripts and correcting the value to 0?

Thank you!

andredsim commented 2 months ago

Hi,

I cannot be sure without seeing the results, but my best guess is that these 125,254 annotations are reference annotations that are not expressed in your sample. Bambu does not remove any annotations as it is hard to prove a transcript is not expressed, rather than simply not detected, that is why we extend the annotations, and only add transcripts to the final annotations. We then rely on the expectation maximization algorithm to reduce the impact of the potentially unexpressed transcripts.

If you are only interested in transcripts which have full length support you can filter the output to only include those that were detected by following the steps under this section of the documentation https://github.com/GoekeLab/bambu?tab=readme-ov-file#Output.

Kind Regards, Andre Sim

yuyun-zhang commented 2 months ago

Hi Andre,

Thanks for your reply! Out of the 125,254 transcripts, I found 2 novel transcripts (BambuTx584 and BambuTx1634). The remaining transcripts are reference annotations that are not expressed in my samples.

Here are my output files: CPM_transcript.txt counts_transcript.txt

Here is an example of some transcripts with CPM=0 in all samples: BambuTx584 ENSG00000253333.1 0 0 0 0 0 0 0 0 0 BambuTx1634 ENSG00000250462.8 0 0 0 0 0 0 0 0 0 ENST00000003084.10 ENSG00000001626.15 0 0 0 0 0 0 0 0 0 ENST00000005226.11 ENSG00000006611.15 0 0 0 0 0 0 0 0 0 ENST00000006015.3 ENSG00000005073.5 0 0 0 0 0 0 0 0 0 ENST00000006724.7 ENSG00000007306.14 0 0 0 0 0 0 0 0 0 ENST00000007735.3 ENSG00000006059.3 0 0 0 0 0 0 0 0 0 ENST00000009180.9 ENSG00000010278.13 0 0 0 0 0 0 0 0 0 ENST00000010299.10 ENSG00000009780.15 0 0 0 0 0 0 0 0 0

These 2 BambuTx also with count=0 in all samples in counts_transcripts.txt file.

Yuyun

andredsim commented 2 months ago

Hi Yuyun,

We discussed this internally to try and determine why you are getting these results. We came up with the following:

  1. We recently released a new version (3.4.1 release 3_18) which fixes an issue with unspliced read counts being assigned incorrectly. Could you please try redownload the devel branch from github and see if these changes these results.
  2. Could you please check if BambuTx584 and BambuTx1634 are subsets (incomplete splice matches) of another novel transcript from their respective genes as well as the expression of all transcripts from these genes. It is possible the expression of one transcript in the gene is dwarfing the novel transcript. If you are unsure how to do this, you can post the full output of rowRanges(se)[mcols(se)$GENEID == "ENSG00000253333.1"] and assays(se)$counts[mcols(se)$GENEID == "ENSG00000253333.1",] outputs here.
  3. Can please you post the script you used to run bambu? So that we can try understand why that is happening it will help us to know which parameters you were using.

If we cannot diagnose it from the above steps, it will be helpful for us to have a look at your intermediate files to try and solve this for you. This would involve you sending us the read class files .rds rcOutDir = "path/somewhere, the final se object with assignDist=TRUE and the full console output with verbose=TRUE when running Bambu.

Kind Regards, Andre Sim

yuyun-zhang commented 2 months ago

Hi Andre,

Thanks for the detailed advice! I used 3.4.0 bambu to obtain these results. I am trying to install new version (3.4.1 release 3_18) and will send the intermediate files if I successfully install the new version. Here is my script using 3.4.0:

library(bambu)
library(BiocFileCache)
# annotations <- prepareAnnotations(gtf)
annotations <- readRDS("bamboo_annotations.rds")
# se1 <- bambu(reads = sample, rcOutDir = "./bambu_rcfiles/", annotations = annotations, genome = genome)
bfc <- BiocFileCache("./bambu_rcfiles/", ask = FALSE)
info <- bfcinfo(bfc)

# I used results from the command below
se <- bambu(reads = info$rpath, annotations = annotations, genome = genome, opt.discovery = list(min.txScore.singleExon = 0)) 
saveRDS(se,"bambu_multi_singleexone.out.rds")
writeBambuOutput(se, path = "./bambu_multi_singleexone/")

Here is output to check if BambuTx584 and BambuTx1634 are subsets of another novel transcript (It seems that BambuTx584 and BambuTx1634 are the only novel transcript from their respective genes):

> rowRanges(se)[mcols(se)$GENEID == "ENSG00000253333.1"]
GRangesList object of length 2:
$BambuTx584
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand | exon_rank exon_endRank
         <Rle>         <IRanges>  <Rle> | <numeric>    <numeric>
  [1]     chr5 70480867-70481538      - |         2            1
  [2]     chr5 70495024-70495198      - |         1            2
  -------
  seqinfo: 195 sequences from an unspecified genome; no seqlengths

$ENST00000509969.2
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand | exon_rank exon_endRank
         <Rle>         <IRanges>  <Rle> | <numeric>    <numeric>
  [1]     chr5 70495100-70495198      - |         3            1
  [2]     chr5 70497037-70497121      - |         2            2
  [3]     chr5 70500717-70500903      - |         1            3
  -------
  seqinfo: 195 sequences from an unspecified genome; no seqlengths
> assays(se)$counts[mcols(se)$GENEID == "ENSG00000253333.1",]
                  ad_ENCFF156TTD_clean.sort ad_ENCFF311CZO_clean.sort
BambuTx584                                0                         0
ENST00000509969.2                         0                         0
                  ad_ENCFF446EFU_clean.sort ad_ENCFF708BOP_clean.sort
BambuTx584                                0                         0
ENST00000509969.2                         0                         0
                  ad_ENCFF785KVJ_clean.sort control_ENCFF206TQZ_clean.sort
BambuTx584                                0                              0
ENST00000509969.2                         0                              0
                  control_ENCFF260AWP_clean.sort control_ENCFF827DUW_clean.sort
BambuTx584                                     0                              0
ENST00000509969.2                              0                              0
                  control_ENCFF838DFB_clean.sort
BambuTx584                                     0
ENST00000509969.2                              0
> rowRanges(se)[mcols(se)$GENEID == "ENSG00000250462.8"]
GRangesList object of length 4:
$BambuTx1634
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand | exon_rank exon_endRank
         <Rle>         <IRanges>  <Rle> | <numeric>    <numeric>
  [1]    chr17 30631896-30631979      + |         1            3
  [2]    chr17 30633098-30633390      + |         2            2
  [3]    chr17 30633994-30634322      + |         3            1
  -------
  seqinfo: 195 sequences from an unspecified genome; no seqlengths

$ENST00000398851.7
GRanges object with 5 ranges and 2 metadata columns:
      seqnames            ranges strand | exon_rank exon_endRank
         <Rle>         <IRanges>  <Rle> | <numeric>    <numeric>
  [1]    chr17 30629680-30629794      + |         1            5
  [2]    chr17 30631755-30631895      + |         2            4
  [3]    chr17 30633098-30633390      + |         3            3
  [4]    chr17 30633994-30634098      + |         4            2
  [5]    chr17 30637243-30637302      + |         5            1
  -------
  seqinfo: 195 sequences from an unspecified genome; no seqlengths

$ENST00000412831.1
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand | exon_rank exon_endRank
         <Rle>         <IRanges>  <Rle> | <numeric>    <numeric>
  [1]    chr17 30631755-30632099      + |         1            3
  [2]    chr17 30633098-30633390      + |         2            2
  [3]    chr17 30633994-30634170      + |         3            1
  -------
  seqinfo: 195 sequences from an unspecified genome; no seqlengths

$ENST00000417404.2
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand | exon_rank exon_endRank
         <Rle>         <IRanges>  <Rle> | <numeric>    <numeric>
  [1]    chr17 30632122-30633390      + |         1            2
  [2]    chr17 30633994-30637466      + |         2            1
  -------
  seqinfo: 195 sequences from an unspecified genome; no seqlengths
> assays(se)$counts[mcols(se)$GENEID == "ENSG00000250462.8",]
                  ad_ENCFF156TTD_clean.sort ad_ENCFF311CZO_clean.sort
BambuTx1634                               0                         0
ENST00000398851.7                         0                         0
ENST00000412831.1                         0                         0
ENST00000417404.2                         7                        42
                  ad_ENCFF446EFU_clean.sort ad_ENCFF708BOP_clean.sort
BambuTx1634                               0                         0
ENST00000398851.7                         0                         0
ENST00000412831.1                         0                         0
ENST00000417404.2                        42                        10
                  ad_ENCFF785KVJ_clean.sort control_ENCFF206TQZ_clean.sort
BambuTx1634                               0                              0
ENST00000398851.7                         0                              0
ENST00000412831.1                         0                              0
ENST00000417404.2                         3                              3
                  control_ENCFF260AWP_clean.sort control_ENCFF827DUW_clean.sort
BambuTx1634                                    0                              0
ENST00000398851.7                              0                              0
ENST00000412831.1                              0                              0
ENST00000417404.2                              4                              8
                  control_ENCFF838DFB_clean.sort
BambuTx1634                                    0
ENST00000398851.7                              0
ENST00000412831.1                              0
ENST00000417404.2                              8

Best, Yuyun

andredsim commented 2 months ago

Hi Yuyun,

Thanks for sending this, this was helpful to help us rule out a few possible reasons. Did running Bambu with the latest version change the results? If not, unfortunately the reason for this eludes us so we will need to delve deeper. We have two possible explanations, one being that the second example, the counts are being assigned to the other gene which overlaps these same exons, and the other being that the final transcript models are slightly different from the read classes when multiple samples are merged, so the original read class is assigned elsewhere.

To solve this, are you able to rerun your analysis with assignDist=TRUE and rcOutDir = "path/somewhere and send us both the final se object (saved with saveRDS) and the output saved in rcOutDir and either attach it here, or send us an email with the files so that we can look into this further for you?

Kind Regards, Andre Sim

yuyun-zhang commented 1 month ago

Hi Andre,

Sorry for the delayed response, and thank you for providing me with advice!

I installed the bambu 3.4.1 (release_3_18), but I can't use the parameter assignDist=TRUE. Here is the command and error information. Could you please guide me how to use assignDist=TRUE? I didn't find this parameter in Bambu Arguments.

se <- bambu(reads = sample, annotations = annotations, genome = genome, opt.discovery = list(min.txScore.singleExon = 0),rcOutDir = "./", assignDist=TRUE) 

Error in bambu(reads = sample, annotations = annotations, genome = genome,  : 
  unused argument (assignDist = TRUE)
Execution halted

Now I am runing the command below, and waiting for the results.

se <- bambu(reads = sample, annotations = annotations, genome = genome, opt.discovery = list(min.txScore.singleExon = 0),rcOutDir = "./") 

Thank you again! Yuyun

yuyun-zhang commented 1 month ago

After running the command, I got an error, and I'm currently trying to resolve it. I am wondering if you have any suggestions? Thank you!

se <- bambu(reads = sample, annotations = annotations, genome = genome, opt.discovery = list(min.txScore.singleExon = 0),rcOutDir = "./") 

Error: BiocParallel errors
  1 remote errors, element index: 1
  8 unevaluated and other errors
  first remote error:
Error in (function (cond) : error in evaluating the argument 'x' in selecting a method for function 'bfcnew': Failed to collect lazy table.
Caused by error in `db_collect()`:
! Arguments in `...` must be used.
✖ Problematic argument:
• ..1 = Inf
ℹ Did you misspell an argument name?
In addition: Warning messages:
1: There was 1 warning in `summarise()`.
ℹ In argument: `intersectWidth = max(intersectWidth)`.
Caused by warning in `max()`:
! no non-missing arguments to max; returning -Inf 
2: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust accordingly.
ℹ The deprecated feature was likely used in the bambu package.
  Please report the issue to the authors.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. 
andredsim commented 1 month ago

Hi Yuyun,

Sorry for the delayed response. This is very odd you are getting the error that the assignDist parameter is not there, as it has been in our Bambu versions for awhile. Would you be able to send the sessionInfo() output (once you have loaded in Bambu)? One issue might be the R version you have installed which limits the maximum Bambu version that will be downloaded with Bioconductor. This might also explain the second error you found where it cannot find the method for bfcnew.

Thanks, Andre Sim

yuyun-zhang commented 1 month ago

Hi Andre,

Thank you! Here is the sessionInfo() output:

> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /u/home/y/yuyun/miniconda3/envs/bambu/lib/libopenblasp-r0.3.27.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] bambu_3.4.1                 BSgenome_1.62.0            
 [3] rtracklayer_1.58.0          Biostrings_2.66.0          
 [5] XVector_0.38.0              SummarizedExperiment_1.28.0
 [7] Biobase_2.58.0              GenomicRanges_1.50.0       
 [9] GenomeInfoDb_1.34.9         IRanges_2.32.0             
[11] S4Vectors_0.36.0            BiocGenerics_0.44.0        
[13] MatrixGenerics_1.10.0       matrixStats_1.3.0          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12              lattice_0.22-6           tidyr_1.3.1             
 [4] prettyunits_1.2.0        png_0.1-8                Rsamtools_2.14.0        
 [7] digest_0.6.35            utf8_1.2.4               BiocFileCache_2.2.1     
[10] R6_2.5.1                 RSQLite_2.3.6            httr_1.4.7              
[13] pillar_1.9.0             zlibbioc_1.44.0          rlang_1.1.3             
[16] GenomicFeatures_1.46.5   progress_1.2.3           curl_4.3.3              
[19] data.table_1.15.4        blob_1.2.4               Matrix_1.6-5            
[22] BiocParallel_1.32.5      stringr_1.5.1            RCurl_1.98-1.12         
[25] bit_4.0.5                biomaRt_2.50.3           DelayedArray_0.24.0     
[28] compiler_4.2.0           pkgconfig_2.0.3          tidyselect_1.2.1        
[31] KEGGREST_1.34.0          tibble_3.2.1             GenomeInfoDbData_1.2.9  
[34] codetools_0.2-20         XML_3.99-0.14            fansi_1.0.6             
[37] crayon_1.5.2             dplyr_1.1.4              dbplyr_2.5.0            
[40] GenomicAlignments_1.34.0 bitops_1.0-7             rappdirs_0.3.3          
[43] grid_4.2.0               jsonlite_1.8.8           lifecycle_1.0.4         
[46] DBI_1.2.2                magrittr_2.0.3           cli_3.6.2               
[49] stringi_1.7.6            cachem_1.0.8             xml2_1.3.4              
[52] filelock_1.0.3           vctrs_0.6.5              generics_0.1.3          
[55] xgboost_1.7.7.1          rjson_0.2.21             restfulr_0.0.15         
[58] tools_4.2.0              bit64_4.0.5              glue_1.7.0              
[61] purrr_1.0.2              hms_1.1.3                parallel_4.2.0          
[64] fastmap_1.1.1            yaml_2.3.5               AnnotationDbi_1.56.2    
[67] memoise_2.0.1            BiocIO_1.8.0            
andredsim commented 1 month ago

Hi Yuyun,

Very sorry I made a mistake earlier. We actually renamed assignDist to returnDistTable, and I forgot about that change. Could you please try rerun it with that instead. Regarding the other issue, this seems to be stemming from rcOutDir and I am getting the same error on my side too which I am trying to resolve now. Seems like the parameters for the bfcnew function are different now.

While I fix this could you please run it as follows.

rcfile <- bambu(reads = sample, annotations = annotations, genome = genome, opt.discovery = list(min.txScore.singleExon = 0), discovery = FALSE, quant = FALSE) 
saveRDS(rcfile, "./rcfile.rds")
se <- bambu(reads = rcfile, annotations = annotations, genome = genome, opt.discovery = list(min.txScore.singleExon = 0),rcOutDir = "./", returnDistTable=TRUE) 
saveRDS(rcfile, "./se.rds")

Kind Regards, Andre Sim

andredsim commented 1 month ago

The error Error in (function (cond) : error in evaluating the argument 'x' in selecting a method for function 'bfcnew': Failed to collect lazy table. Caused by error in db_collect():

Is due to an incompatibility between the latest dbplyr 2.5.0 and BiocFileCache. I downgraded to dbplur 2.3.4 and restarted R and the error went away as suggested in https://stackoverflow.com/questions/77370659/error-failed-to-collect-lazy-table-caused-by-error-in-db-collect-using. devtools::install_version("dbplyr", version = "2.3.4") This should hopefully fix this error

yuyun-zhang commented 1 month ago

Hi Andre,

Thanks for your information. I downgraded dbplur to 2.3.4 and the issue was fixed. After using bambu 3.4.1 (release_3_18), all novel transcripts are expressed (count value > 0 and CPM > 0) in at least 1 sample. I got the results by running following command:

rcfile <- bambu(reads = sample, annotations = annotations, genome = genome, opt.discovery = list(min.txScore.singleExon = 0), discovery = FALSE, quant = FALSE) 
saveRDS(rcfile, "./rcfile.rds")
se <- bambu(reads = rcfile, annotations = annotations, genome = genome, opt.discovery = list(min.txScore.singleExon = 0),rcOutDir = "./", returnDistTable=TRUE) 
saveRDS(se, "./se.rds")

I am wondering if you still need to check the rcfile.rds and se.rds file? If so, I can send these files to your email. Thank you again!

Yuyun

andredsim commented 1 month ago

I am glad the issue is fixed. No need to send these files if the issue was due to version differences because of the dependencies. I will close this now, but if any other related issues arise feel free to reopen it or create a new issue. Best of luck with your research!