GoekeLab / bambu

Reference-guided transcript discovery and quantification for long read RNA-Seq data
GNU General Public License v3.0
171 stars 22 forks source link

Incomplete result from minimap2-aligned bam files #395

Closed zhangkn3 closed 8 months ago

zhangkn3 commented 8 months ago

Hi, I have been using bambu lately for novel isoform identification from long-read RNA-seq data these days. And it is efficient and useful.

However, an issue happened when performing novel isoform identification from minimap2 aligned bam files. The metadata only included 4 columns, and the novelTranscript information is lacking. Hence, the novel isoforms can not be identified.

Pic

But running bambu from STAR-aligned bam files could be well performed. Pic2

I hope I have described this issue clearly. Any hint on how to solve the issue is very appreciated!

Best, Kenan

zhangkn3 commented 8 months ago

Additionally, the UCSC reference genome (hg38.fa) and annotation file (hg38.knownGene.gtf) were used in the code. Thanks!

andredsim commented 8 months ago

Hi Kenan,

Thanks for reaching out. I am not yet too sure what could be causing this, as the aligner theoretically should not impact the output format of Bambu. Could you please share the lines of code you used to generate these two screenshots so I can get a better picture of what might be happening. You can rename all sample files if you need to maintain privacy.

zhangkn3 commented 8 months ago

Hi Andre, Thanks for your prompt response. Here are my codes used for the bambu project. The only difference is the input bam files I gave to these codes. Also, thanks for your kind help.

bambu project

library(MatrixGenerics) library(sparseMatrixStats) library(DelayedMatrixStats) library(Rsamtools) library(bambu) library(BSgenome) gencode_seq <- "/project/lrRNA/bambu/hg38.fa" gencode_gtf <- prepareAnnotations("/project/lrRNA/bambu/hg38.knownGene.gtf") bam_file <- "/project/lrRNA/STARlong/GBM2010Aligned.sortedByCoord.out.bam" ont_iso_analysis <- bambu(reads = bam_file, annotations = gencode_gtf, genome = gencode_seq, quant = FALSE, ncore = 8) writeToGTF(ont_iso_analysis, "/project/lrRNA/bambu/output10M.gtf") ont_iso_analysis.novel <- ont_iso_analysis[mcols(ont_iso_analysis)$novelTranscript,] writeToGTF(ont_iso_analysis.novel, "/project/lrRNA/bambu/output.novel_isoform10M.gtf")

andredsim commented 8 months ago

Hi Kenen,

Thanks for this, I noticed in the first screen shot you use mcols(newAnnotations) which isn't a variable you included in the code you posted, whereas the second is of ont_iso_analysis. Could I ask where newAnnotations comes from. If you save the output to GTF and then reread that GTF in with prepareAnnotations it unfortunately it does not keep all the metadata as we only save the standard GTF fields. (In an upcoming update we will be saving the NDR values in the gtf file) If you would like to save the annotations with the metadata included so you can use it for future analysis I recommend using saveRDS(ont_iso_analysis, "/save/path") Let me know if this might be the cause, otherwise I can keep trying to figure out what might have caused this.

Kind Regards, Andre Sim

zhangkn3 commented 8 months ago

Thanks Andre! The newAnnotations is also the same as not_iso_analysis, I did not save the GTF and reread it. But I will try the code you recommended, and get back to you.

Best, Kenan

andredsim commented 8 months ago

Hi Kenen,

I tried some more tests on my side and havn't figured out how this could occur. Perhaps you could sample your two bam files down to an easily transferable size, check the issue still occurs, and if so and it is not sensitive data, upload them here and I can try run them to see if I can replicate the issue. This will make it a lot easier for me to troubleshoot.

Kind Regards, Andre

zhangkn3 commented 8 months ago

Hi Andre, I have rerun the codes and here is the result from the minimap2 aligned bam. I have added the warnings in the screenshot. Hope this could help you find the trouble. Thx, Kenan image

andredsim commented 8 months ago

Oh I think I see the issue The warning "The current filtering criteria filters out all new read classes, please consider less stringent critera" What this means is no new transcript models are predicted so it returns the original annotations. There are a few reasons this could happen.

  1. The default NDR threshold is too low and Bambu is being too stringent for your sample. You could try run it with NDR = 1 which is maximum sensitivity.
  2. The two bam files were not aligned to the same genome and for the bam file that is not working, the chromosome names do not match the chromosome names in the genome you have provided. This is a very common issue and often occurs from differences between "chr1" and "1" as scaffold names.
  3. It could be that no read classes were constructed, potentially as a consequence of 2. or other reasons. If the above to steps do not solve the issue can you send me the output from the metadata(output)$warnings
  4. Is this a standard sequencing run or is special prep involved? Certain protocols such as targeted sequencing need to be handled different or this error will be encountered.

Let me know how it goes.

zhangkn3 commented 8 months ago

Thanks Andre. I have checked your suggestions.

  1. In the codes I used, "quant = FALSE" should be echos to "NDR = 1" according to the bambu manual. I also tried the new code with "NDR =1", and the result was still the same.
  2. The bam files aligned by minimap2 and STAR_long used the same reference genome. I also checked the scaffold names of the bam files and the reference genome are both "chr1".
  3. Here are the warnings. Please check. image
  4. This is a standard long-read RNA sequencing data produced by NanoPore from tumor tissue.

Great thanks for your kind help! Kenan

andredsim commented 8 months ago

Hi Kenan,

Thanks for checking that. The warning number 3 is the important one here. Do you expect no spliced reads in your dataset? Given this is a standard tumor tissue in human I assume that is unlikely. Common causes for this is if the alignment was done without splice aware mode. Minimap2 is recommended to be run like this ./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam for nanopore reads. Alternatively alignments need to be done to the genome and not the transcriptome, which would also cause unspliced reads, however given you said the scaffold names are "chr1" this is less likely.

Regarding point 1. NDR = 1 and quant = FALSE are very different. Setting quant to false stops Bambu from quantifying the output and will just return the filtered annotations (Filtered by NDR value). NDR = 1 runs the discovery step at its most sensitive meaning it will return all annotations regardless of their NDR value. If it doesn't trouble you, could you let me know what part of the manual this was, so that I can adjust it so it is clearer for future users!

Let me know if the alignment was the issue or if you do only expect un-spliced reads. Thanks, Andre

zhangkn3 commented 8 months ago

Thanks, Andre. I think the issue must be happened with the minimap2 alignment. I will revise the code and rerun bambu, and come back to you later. Thanks a lot for your kind help! Kenan

DepledgeLab commented 8 months ago

Hello, I've been testing bambu on a dataset i am interested in and also get the warning ""The current filtering criteria filters out all new read classes, please consider less stringent criteria"

I'm running it as follows

se <- bambu(reads = "../rawdata/input.genome.sorted.bam", annotations = NULL, genome = "../rawdata/ref.fasta", NDR = 1, quant = FALSE) writeToGTF(se, "bambu.gtf")

and get the following warnings/errors

Detected 4 warnings across the samples during read class construction. Access warnings with metadata(bambuOutput)$warnings
--- Start extending annotations ---
The current filtering criteria filters out all new read 
            classes, please consider less strigent critria!

Error in `auto_copy()`:
! `x` and `y` must share the same src.
i `x` is a <tbl_df/tbl/data.frame> object.
i `y` is `NULL`.
i Set `copy = TRUE` if `y` can be copied to the same source as `x` (may be slow).
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
Unknown or uninitialised column: `exon_rank`. 
> se.novel = se[mcols(se)$fullLengthCounts >= 0.1,]
> writeToGTF(se, "bambu.gtf")
Error in `auto_copy()`:
! `x` and `y` must share the same src.
i `x` is a <tbl_df/tbl/data.frame> object.
i `y` is `NULL`.
i Set `copy = TRUE` if `y` can be copied to the same source as `x` (may be slow).
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
Unknown or uninitialised column: `exon_rank`. 

The system I'm working in has predominantly single exon transcripts and so I am wondering if I need to configure bambu differently to deal with this?

andredsim commented 8 months ago

Hi,

Yes by default Bambu excludes all single exon transcript. Please see this section of our documentation https://github.com/GoekeLab/bambu#Including-single-exons. You need to add this parameter to prevent them being filtered out. opt.discovery = list(min.txScore.singleExon = 0)) Bambu is however designed for spliced transcripts and was only evaluated using them so I cannot vouch for the performance when there is predominately single exon transcripts as this is something we were hoping to look into in the future. I would be interested to hear how it performs.

Kind Regards, Andre Sim

DepledgeLab commented 8 months ago

That is super helpful. thank you very much!

andredsim commented 8 months ago

Hi Zhang, Since we have not heard back I hope this solved you issue and will close it for now. If it hasn't let us know please write back with a detailed description of what is occurring and preferentially a reproducible example and we can reopen it.

zhangkn3 commented 8 months ago

Hi Andre, Thanks for your response! Our HPC broke out last week, and all the jobs were killed. I am now rerunning the minimap2 alignment pipeline. I will get back to you after I get the results. Thanks again.

andredsim commented 8 months ago

Ok sorry to hear about the killed jobs, if you are still having issues once it finishes I can reopen this.

zhangkn3 commented 8 months ago

Hi Andre, Thanks for your kind help. The issue is because of the minimap2 alignment codes. Bambu can run well with the minimap2 -ax splice results. Additionally, I rechecked the manual and found the misunderstanding of NDR = 1 and quant = FALSE should be my fault.

Great thanks for your assistance, and thanks again for the brilliant software. Kenan

andredsim commented 8 months ago

Glad to hear everything is resolved! I hope you get some nice results!