Closed: msagniez closed this issue 1 year ago
Hi, thanks for reporting this.
As a temporary fix I believe commenting this line may resolve this issue. https://github.com/GoekeLab/bambu/blob/ffa53178915bc4f939f80f489a511f1b1861a27c/R/bambu.R#L174
However, I would like to understand why this is occurring. If this isn't sensitive data, would it be possible to rerun bambu with rcOutDir = "/path/somewhere" and send me the .rds file that it outputs? This would greatly help me debug. From the error message, it seems the sample name of your bam file is dropped at some point.
Hi Andre! Thank you very much for your prompt answer. Could I email you the output files directly? It's not "sensitive" data per se, but I'd prefer to keep it private.
On a side note, it seems like adding the rcOutDir = "/path/somewhere" option resolves the error:
se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file, verbose=TRUE, rcOutDir = "/scratch/melanie/")
--- Start generating read class files ---
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories", package =
"BiocManager")' for details.
Replacement repositories:
CRAN: https://cloud.r-project.org
Sample1_Guppy6_sup_400bps_mm2spliceNoseck14_GenSorted
Finished creating junction list with splice motif in 0 mins.
before strand correction, annotated introns: 671 (0.0263685306715919%)
Junction correction with not enough data, precalculated model is used
Model to predict true splice sites built in 0 mins.
reads count for all annotated junctions: 9684610 (0.94367049304652%)
reads count for all annotated junctions after correction to reference junction: 9802362 (0.955144273394641%)
Finished correcting junction based on set of high confidence junctions in 0 mins.
Finished creating transcript models (read classes) for reads with spliced junctions in 0.3 mins.
Finished create single exon transcript models (read classes) in 0.1 mins.
On the dataset the new trained model achieves a ROC AUC of 1 and a Precision-Recall AUC of 0.993.This is compared to the Bambu pretrained model trained on human ONT RNA-Seq data model which achiveves a ROC AUC of 0.985 and a Precision-Recall AUC of 0.651
Finished generating scores for read classes in 0.1 mins.
Detected 0 warnings across the samples during read class construction. Access warnings with metadata(bambuOutput)$warnings
--- Start extending annotations ---
combing spliced feature tibble objects across all samples in 0 mins.
extract new unspliced ranges object for all samples in 0 mins.
reduce new unspliced ranges object across all samples in 0 mins.
combine new unspliced tibble object across all samples in 0 mins.
combining transcripts in 0 mins.
extended annotations for spliced reads in 0 mins.
extended annotations for unspliced reads in 0 mins.
WARNING - Less than 50 TRUE or FALSE read classes for NDR precision stabilization.
NDR will be approximated as: (1 - Transcript Model Prediction Score)
-- Predicting annotation completeness to determine NDR threshold --
Recommended NDR for baseline FDR of 0.1 = 0.089
Using a novel discovery rate (NDR) of: 0.089
transcript filtering in 0 mins.
extend annotations in 0.1 mins.
--- Start isoform quantification ---
calculated distance table in 0.1 mins.
added gene Ids for each read class 0 mins.
Finished estimate degradation bias in 0 mins.
Finished EM estimation in 0 mins.
--- Finished running Bambu ---
Warning message:
There was 1 warning in `filter()`.
ℹ In argument: `newGeneId == min(newGeneId)`.
Caused by warning in `min()`:
! no non-missing arguments to min; returning Inf
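That final warning is plain base R behavior rather than anything bambu-specific: `min()` called with no non-missing arguments returns `Inf` with a warning, which `filter()` then re-surfaces. A minimal reproduction, independent of bambu:

```r
# min() on an empty vector has no non-missing arguments,
# so base R returns Inf with a warning instead of raising an error.
x <- numeric(0)
min(x)
# Warning message:
# In min(x) : no non-missing arguments to min; returning Inf
# [1] Inf
```

So the warning indicates an empty group reached the `newGeneId == min(newGeneId)` filter, not that the computation itself failed.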
Out of curiosity, I compared the bambu outputs with and without the error, and bambu produces identical output files (same md5sum) in both cases. In other words, the error doesn't prevent bambu from writing its outputs. There's only a small formatting issue in the .gtf, with missing line breaks, but nothing too bad:
chrIS Bambu transcript 6108776 6119382 . + . gene_id "G1"; transcript_id "G1_T2";
chrIS Bambu transcript 6165874 6166537 . + . gene_id "G14"; transcript_id "G14_T2";
chrIS Bambu transcript 6179429 6181140 . + . gene_id "BambuGene13"; transcript_id "BambuTx8";chrIS Bambu transcript 6179481 6211352 . - . gene_id "BambuGene40"; transcript_id "BambuTx9";chrIS Bambu transcript 6223313 6227908 . + . gene_id "G154"; transcript_id "G154_T1";
chrIS Bambu transcript 6223343 6227344 . + . gene_id "G154"; transcript_id "G154_T2";
chrIS Bambu transcript 6259934 6286200 . - . gene_id "G152"; transcript_id "G152_T1";
chrIS Bambu transcript 6262594 6286217 . - . gene_id "G152"; transcript_id "G152_T2";
Do you think these outputs can be trusted?
Thank you for your help!
Melanie
Hi Melanie,
Thanks for this info. I believe the original crash is caused by some unrelated handling of the warnings, which shouldn't impact the results at all. I would therefore trust the results, especially since the md5sums are identical. Given the latest run doesn't crash, you no longer need to send me the .rds file, as it won't help me debug this issue. The fact that it runs with rcOutDir has given me a clue, though.
For now I would recommend using rcOutDir so that you can run bambu, and I'll update this issue once I have found the cause of the original crash. Thanks for reporting the missing line breaks; I will look into that as well.
I also note from this output that you are working with sequin spike-ins. Here are some tips from my experience using them in bambu. Depending on the fraction of sequins in the sample, you may need to provide a higher NDR threshold (more sensitive, less precise), as their abundance differs from that of a normal transcript. Your current analysis was run with a novel discovery rate (NDR) of 0.089. This is especially true if you are running bambu without any annotations, where the default NDR thresholds are very stringent. If the sequin fraction is very high, we have noticed that it can influence the trained model, and using the default model by setting fit = FALSE can lead to better results for transcripts on the main chromosomes.
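For reference, the NDR threshold can be passed directly in the bambu call; a sketch based on the call earlier in this thread (the 0.3 value is purely illustrative, not a recommendation):

```r
library(bambu)

# Illustrative only: raise the NDR threshold for a sequin-heavy sample
# to gain sensitivity at the cost of precision. Tune 0.3 to your data.
se <- bambu(reads = test.bam, annotations = bambuAnnotations,
            genome = fa.file, NDR = 0.3)
```

The fit = FALSE option mentioned above would be passed to the same call if the sequin fraction is high enough to distort model training.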
If you have any further questions regarding this feel free to ask!
Kind Regards, Andre Sim
Hi Melanie,
I believe I have a fix for your original error. Changing the line below to:
https://github.com/GoekeLab/bambu/blob/ffa53178915bc4f939f80f489a511f1b1861a27c/R/bambu_utilityFunctions.R#L190
sampleNames = c(sampleNames, colnames(readClassSe))
should resolve it. Feel free to try this in the meantime, and if it works please let me know. We will roll out this fix in the next update.
However, I wasn't able to replicate the formatting issue in the gtf. Could I ask which function you used to produce it, so I can compare it to my test cases?
Kind Regards, Andre Sim
Hi Andre,
Thanks a lot for your help. I tried modifying the line you mentioned, and unfortunately the error remains, even when using the rcOutDir option (which made the error disappear with the original version). Here are the two results I got:
> test.bam <- "Sample1_Guppy6_sup_400bps_mm2spliceNoseck14_GenSorted.bam"
> fa.file <- "RefGenome.fa"
> gtf.file <- "RefTranscriptome.gtf"
> bambuAnnotations <- prepareAnnotations(gtf.file)
> se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file, verbose=TRUE)
--- Start generating read class files ---
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories", package =
"BiocManager")' for details.
Replacement repositories:
CRAN: https://cloud.r-project.org
Sample1_Guppy6_sup_400bps_mm2spliceNoseck14_GenSorted
Finished creating junction list with splice motif in 0 mins.
before strand correction, annotated introns: 671 (0.0263685306715919%)
[14:44:32] WARNING: src/learner.cc:553:
If you are loading a serialized model (like pickle in Python, RDS in R) generated by
older XGBoost, please export the model by calling `Booster.save_model` from that version
first, then load it back in current version. See:
https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
for more details about differences between saving model and serializing.
[14:44:32] WARNING: src/learner.cc:553:
If you are loading a serialized model (like pickle in Python, RDS in R) generated by
older XGBoost, please export the model by calling `Booster.save_model` from that version
first, then load it back in current version. See:
https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
for more details about differences between saving model and serializing.
[14:44:32] WARNING: src/learner.cc:553:
If you are loading a serialized model (like pickle in Python, RDS in R) generated by
older XGBoost, please export the model by calling `Booster.save_model` from that version
first, then load it back in current version. See:
https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
for more details about differences between saving model and serializing.
[14:44:32] WARNING: src/learner.cc:553:
If you are loading a serialized model (like pickle in Python, RDS in R) generated by
older XGBoost, please export the model by calling `Booster.save_model` from that version
first, then load it back in current version. See:
https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
for more details about differences between saving model and serializing.
Junction correction with not enough data, precalculated model is used
Model to predict true splice sites built in 0 mins.
reads count for all annotated junctions: 9684610 (0.94367049304652%)
reads count for all annotated junctions after correction to reference junction: 9802362 (0.955144273394641%)
Finished correcting junction based on set of high confidence junctions in 0 mins.
Finished creating transcript models (read classes) for reads with spliced junctions in 0.3 mins.
Finished create single exon transcript models (read classes) in 0.1 mins.
[14:45:03] WARNING: src/learner.cc:553:
If you are loading a serialized model (like pickle in Python, RDS in R) generated by
older XGBoost, please export the model by calling `Booster.save_model` from that version
first, then load it back in current version. See:
https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
for more details about differences between saving model and serializing.
[14:45:03] WARNING: src/learner.cc:553:
If you are loading a serialized model (like pickle in Python, RDS in R) generated by
older XGBoost, please export the model by calling `Booster.save_model` from that version
first, then load it back in current version. See:
https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
for more details about differences between saving model and serializing.
On the dataset the new trained model achieves a ROC AUC of 1 and a Precision-Recall AUC of 0.993.This is compared to the Bambu pretrained model trained on human ONT RNA-Seq data model which achiveves a ROC AUC of 0.985 and a Precision-Recall AUC of 0.651
Finished generating scores for read classes in 0.1 mins.
Error in names(warnings) <- sampleNames :
'names' attribute [1] must be the same length as the vector [0]
> test.bam <- "Sample1_Guppy6_sup_400bps_mm2spliceNoseck14_GenSorted.bam"
> fa.file <- "RefGenome.fa"
> gtf.file <- "RefTranscriptome.gtf"
> bambuAnnotations <- prepareAnnotations(gtf.file)
> se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file, verbose=TRUE, rcOutDir = "/scratch/melanie/")
--- Start generating read class files ---
'getOption("repos")' replaces Bioconductor standard repositories, see 'help("repositories", package =
"BiocManager")' for details.
Replacement repositories:
CRAN: https://cloud.r-project.org
Sample1_Guppy6_sup_400bps_mm2spliceNoseck14_GenSorted
Finished creating junction list with splice motif in 0 mins.
before strand correction, annotated introns: 671 (0.0263685306715919%)
Junction correction with not enough data, precalculated model is used
Model to predict true splice sites built in 0 mins.
reads count for all annotated junctions: 9684610 (0.94367049304652%)
reads count for all annotated junctions after correction to reference junction: 9802362 (0.955144273394641%)
Finished correcting junction based on set of high confidence junctions in 0 mins.
Finished creating transcript models (read classes) for reads with spliced junctions in 0.3 mins.
Finished create single exon transcript models (read classes) in 0.1 mins.
On the dataset the new trained model achieves a ROC AUC of 1 and a Precision-Recall AUC of 0.993.This is compared to the Bambu pretrained model trained on human ONT RNA-Seq data model which achiveves a ROC AUC of 0.985 and a Precision-Recall AUC of 0.651
Finished generating scores for read classes in 0.1 mins.
Error in names(warnings) <- sampleNames :
'names' attribute [1] must be the same length as the vector [0]
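The error itself is a plain R length check: assigning a `names` attribute whose length differs from the object's length fails. With zero warnings collected, `warnings` has length 0 while `sampleNames` has one entry, which matches the message exactly. A standalone reproduction (variable names chosen to mirror the error, not taken from bambu's source):

```r
warnings <- list()              # no warnings were collected (length 0)
sampleNames <- c("Sample1")     # but one sample name exists (length 1)
names(warnings) <- sampleNames
# Error in names(warnings) <- sampleNames :
#   'names' attribute [1] must be the same length as the vector [0]
```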
I'm then extracting results with
writeBambuOutput(se, path = "/scratch/melanie/bambu/")
but I no longer see the gtf formatting issue; it might have been a local error or disconnection during the command. Sorry for the false alarm.
Thanks a lot for the advice on handling sequin spike-ins! We'll make sure to use it in our analyses. If you need anything to debug this further, please let me know. On our side, I think we'll use the outputs regardless of the error.
Best regards,
Mélanie
For posterity: we found the cause of this bug. It occurred when a bam file flagged no warnings, which is very rare on real data but can happen with filtered fastqs limited to one chromosome or to spike-ins. This has been addressed in https://github.com/GoekeLab/bambu/pull/388. I will close this issue now. If you see this error again, please feel free to re-comment.
Hi there,
I'm having an issue with bambu which I never had with version 2.2.0, and I was wondering if you could help me with it. I have 4 different bam files (containing 4.4M, 4.75M, 356k, and 26k alignments). The files were generated with Minimap2 (-ax splice --secondary=no -k14) and converted to bam with Samtools; the reads come from Nanopore sequencing runs. Here is what I'm running and the error I get:
I get this error only for the files with 4.4M and 4.75M alignments; the other two run smoothly. I then tried subsetting my bams, and bambu runs correctly with up to 76,502 alignments for the first file and 67,071 for the second, but not one alignment more. Since I have used these same bams with other assemblers and never had an issue, I'm struggling to find the root cause... Do you have any idea what the problem might be?