Closed matomol closed 2 months ago
@matomol Can you provide more information about your basecalling and alignment steps. Specifically what commands were run and what version of Dorado.
$ dorado --version [2024-08-28 12:47:41.624] [info] Running: "--version" 0.7.3+6e6c45c
dorado basecaller hac,5mCG_5hmCG ./pod5 --reference /reference/Nanopore-recommended/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna > ./$OUTPUT_FILE
dorado aligner /reference/Nanopore-recommended/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna $OUTPUT_DIR --output-dir $SAVE_DIR
@matomol I have deleted your comment as I do not know what the nature of the shared data is. If this is data from a reference genome such as HG002, let me know and we will be able to look at it.
In your "Relevant log output" section there should be a work directory path. I'd like to see the tags for one of the implausible reads:
samtools view /path/to/your/bamstats/working/dir/reads.bam | grep '^51ea748d-14f6-4dce-97bc-789706a92b02' | cut -f12- | sed -E 's,(..:.):[^\t]*,\1,g'
Sorry I have no idea where the file "/path/to/your/bamstats/working/dir/reads.bam" resides. There are so many "reads.bam" files in the "variant/workspace" folder.
Is it possible to run this command with the aligned bam file too?
With the original aligned bam file I got:
$ samtools view ./bcalls_hac-mod_np.bam | grep '^51ea748d-14f6-4dce-97bc-789706a92b02' | cut -f12- | sed -E 's,(..:.):[^\t]*,\1,g' qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z MN:i MM:Z ML:B NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z
@matomol The reads.bam we require will specifically be in the path indicated by the error message you posted. The work dir will appear underneath this section:
Command executed:
samtools view -b -h -L allChromosomes.bed reads.bam | bamstats -s Test -i Test.per-file-runids.txt -l Test.basecallers.tsv --histogram Test-histograms -u -f Test.flagstat.tsv --threads 3 - | gzip > "Test.readstats.tsv.gz"
Command error:
Read '51ea748d-14f6-4dce-97bc-789706a92b02' appears to contain implausible alignment information
See https://labs.epi2me.io/how-to-exits/#how-do-i-know-the-exit-code-of-a-workflow for the anatomy of a Nextflow error.
This is an other read, but it caused the same error. $ samtools view /home/ai/data/240819/15378MN_AS/aligned_np/hac-mod/variant/workspace/7a/25c1cd70068a9dc2b4a1da86a4d01c/reads.bam | grep '^3b40aac0-07f2-44f2-97bc-c95219fc20db' | cut -f12- | sed -E 's,(..:.):[^\t]*,\1,g' qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z MN:i MM:Z ML:B NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z zd:i rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z zd:i rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i
With the original aligned bam file I got:
$ samtools view ./bcalls_hac-mod_np.bam | grep '^51ea748d-14f6-4dce-97bc-789706a92b02' | cut -f12- | sed -E 's,(..:.):[^\t]*,\1,g' qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z MN:i MM:Z ML:B NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z
Thanks @matomol - I see there are duplicate NM tags in the reads in your original BAM here. This may be because you have used Dorado to basecall with a --reference (which will handle alignment), and then called dorado aligner to align again. What was the motivation for the call to dorado aligner
?
Well, I didn't know that the basecaller does the alignment too. I thought these are two separate programs that have to be called subsequently. See also my post above.
So after all, I will try nextflow variant calling right after dorado basecalling now. Let's see what happens.
Well the dorado basecaller creates a bam file without an index, and when I try to index this bam file I get an immediate error:
$ samtools index -b ./bcalls_hac-mod_np.bam [E::hts_idx_push] NO_COOR reads not in a single block at the end 5 -1 [E::sam_index] Read 'f8d506cb-4c4d-4260-a081-fb11df76168c' with ref_name='chr6', ref_length=170805979, flags=16, pos=57110062 cannot be indexed samtools index: failed to create index for "./bcalls_hac-mod_np.bam": No such file or directory
I believe this means the BAM is not sorted. Try passing your BAM straight to the workflow, it will work out whether you need sorting and indexing and so on automatically.
On Thu, 29 Aug 2024, 07:27 matomol, @.***> wrote:
Well the dorado basecaller creates a bam file without an index, and when I try to index this bam file I get an immediate error:
$ samtools index -b ./bcalls_hac-mod_np.bam
[E::hts_idx_push] NO_COOR reads not in a single block at the end 5 -1 [E::sam_index] Read 'f8d506cb-4c4d-4260-a081-fb11df76168c' with ref_name='chr6', ref_length=170805979, flags=16, pos=57110062 cannot be indexed samtools index: failed to create index for "./bcalls_hac-mod_np.bam": No such file or directory
— Reply to this email directly, view it on GitHub https://github.com/epi2me-labs/wf-human-variation/issues/212#issuecomment-2316808503, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIN6OWCT5LJLGJSS2ACJLDZT25LZAVCNFSM6AAAAABNH3XG4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJWHAYDQNJQGM . You are receiving this because you commented.Message ID: @.***>
Well, as you said. It's just running fine. Let's see the results in the afternoon.
@matomol Great. I've checked internally and the Dorado issue where re-alignment with dorado aligner after basecalling will be fixed in the next release. I'm closing this ticket as the problem at hand has been resolved but please open a new issue if you encounter any trouble running the workflow.
Operating System
Other Linux (please specify below)
Other Linux
Ubuntu
Workflow Version
wf-human-variation v2.3.1-g6a6daa6
Workflow Execution
Command line (Local)
Other workflow execution
No response
EPI2ME Version
No response
CLI command run
nextflow run epi2me-labs/wf-human-variation \ --out_dir './variant' \ -w './variant/workspace' \ --bam './filtered' \ --ref '/home/ai/data/reference/Nanopore-recommended/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna' \ --sample_name 'Test' \ --snp \ --sv \ --mod \ --phased \ -profile standard > out.txt
Workflow Execution - CLI Execution Profile
None
What happened?
Now when I variant call this run I permanently get an error (see below), which is implausible as alignment run without any error.
When I analyse the mentioned read I see that it is an ambiguous read that occurs several times in the bam file, so maybe one of these occurences is secondary and therefore considered implausible.
When I exclude the mentioned read from the bam file from the remaining bam file an other read pops up as implausible.
Therefore this is a general problem either of the aligner or the bamstat software. Im just trying to find a workaround by filtering the bam file for ambiguous reads before variant calling it.
Relevant log output
Application activity log entry
No response
Were you able to successfully run the latest version of the workflow with the demo data?
yes
Other demo data information
No response