epi2me-labs / wf-human-variation

Other
103 stars 45 forks source link

Impausible Reads after Dorado Aligner #212

Closed matomol closed 2 months ago

matomol commented 2 months ago

Operating System

Other Linux (please specify below)

Other Linux

Ubuntu

Workflow Version

wf-human-variation v2.3.1-g6a6daa6

Workflow Execution

Command line (Local)

Other workflow execution

No response

EPI2ME Version

No response

CLI command run

nextflow run epi2me-labs/wf-human-variation \ --out_dir './variant' \ -w './variant/workspace' \ --bam './filtered' \ --ref '/home/ai/data/reference/Nanopore-recommended/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna' \ --sample_name 'Test' \ --snp \ --sv \ --mod \ --phased \ -profile standard > out.txt

Workflow Execution - CLI Execution Profile

None

What happened?

  1. I basecalled pod5 files by dorade mode sup
  2. I aligned the basecalled file Everything went smoothly.

Now when I variant call this run I permanently get an error (see below), which is implausible as alignment run without any error.

When I analyse the mentioned read I see that it is an ambiguous read that occurs several times in the bam file, so maybe one of these occurences is secondary and therefore considered implausible.

When I exclude the mentioned read from the bam file from the remaining bam file an other read pops up as implausible.

Therefore this is a general problem either of the aligner or the bamstat software. Im just trying to find a workaround by filtering the bam file for ambiguous reads before variant calling it.

Relevant log output

Command executed:

  samtools view -b -h -L allChromosomes.bed reads.bam | bamstats             -s Test             -i Test.per-file-runids.txt             -l Test.basecallers.tsv             --histogram Test-histograms             -u             -f Test.flagstat.tsv             --threads 3             - | gzip > "Test.readstats.tsv.gz"

Command error:
  Read '51ea748d-14f6-4dce-97bc-789706a92b02' appears to contain implausible alignment information

Application activity log entry

No response

Were you able to successfully run the latest version of the workflow with the demo data?

yes

Other demo data information

No response

SamStudio8 commented 2 months ago

@matomol Can you provide more information about your basecalling and alignment steps. Specifically what commands were run and what version of Dorado.

matomol commented 2 months ago

$ dorado --version [2024-08-28 12:47:41.624] [info] Running: "--version" 0.7.3+6e6c45c

dorado basecaller hac,5mCG_5hmCG ./pod5 --reference /reference/Nanopore-recommended/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna > ./$OUTPUT_FILE

dorado aligner /reference/Nanopore-recommended/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna $OUTPUT_DIR --output-dir $SAVE_DIR

SamStudio8 commented 2 months ago

@matomol I have deleted your comment as I do not know what the nature of the shared data is. If this is data from a reference genome such as HG002, let me know and we will be able to look at it.

In your "Relevant log output" section there should be a work directory path. I'd like to see the tags for one of the implausible reads:

samtools view /path/to/your/bamstats/working/dir/reads.bam | grep '^51ea748d-14f6-4dce-97bc-789706a92b02' | cut -f12- | sed -E 's,(..:.):[^\t]*,\1,g'
matomol commented 2 months ago

Sorry I have no idea where the file "/path/to/your/bamstats/working/dir/reads.bam" resides. There are so many "reads.bam" files in the "variant/workspace" folder.

Is it possible to run this command with the aligned bam file too?

matomol commented 2 months ago

With the original aligned bam file I got:

$ samtools view ./bcalls_hac-mod_np.bam | grep '^51ea748d-14f6-4dce-97bc-789706a92b02' | cut -f12- | sed -E 's,(..:.):[^\t]*,\1,g' qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z MN:i MM:Z ML:B NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z

SamStudio8 commented 2 months ago

@matomol The reads.bam we require will specifically be in the path indicated by the error message you posted. The work dir will appear underneath this section:

Command executed:

  samtools view -b -h -L allChromosomes.bed reads.bam | bamstats             -s Test             -i Test.per-file-runids.txt             -l Test.basecallers.tsv             --histogram Test-histograms             -u             -f Test.flagstat.tsv             --threads 3             - | gzip > "Test.readstats.tsv.gz"

Command error:
  Read '51ea748d-14f6-4dce-97bc-789706a92b02' appears to contain implausible alignment information

See https://labs.epi2me.io/how-to-exits/#how-do-i-know-the-exit-code-of-a-workflow for the anatomy of a Nextflow error.

matomol commented 2 months ago

This is an other read, but it caused the same error. $ samtools view /home/ai/data/240819/15378MN_AS/aligned_np/hac-mod/variant/workspace/7a/25c1cd70068a9dc2b4a1da86a4d01c/reads.bam | grep '^3b40aac0-07f2-44f2-97bc-c95219fc20db' | cut -f12- | sed -E 's,(..:.):[^\t]*,\1,g' qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z MN:i MM:Z ML:B NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z zd:i rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z zd:i rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z zd:i rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i

SamStudio8 commented 2 months ago

With the original aligned bam file I got:

$ samtools view ./bcalls_hac-mod_np.bam | grep '^51ea748d-14f6-4dce-97bc-789706a92b02' | cut -f12- | sed -E 's,(..:.):[^\t]*,\1,g' qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z MN:i MM:Z ML:B NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i MD:Z rl:i qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z qs:f du:f ns:i ts:i mx:i ch:i st:Z rn:i fn:Z sm:f sd:f sv:Z dx:i RG:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z NM:i ms:i AS:i nn:i de:f tp:A cm:i s1:i s2:i MD:Z rl:i SA:Z

Thanks @matomol - I see there are duplicate NM tags in the reads in your original BAM here. This may be because you have used Dorado to basecall with a --reference (which will handle alignment), and then called dorado aligner to align again. What was the motivation for the call to dorado aligner?

matomol commented 2 months ago

Well, I didn't know that the basecaller does the alignment too. I thought these are two separate programs that have to be called subsequently. See also my post above.

So after all, I will try nextflow variant calling right after dorado basecalling now. Let's see what happens.

matomol commented 2 months ago

Well the dorado basecaller creates a bam file without an index, and when I try to index this bam file I get an immediate error:

$ samtools index -b ./bcalls_hac-mod_np.bam [E::hts_idx_push] NO_COOR reads not in a single block at the end 5 -1 [E::sam_index] Read 'f8d506cb-4c4d-4260-a081-fb11df76168c' with ref_name='chr6', ref_length=170805979, flags=16, pos=57110062 cannot be indexed samtools index: failed to create index for "./bcalls_hac-mod_np.bam": No such file or directory

SamStudio8 commented 2 months ago

I believe this means the BAM is not sorted. Try passing your BAM straight to the workflow, it will work out whether you need sorting and indexing and so on automatically.

On Thu, 29 Aug 2024, 07:27 matomol, @.***> wrote:

Well the dorado basecaller creates a bam file without an index, and when I try to index this bam file I get an immediate error:

$ samtools index -b ./bcalls_hac-mod_np.bam

[E::hts_idx_push] NO_COOR reads not in a single block at the end 5 -1 [E::sam_index] Read 'f8d506cb-4c4d-4260-a081-fb11df76168c' with ref_name='chr6', ref_length=170805979, flags=16, pos=57110062 cannot be indexed samtools index: failed to create index for "./bcalls_hac-mod_np.bam": No such file or directory

— Reply to this email directly, view it on GitHub https://github.com/epi2me-labs/wf-human-variation/issues/212#issuecomment-2316808503, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIN6OWCT5LJLGJSS2ACJLDZT25LZAVCNFSM6AAAAABNH3XG4GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJWHAYDQNJQGM . You are receiving this because you commented.Message ID: @.***>

matomol commented 2 months ago

Well, as you said. It's just running fine. Let's see the results in the afternoon.

SamStudio8 commented 2 months ago

@matomol Great. I've checked internally and the Dorado issue where re-alignment with dorado aligner after basecalling will be fixed in the next release. I'm closing this ticket as the problem at hand has been resolved but please open a new issue if you encounter any trouble running the workflow.