PacificBiosciences / sawfish

Structural variant discovery and genotyping from mapped PacBio HiFi data

compatibility of minimap2 #1

Closed jamesc99 closed 2 weeks ago

jamesc99 commented 3 weeks ago

Hi there,

Thanks for developing this tool!

I was trying to run a minimap2-aligned BAM file with sawfish and encountered an error saying the 'rq' tag is missing. I checked PacBio's documentation and found that it is the tag describing the expected read accuracy.

I am wondering whether there is a way to run a minimap2-aligned BAM file through sawfish, for example by adding a specific parameter.

Thank you! Ryan

Error log file:

[2024-08-21][16:53:50][sawfish][INFO] Starting sawfish
[2024-08-21][16:53:50][sawfish][INFO] cmdline: sawfish discover --threads 8 --bam ../../../test_data/hg002_pacbio_minimap2.bam --ref /users/u250191/ryan_scratch_ln/reference/human-grch38.fasta
[2024-08-21][16:53:50][sawfish][INFO] Running on 8 threads
[2024-08-21][16:53:51][sawfish][INFO] Writing discover settings to file: 'sawfish_discover_output/discover.settings.json'
[2024-08-21][16:53:51][sawfish][INFO] Reading reference genome from file '/users/u250191/ryan_scratch_ln/reference/human-grch38.fasta'
[2024-08-21][16:55:18][sawfish][INFO] Processing alignment file '../../../test_data/hg002_pacbio_minimap2.bam'
[2024-08-21][17:00:18][sawfish][INFO] Processed alignments on 1,852,004 of 3,088,286 ref genome kb (59%)
[2024-08-21][17:03:32][sawfish][INFO] Finished processing all alignments
[2024-08-21][17:03:32][sawfish][INFO] Writing max sv depth genome regions to bed file: 'sawfish_discover_output/max.depth.bed'
[2024-08-21][17:03:32][sawfish][INFO] Getting depth bin GC content
[2024-08-21][17:03:36][sawfish][INFO] Getting GC depth correction levels
[2024-08-21][17:03:36][sawfish][INFO] Completing breakpoint observation clustering
[2024-08-21][17:03:37][sawfish][INFO] Finished breakpoint observation clustering
[2024-08-21][17:03:37][sawfish][INFO] Consolidating adjacent indel candidates
[2024-08-21][17:03:37][sawfish][INFO] Finding large insertion candidates
[2024-08-21][17:03:37][sawfish][INFO] Writing breakpoint_cluster debug bed file: 'sawfish_discover_output/debug.breakpoint_clusters.bed'
[2024-08-21][17:03:37][sawfish][INFO] Refining SV Candidates
[2024-08-21][17:03:37][sawfish][INFO] Writing debug assembly region bed file: 'sawfish_discover_output/assembly.regions.bed'
[2024-08-21][17:03:37][sawfish][INFO] Starting refinement of breakpoint evidence clusters 
thread '<unnamed>' panicked at src/bam_utils.rs:412:5:
Missing rq tag in read m54329_180926_230856/7012789/ccs
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
/var/spool/slurm/d/job538394/slurm_script: line 34: 805518 Aborted                 (core dumped) sawfish discover --threads 8 --bam $input_bam --ref $ref38

Command used for minimap2 mapping: $minimap -Y -ax map-hifi -R "@RG\tSM:${INFOR}\tID:${RG_ID}" $ref38 $FASTQ_FILE --MD -t 8
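
As a minimal sketch (assuming pysam is installed; the BAM path is a placeholder), this is how one could check whether the reads in a minimap2-aligned BAM actually carry the rq tag before running sawfish:

```python
# Minimal sketch, assuming pysam is installed; the BAM path is a placeholder.
# Scans primary alignments and reports how many lack the PacBio 'rq'
# (predicted read accuracy) tag that older sawfish versions require.
import pysam

bam_path = "hg002_pacbio_minimap2.bam"  # placeholder path

checked = 0
missing = 0
with pysam.AlignmentFile(bam_path, "rb") as bam:
    for read in bam:
        if read.is_secondary or read.is_supplementary or read.is_unmapped:
            continue
        checked += 1
        if not read.has_tag("rq"):
            missing += 1
        if checked >= 10_000:  # a sample of reads is enough to spot the problem
            break

print(f"{missing} of {checked} primary reads are missing the 'rq' tag")
```

If this reports a nonzero count, the rq-tag panic shown in the log above would be expected on sawfish versions that still require the tag.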

ctsa commented 3 weeks ago

Hi, thanks for reporting this issue.

I should first note that we strongly recommend pbmm2 for all sawfish analysis, as discussed in a bit more detail here:

https://github.com/PacificBiosciences/sawfish/blob/main/docs/user_guide.md#read-mapper

It may be possible to use minimap2 with the appropriate flags to prevent hard-clipping, but this is untested/unsupported.

The good news for the rq tag issue is that we released an update (0.12.2) yesterday that relaxes this requirement, so updating your version should take care of this specific error.
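
As a side note, a minimal sketch (assuming pysam; the path is a placeholder) for confirming which mapper produced a BAM before running sawfish on it is to inspect the @PG header records:

```python
# Minimal sketch, assuming pysam; the path is a placeholder. Prints the @PG
# header records so you can confirm whether the BAM was produced by pbmm2,
# minimap2, or another mapper.
import pysam

bam_path = "HG002_aligned_GRCh38.sorted.bam"  # placeholder path

with pysam.AlignmentFile(bam_path, "rb") as bam:
    for pg in bam.header.to_dict().get("PG", []):
        # PN is the program name, CL the command line it was invoked with
        print(pg.get("PN", "?"), "|", pg.get("CL", ""))
```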

zengxi-hada commented 3 weeks ago

I am also experiencing the same error while using this bam file: https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/HG002/analysis/aligned_reads/hifi/GRCh38/HG002_aligned_GRCh38_winnowmap.sorted.bam.

zengxi-hada commented 2 weeks ago

I just ran version 0.12.2, but got a new error: thread '<unnamed>' panicked at src/refine_sv/trimmed_reads.rs:197:5: assertion failed: !is_hard_clipped(&record.cigar())

Below is the full run log:

[2024-08-25][09:31:28][sawfish][INFO] Starting sawfish
[2024-08-25][09:31:28][sawfish][INFO] cmdline: /home/xiz978/bin/BCH/sawfish-v0.12.2-x86_64-unknown-linux-gnu/bin/sawfish discover --threads 16 --ref /n/data1/bch/genetics/lee/reference/hg38/Homo_sapiens_assembly38.fasta --bam /n/storage_test/xiz978/Wangziying/HG002/hifi/HG002_aligned_GRCh38_winnowmap.sorted.bam --output-dir HG002_discover_dir
[2024-08-25][09:31:28][sawfish][INFO] Running on 16 threads
[2024-08-25][09:31:29][sawfish][INFO] Writing discover settings to file: 'HG002_discover_dir/discover.settings.json'
[2024-08-25][09:31:29][sawfish][INFO] Reading reference genome from file '/n/data1/bch/genetics/lee/reference/hg38/Homo_sapiens_assembly38.fasta'
[2024-08-25][09:31:58][sawfish][INFO] Processing alignment file '/n/storage_test/xiz978/Wangziying/HG002/hifi/HG002_aligned_GRCh38_winnowmap.sorted.bam'
[2024-08-25][09:36:58][sawfish][INFO] Processed alignments on 1,181,630 of 3,099,922 ref genome kb (38%)
[2024-08-25][09:42:05][sawfish][INFO] Processed alignments on 2,595,202 of 3,099,922 ref genome kb (83%)
[2024-08-25][09:43:24][sawfish][INFO] Finished processing all alignments
[2024-08-25][09:43:24][sawfish][INFO] Writing max sv depth genome regions to bed file: 'HG002_discover_dir/max.depth.bed'
[2024-08-25][09:43:24][sawfish][INFO] Getting depth bin GC content
[2024-08-25][09:43:27][sawfish][INFO] Getting GC depth correction levels
[2024-08-25][09:43:27][sawfish][INFO] Completing breakpoint observation clustering
[2024-08-25][09:43:28][sawfish][INFO] Finished breakpoint observation clustering
[2024-08-25][09:43:28][sawfish][INFO] Consolidating adjacent indel candidates
[2024-08-25][09:43:28][sawfish][INFO] Finding large insertion candidates
[2024-08-25][09:43:28][sawfish][INFO] Writing breakpoint_cluster debug bed file: 'HG002_discover_dir/debug.breakpoint_clusters.bed'
[2024-08-25][09:43:28][sawfish][INFO] Refining SV Candidates
[2024-08-25][09:43:28][sawfish][INFO] Writing debug assembly region bed file: 'HG002_discover_dir/assembly.regions.bed'
[2024-08-25][09:43:28][sawfish][INFO] Starting refinement of breakpoint evidence clusters
thread '<unnamed>' panicked at src/refine_sv/trimmed_reads.rs:197:5:
assertion failed: !is_hard_clipped(&record.cigar())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
/var/spool/slurmd/job45032830/slurm_script: line 9: 17052 Aborted                 /home/xiz978/bin/BCH/sawfish-v0.12.2-x86_64-unknown-linux-gnu/bin/sawfish discover --threads 16 --ref /n/data1/bch/genetics/lee/reference/hg38/Homo_sapiens_assembly38.fasta --bam /n/storage_test/xiz978/Wangziying/HG002/hifi/HG002_aligned_GRCh38_winnowmap.sorted.bam --output-dir HG002_discover_dir
[2024-08-25][09:43:41][sawfish][INFO] Starting sawfish
[2024-08-25][09:43:41][sawfish][INFO] cmdline: /home/xiz978/bin/BCH/sawfish-v0.12.2-x86_64-unknown-linux-gnu/bin/sawfish joint-call --threads 16 --sample HG002_discover_dir --output-dir HG002_joint_call_dir
[2024-08-25][09:43:41][sawfish][INFO] Running on 16 threads
[2024-08-25][09:43:41][sawfish][INFO] Reading reference genome from file '/n/data1/bch/genetics/lee/reference/hg38/Homo_sapiens_assembly38.fasta'
[2024-08-25][09:44:10][sawfish][INFO] Reading all sample discovery input
thread '<unnamed>' panicked at /var/lib/bamboo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/unwrap-1.2.1/src/lib.rs:55:25:
ctsa commented 2 weeks ago

Thanks @zengxi-hada,

I will make a note to provide a clearer error message here. The error indicates that your input BAM uses hard-clipping for split reads, which sawfish doesn't support at this time. As noted above, we support and test the SV caller with pbmm2 only. It may be possible to get the analysis working with minimap2, but at a minimum you would need to change the settings to prevent hard-clipping of split reads, and even then minimap2 is untested, so there may be additional complications. I would recommend using pbmm2; some discussion of this in the user guide can be found here:

https://github.com/PacificBiosciences/sawfish/blob/main/docs/user_guide.md#read-mapper
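
To see whether a BAM contains the hard-clipped alignments that trip this assertion, here is a minimal sketch (assuming pysam; the path is a placeholder):

```python
# Minimal sketch, assuming pysam; the path is a placeholder. Counts mapped
# records whose CIGAR contains hard clips (BAM op code 5, 'H'), which is the
# condition the `!is_hard_clipped(&record.cigar())` assertion rejects.
import pysam

HARD_CLIP = 5  # BAM CIGAR operation code for 'H'

bam_path = "HG002_aligned_GRCh38_winnowmap.sorted.bam"  # placeholder path

total = 0
hard_clipped = 0
with pysam.AlignmentFile(bam_path, "rb") as bam:
    for read in bam:
        if read.is_unmapped:
            continue
        total += 1
        cigar = read.cigartuples or []
        if any(op == HARD_CLIP for op, _ in cigar):
            hard_clipped += 1

print(f"{hard_clipped} of {total} mapped records contain hard clips")
```

If this reports a nonzero count, the BAM would need to be produced with mapper settings that soft-clip split/supplementary alignments, or, per the recommendation above, remapped with pbmm2.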

ctsa commented 2 weeks ago

Closing the original issue as answered; tracking the error message improvement in #2.