Parsoa / SVDSS

Improved structural variant discovery in accurate long reads using sample-specific strings (SFS)
MIT License
42 stars 4 forks source link

Seg default #33

Open Flooooooooooooower opened 3 months ago

Flooooooooooooower commented 3 months ago

When I used LRA to align the HG002 ONT data and ran the following code, I did not get any results. However, other alignment software worked fine.

/home/xxx/00.Software/SVDSS/SVDSS_linux_x86-64 index 
/home/xxx/00.Software/SVDSS/SVDSS_linux_x86-64 smooth --reference /home/xxx/01.Reference/Human_GRCh37/GCF_000001405.25_GRCh37.p13_genomic.fa --bam ${sample} --threads 20 > ${path}/${name}.smoothed.bam
    samtools index -@ 20 ${path}/${name}.smoothed.bam
    /home/xxx/00.Software/SVDSS/SVDSS_linux_x86-64 search --index /home/xxx/01.Reference/Human_GRCh37/GCF_000001405.25_GRCh37.p13_genomic.fmd --bam ${path}/${name}.smoothed.bam --threads 20 >${path}/${name}.txt
    /home/xxx/00.Software/SVDSS/SVDSS_linux_x86-64 call --reference 
(svdss) [xxx@hpc01 08.lra]$ /home/xxx/00.Software/SVDSS/SVDSS_linux_x86-64 call --reference /home/xxx/01.Reference/Human_GRCh37/GCF_000001405.25_GRCh37.p13_genomic.fa --bam HG002.ONT_lra.smoothed.bam --sfs ./HG002.ONT_lra.txt  --threads 20 --min-cluster-weight 2  --clipped   >calls.vcf
[2024-08-18 17:58:25.669] [stderr] [info] Loading reference genome from /home/jiaoh/01.Reference/Human_GRCh37/GCF_000001405.25_GRCh37.p13_genomic.fa..
[2024-08-18 17:58:47.035] [stderr] [info] Loading SFSs from ./HG002.ONT_lra.txt..
[2024-08-18 17:59:02.036] [stderr] [info] Loaded 15907313 SFSs from 1768750 reads.
[2024-08-18 17:59:02.099] [stderr] [info] Placing SFSs on reference genome
[2024-08-18 18:01:11.676] [stderr] [info] 107/0/0 unplaced SFSs. 0 erroneus SFSs. 3500774 clipped SFSs.
[2024-08-18 18:01:11.676] [stderr] [info] Clustering 1054888 SFSs..
[2024-08-18 18:01:12.731] [stderr] [info] Maximum extended SFS length: 49162bp. Using separation distance: 54078bp.
[2024-08-18 18:01:13.337] [stderr] [info] Extending 314027 clusters..
[2024-08-18 18:03:18.167] [stderr] [info] Filtered 1247 SFSs. Filtered 214848 clusters. Filtered 39 global clusters.
[2024-08-18 18:03:18.167] [stderr] [info] Calling SVs from 314027 clusters..
Segmentation fault (core dumped)

I would greatly appreciate it if you could help me identify the error!!!!

ldenti commented 3 months ago

Hi, I'll try lra on some fastq I have and I'll let you know.

Did you see any warning or error while running the previous steps (i.e., smooth and search)?

Best,

ldenti commented 2 months ago

I tried lra but after smoothing, the samtools index failed

[E::bam_read1] CIGAR and query sequence lengths differ for m64062_200604_115437/45811539/ccs
samtools index: failed to create index for "hg007.lra.smoothed.bam"

I checked that read in the lra bam and I found that it's cigar is: 68=1X4=...18=5D5S. I think that having a deletion followed by a soft clip may create issue in the smoothing step. Can you please check if you have a similar error while indexing the smoothed bam file?

If not, can you please share where I can get the fastq you are using? I think I can replicate the error that way.

Best,

Flooooooooooooower commented 2 months ago

Hello, the data I used is the ultra-long ONT data of HG002: ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/UCSC_Ultralong_OxfordNanopore_Promethion/GM24385_2.fastq.gz 885e56952b657af9abc7ca644151411e. However, I did not encounter the samtools index failure error. Below is the command I used for alignment with LRA:

#apptainer exec ${container}/lra.sif lra index -ONT ${ref}
zcat ${sample} | apptainer exec ${container}/lra.sif lra align -ONT ${ref} /dev/stdin -t ${threads} -p s --printMD -SkipH >${output}/${name}_lra.align.sam
apptainer exec ${container}/AssemblyTools_v1.sif samtools sort -@ ${threads}  -O BAM -o ${output}/${name}_lra.sort.bam ${output}/${name}_lra.align.sam
apptainer exec ${container}/AssemblyTools_v1.sif samtools index  -@ ${threads} ${output}/${name}_lra.sort.bam
ldenti commented 2 months ago

Hi, the issue is that in some regions (like the one in the screenshot) potential SV are very long. image

When SVDSS try to create the consensus (via abpoa), it fails due to not enough memory (or something like that). These cases are not expected with HiFi reads that are way shorter than ONT. The only patch I can think of for now is to filter out those regions. I prepared this patched binary. I tried it on the lra bam and it should work - please try and let me know.

However, I'd like to point out that we developed SVDSS to work with HiFi data and not ONT - although it could work, results may not be the best. We are currently working on extending the approach to ONT, but it's not ready yet.