brendanofallon / jovian

Detecting variants in NGS short read data by sequence-to-sequence modeling

Possible for long reads or tool to shred long alignments to non-overlapping records? #40

Closed - jelber2 closed this issue 8 months ago

jelber2 commented 8 months ago

Hi,

I have some HG002 long reads that I have error-corrected to almost short-read quality - https://gist.github.com/jelber2/02fdca09ed3b12b650b512ba200f3b67

I was wondering if there might be a tool to shred my alignments from, say, 20,000 bp (just as an example) down to 150 bp, or whatever the maximum read length allowed by jenever is. I could try to write a Rust (or Julia) tool to do this, but I wanted to ask in case someone already has something. I know of tools for shredding unaligned reads, but not for already-aligned reads.

Best, Jean

jelber2 commented 8 months ago

OK, so shred.sh from BBTools/BBMap (https://github.com/BioInfoTools/BBMap/blob/master/sh/shred.sh) can shred alignments. I just tried it with a toy example:

$ minimap2 -acx sr <(echo -e ">chr1\nACTGACTG") <(echo -e "@1\nACT\n+\n???")|samtools sort > test.bam

[M::mm_idx_gen::0.001*3.97] collected minimizers
[M::mm_idx_gen::0.001*2.48] sorted minimizers
[M::main::0.001*2.46] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*2.45] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.001*2.42] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: inf; total length: 8
[M::worker_pipeline::0.002*2.19] mapped 1 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -acx sr /dev/fd/63 /dev/fd/62
[M::main] Real time: 0.002 sec; CPU: 0.004 sec; Peak RSS: 0.004 GB

$ shred.sh in=test.bam length=1 minlength=1 out=test2.bam ow=t

Executing jgi.Shred [in=test.bam, length=1, minlength=1, out=test2.bam, ow=t]
Could not find sambamba.
Found samtools 1.9
Time:                           0.076 seconds.
Reads Processed:           1    0.01k reads/sec
Bases Processed:           3    0.00m bases/sec
Reads Out:                 3
Bases Out:                 3

$ samtools view test2.bam 
1_0-0   4   *   0   0   *   *   0   0   A   ?
1_1-1   4   *   0   0   *   *   0   0   C   ?
1_2-2   4   *   0   0   *   *   0   0   T   ?

But it is best to install BBTools either through conda (bioconda channel) or from https://sourceforge.net/projects/bbmap/
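If it helps, installing from bioconda should be as simple as the following (I believe the bioconda package is named bbmap):

# install BBTools (which provides shred.sh) from the bioconda channel
conda install -c bioconda bbmap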

jelber2 commented 8 months ago

Oh, I just realized that the output is a uBAM (unaligned BAM).

jelber2 commented 8 months ago

OK, so ignore the previous examples with BAM files and shredding; it was a poor example with 0 aligned regions anyway. EDIT: If you do not make the aligned portions smaller, then jenever seems to complain about there being too few reads and throws out quite a few potential variants, hence the shredding that follows. I made a simple Julia script that shreds a BAM file, and so far it seems to at least pass samtools's internal parsing. I am waiting for a GPU to become available on our cluster to test it :) I am new to Julia, so this implementation is probably inefficient and could be sped up, and of course it assumes all qualities are Q30 (ASCII '?'), but who cares...

shredBAM.jl

import Pkg; Pkg.add("XAM");
using XAM
using Base.Iterators: partition

function split_cigar(cigar_string::String, split_length::Int)
    # Regular expression to match CIGAR operations
    cigar_pattern = r"(\d+)([MIDNSHP=X])"
    operations = collect(eachmatch(cigar_pattern, cigar_string))

    # List to hold the split CIGAR strings
    split_cigars = String[]
    current_cigar = ""
    current_length = 0

    for op in operations
        op_length = parse(Int, op.captures[1]) # renamed from `length` to avoid shadowing Base.length
        operation = op.captures[2]
        while op_length > 0
            if operation == "D"
                # For deletions, add the operation to the current cigar
                # but do not increase the current_length
                current_cigar *= "$(op_length)$(operation)"
                break # Deletion is fully processed
            else
                # Determine the length of the current operation to add
                add_length = min(op_length, split_length - current_length)
                current_cigar *= "$(add_length)$(operation)"
                op_length -= add_length
                current_length += add_length

                # If the current split reaches the desired length, start a new split
                if current_length == split_length
                    push!(split_cigars, current_cigar)
                    current_cigar = ""
                    current_length = 0
                end
            end
        end
    end

    # Add any remaining operations to the last split
    if current_cigar != ""
        push!(split_cigars, current_cigar)
    end

    return split_cigars
end

# Example usage
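# A quick illustration (my own worked check, not from the original script):
#   split_cigar("100M2D50M", 75)  ==>  ["75M", "25M2D50M"]
# The first 75 read bases form one chunk; the 2-bp deletion consumes
# reference but no read bases, so it stays attached to the next chunk.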

function split_BAM_records(input_file::String)
    reader = open(BAM.Reader, input_file)
    for record in reader
        if BAM.ismapped(record)
            if BAM.isprimaryalignment(record)
                # Break the record into per-segment sequences, CIGARs, read names, and positions
                smaller_sequences = break_into_smaller_sequences(record, parse(Int, ARGS[2]))
                smaller_cigars = split_cigar(String(BAM.cigar(record)), parse(Int, ARGS[2]))
                smaller_templates = template_names(record, parse(Int, ARGS[2]))
                smaller_pos = new_positions(record, parse(Int, ARGS[2]))
                # Write the new smaller records to the output file
                for (small_seq, small_cigar, small_temp, small_pos) in zip(smaller_sequences, smaller_cigars, smaller_templates, smaller_pos)
                    println("$small_temp", "\t", 0, "\t", BAM.refname(record), "\t", "$small_pos", "\t", BAM.mappingquality(record), "\t", "$small_cigar", "\t*\t0\t0\t", "$small_seq", "\t", join(fill("?", length("$small_seq"))))
                end
            end
        end
    end
    close(reader)
end

function template_names(record::BAM.Record, segment_length::Int)
    # Get the full sequence of the read, including unaligned regions
    full_sequence = BAM.sequence(record)
    full_length = length(full_sequence)

    # Calculate the number of segments
    num_segments = ceil(Int, full_length / segment_length)

    # Placeholder for new records
    new_records = []

    # Generate new records for each segment
    for i in 1:num_segments
        # template name
        new_template_name = string(BAM.tempname(record)) * "_" * string(i)
        # Add the new record to the array
        push!(new_records, new_template_name)
    end

    return new_records
end

function break_into_smaller_sequences(record::BAM.Record, segment_length::Int)
    # Get the full sequence of the read, including unaligned regions
    full_sequence = BAM.sequence(record)
    full_length = length(full_sequence)

    # Calculate the number of segments
    num_segments = ceil(Int, full_length / segment_length)

    # Placeholder for new records
    new_records = []

    # Generate new records for each segment
    for i in 1:num_segments
        # Calculate the start and end of the segment
        start_pos = (i - 1) * segment_length + 1
        end_pos = min(i * segment_length, full_length)

        # Extract the segment sequence
        segment_sequence = full_sequence[start_pos:end_pos]

        # Add the new record to the array
        push!(new_records, segment_sequence)
    end

    return new_records
end

function new_positions(record::BAM.Record, segment_length::Int)
    # Get the full sequence of the read, including unaligned regions
    full_sequence = BAM.sequence(record)
    full_length = length(full_sequence)

    # Calculate the number of segments
    num_segments = ceil(Int, full_length / segment_length)

    # Placeholder for new records
    new_records = []

    # Generate new records for each segment
    for i in 1:num_segments
        # Calculate the start and end of the segment
        start_pos = BAM.position(record) + (i - 1) * segment_length # approximate; ignores indels and clipping within the read

        # Add the new record to the array
        push!(new_records, start_pos)
    end

    return new_records
end

# Check that a BAM file and shred length are passed as command line arguments
if length(ARGS) < 2
    println("Usage: julia ", PROGRAM_FILE, " <infile.BAM>", " <shredLength.integer>")
    exit(1)
end

split_BAM_records(ARGS[1])

To use (tested with Julia v1.10.2):

julia shredBAM.jl hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.bam 300 > test.sam.noheader

# add on a PG line
SHRED_LENGTH=300
INPUT=hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.bam

# this is the command to make the BAM file
cat <(samtools view -H $INPUT) <(echo -e "@PG\tID:shredBAM.jl\tPN:shredBAM.jl\tVN:0.01\tCL:julia shredBAM.jl $INPUT $SHRED_LENGTH") test.sam.noheader | \
samtools sort -@12 \
> hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.shredBAM.jl.300.bam

samtools index -@12 hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.shredBAM.jl.300.bam
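
To sanity-check the shredded BAM before feeding it to jenever, generic samtools checks like these should be enough (nothing jenever-specific; just confirming the file parses and the record counts look sensible):

samtools quickcheck hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.shredBAM.jl.300.bam && echo "BAM parses OK"
samtools flagstat hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.shredBAM.jl.300.bam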

Yesterday, I tried taking an error-corrected long-read BAM file, converting it to FASTQ, shredding it into 500-bp pieces, and remapping them to the reference (a rough sketch of that pipeline follows the table below). I ended up with the following from jmcdani20/hap.py:v0.3.12.

Type   Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  FP.al  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
INDEL  ALL     11000        8736      2264      14090        5111      8          1111   1523   0.794182       0.637054          0.000568        0.706993         1.79270946681           1.47966162707
INDEL  PASS    11000        6740      4260      10221        3276      6          740    806    0.612727       0.679295          0.000587        0.644296         1.79270946681           1.37773549001
SNP    ALL     71659        68788     2871      73484        4538      24         1043   978    0.959935       0.938225          0.000327        0.948956         2.31007250727           2.28175364972           1.87833982884              1.82199938528
SNP    PASS    71659        51489     20170     53932        2325      12         427    322    0.718528       0.956881          0.000223        0.82075          2.31007250727           2.31213163065           1.87833982884              1.80348413937
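
The BAM -> FASTQ -> shred -> remap step was roughly the following (a sketch, not my exact commands; seqkit sliding is just one way to cut reads into fixed-length pieces, and the read file names are placeholders):

# back to FASTQ, cut into non-overlapping 500-bp windows, and remap as short reads
samtools fastq corrected-long-reads.bam > corrected-long-reads.fastq
seqkit sliding -s 500 -W 500 corrected-long-reads.fastq > shreds.500bp.fastq
minimap2 -ax sr Homo_sapiens_GRCh38_no_alt.chr20.fa shreds.500bp.fastq | samtools sort -@12 -o shreds.500bp.bam
samtools index shreds.500bp.bam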

For jenever, I used the following command. In hindsight, I think I should have used the entire chr20 as the region in the BED file (since METRIC.Frac_NA is so low). It took 3 hours on an A100 40GB card.

jenever call -r /mnt/clair3-training/Homo_sapiens_GRCh38_no_alt.chr20.fa \
  --threads 48 \
  -m /mnt/git/jovian/models/100M_s28_cont_mapsus_lolr2_epoch2.model \
  --bed HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.20.bed \
  --bam hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.bam \
  -c /mnt/git/jovian/models/s28ce40_bamfix.model \
  -v hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.bam.jenever.vcf
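
For reference, a hap.py (jmcdani20/hap.py:v0.3.12) comparison against the GIAB HG004 v4.2.1 truth set looks roughly like this once any container wrapper is stripped away (the truth VCF name and output prefix are placeholders, not my exact paths):

hap.py \
  HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
  hg004.herro.pg_asm.Q30.sam1.3.SoftClip.chr20.fastq.gz.chr20.remap.bam.jenever.vcf \
  -f HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.20.bed \
  -r /mnt/clair3-training/Homo_sapiens_GRCh38_no_alt.chr20.fa \
  -o hg004.chr20.jenever.happy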

Just out of curiosity, what were 100M_s28_cont_mapsus_lolr2_epoch2.model and s28ce40_bamfix.model trained on? I thought they might not have been trained on HG004, which is why I used HG004.

brendanofallon commented 8 months ago

Interesting, better than I would have expected for SNPs! Jenever was trained only on Illumina WGS short read data (150bp), so long read / shredded long reads are definitely out of distribution. I only have access to Illumina data at my current position so it's not easy for me to fine-tune a model for longer reads, but if you have a lot of Genome-in-a-Bottle data for PacBio or ONT I'd be happy to help you train or fine-tune a model for that purpose. (HG004 was used in the training data, but no regions from chromosomes 21 or 22 - so those areas are best used for validation)

jelber2 commented 8 months ago

Do you have an email address where I can send you Dropbox links to HG001-HG004 CRAM and CRAI files plus a reference? The CRAM files are ~35x coverage of Herro reads (Oxford Nanopore SUP-accuracy reads corrected with the Herro deep learning model from https://github.com/lbcb-sci/herro). The data above are even more heavily corrected, but I worry that the additional correction may have introduced some random errors, even though the error rates seemed to go down based on this gist (https://gist.github.com/jelber2/02fdca09ed3b12b650b512ba200f3b67).

Note that I have to add fake quality scores of Q30 to the alignments, as the output of Herro and other error-correction tools is FASTA. Q30 seems pretty reasonable for substitution sites, but not for indel sites, based on the previously mentioned gist. I will try using chr21 this time around instead of chr20 as I did before.
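
In case it is useful, attaching a constant Q30 when converting Herro's FASTA output to FASTQ can be done with seqtk (just one way to do it; the input file name is a placeholder):

# '?' is Q30 in Phred+33
seqtk seq -F '?' herro_corrected_reads.fasta > herro_corrected_reads.Q30.fastq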

jelber2 commented 8 months ago

On second thought, this would be easier (the CRAMs are about 400-600 MB and the reference about 800 MB). They should cover all autosomes, although some mitochondrial reads may be missing, and I am not quite sure about the X and Y chromosomes in the CRAM files. If you need the alignments with soft clipping, let me know; hard-clipped alignments are unfortunately what I have in BAM at the moment (I converted to CRAM to go from a whopping 20 GB BAM to about 500 MB CRAM). When our cluster is less active, I could redo the alignment with soft clipping if needed.

# Get the CRAM files
wget 'https://www.dropbox.com/scl/fi/24cj6s5v88uokidr4x75m/hg001.herro.Q30.sam1.3.noSoftClip.cram?rlkey=upncjlzjg88tohewlhk3pdjpm' -O hg001.herro.Q30.sam1.3.noSoftClip.cram
wget 'https://www.dropbox.com/scl/fi/t7z4ku3ht0gsape3jpfr5/hg002.herro.Q30.sam1.3.noSoftClip.cram?rlkey=fkubcw7en094pb6mos0ju1135' -O hg002.herro.Q30.sam1.3.noSoftClip.cram
wget 'https://www.dropbox.com/scl/fi/v66d7bmc018s5rtqfympz/hg003.herro.Q30.sam1.3.noSoftClip.cram?rlkey=1qym0i4q2pcncquveac2vnrq8' -O hg003.herro.Q30.sam1.3.noSoftClip.cram
wget 'https://www.dropbox.com/scl/fi/4zvipkbxv8llmmu1btl0i/hg004.herro.Q30.sam1.3.noSoftClip.cram?rlkey=mhb8dw4md5j7ughpqqhvuliz5' -O hg004.herro.Q30.sam1.3.noSoftClip.cram

# get the CRAI files
wget 'https://www.dropbox.com/scl/fi/l1ncyk4nrbxmcd52e5sin/hg001.herro.Q30.sam1.3.noSoftClip.cram.crai?rlkey=r1fzlgt5xdoegritfqzc12q63' -O hg001.herro.Q30.sam1.3.noSoftClip.cram.crai
wget 'https://www.dropbox.com/scl/fi/p4o7xygz2cffby949fn81/hg002.herro.Q30.sam1.3.noSoftClip.cram.crai?rlkey=78l5d2cdz2u5wtjmbgc421o5d' -O hg002.herro.Q30.sam1.3.noSoftClip.cram.crai
wget 'https://www.dropbox.com/scl/fi/g28i9nzk4jsqcfsoeqkf7/hg003.herro.Q30.sam1.3.noSoftClip.cram.crai?rlkey=edwr7gdcpikpeohf3rb3oz805' -O hg003.herro.Q30.sam1.3.noSoftClip.cram.crai
wget 'https://www.dropbox.com/scl/fi/521x0zt3cjqsz99f1el83/hg004.herro.Q30.sam1.3.noSoftClip.cram.crai?rlkey=cc2tp7k1m36pto6c3n1drzhzm' -O hg004.herro.Q30.sam1.3.noSoftClip.cram.crai

# Get the Reference
wget 'https://www.dropbox.com/scl/fi/aqftq338k8zsja839iuw0/Homo_sapiens_GRCh38_no_alt.fa.fai?rlkey=v5bvcy3jcf2szpg50j866mhwm' -O Homo_sapiens_GRCh38_no_alt.fa.fai
wget 'https://www.dropbox.com/scl/fi/nfm8lzvwvg4cc8t97trl3/Homo_sapiens_GRCh38_no_alt.fa.gz?rlkey=re13nt7kydb82jxy2kzn6om9i' -O Homo_sapiens_GRCh38_no_alt.fa.gz

# decompress reference
gunzip Homo_sapiens_GRCh38_no_alt.fa.gz

# CRAM to BAM
THREADS=4 #not sure if more than 4 really makes a difference or not?
samtools view -@${THREADS} -b hg003.herro.Q30.sam1.3.noSoftClip.cram -T Homo_sapiens_GRCh38_no_alt.fa > test.bam

brendanofallon commented 8 months ago

Hmm, 400-600MB is very small for a WGS file (typical sizes for human WGS are 20-40GB as CRAM and about 100GB as BAM). Are these very low depth, or part of a hybrid-capture experiment?

jelber2 commented 8 months ago

Maybe just try downloading one CRAM with its CRAI and the reference, then double-check with samtools depth on a single chromosome? I think I did the conversion correctly, but yeah, I too was surprised. The N50 varied, I think, but it was somewhere around 20 kbp.
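
A quick average-depth check on one chromosome would be something like this (chr20 is arbitrary; adjust the contig name to whatever the reference uses):

samtools depth -a -r chr20 --reference Homo_sapiens_GRCh38_no_alt.fa hg002.herro.Q30.sam1.3.noSoftClip.cram | awk '{sum += $3} END {print sum / NR}'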

brendanofallon commented 8 months ago

Oh, I see, the base qualities are all '?' (ASCII 63), which makes them easy to compress! The depths seem fine - 30-40X or so - that should be plenty. I haven't really considered long-read data before, so I'll have to look into the best way to make this work - probably the windowing logic will need to be adjusted somehow. Not sure about soft clipping for ONT data: does it affect only little bits at the ends of the reads, or is it more complicated?

jelber2 commented 8 months ago

Yes, the quality scores are fake Phred Q30 (Phred+33). I am not sure what to say about the soft clipping. What you have right now has hard clipping, as that is what I had mapped at the moment, and I figured it would take quite a while to upload even the gzipped FASTA/FASTQ. Unfortunately, I do not think you can use the script I wrote for shredding as input for training, because the script was written for soft-clipped rather than hard-clipped alignments. Would you need the soft-clipped alignments? I could also try to update my script to handle hard clips properly; then you could just work with the hard-clipped alignments you already have and shred them to 150 bp or whatever you desire.

jelber2 commented 8 months ago

As far as I can tell, the soft clipping is at the read ends? EDIT: My shredding tool seems to be throwing errors with jenever, so I guess, at least for my testing, the only thing I can do is simply convert BAM to FASTQ, shred to 500 bp, and map?

EDIT2: I seemed to have an error because jenever wanted regions in the BED file; I selected something like 21\t0\t46000000, and it was throwing errors with that single region (but not with multiple regions) because it could not make windows, I think?

EDIT3: It turns out I had copied the wrong version of the script, one with a bug I thought I had already squashed. Don't worry about EDIT and EDIT2; things seem to be working with the bug squashed.

jelber2 commented 8 months ago

This is the current version of the shredBAM.jl source code: https://gist.github.com/jelber2/47129820373474a768dacabc0e686fdc

brendanofallon commented 8 months ago

Thanks Jean! I'll definitely look into long-read support in Jenever; it's a great idea and something that's not handled well now, but I think it'll take a bit more refactoring to do it right, and I'll need to think about the best way to tackle it.