TurakhiaLab / ROADIES

Tool for fully-automated inference of species trees from raw genome assemblies
https://turakhia.ucsd.edu/ROADIES/
MIT License

failing on large input sequences #10

Closed NullModel closed 8 months ago

NullModel commented 8 months ago

Given three input sequences:

```
-rw-r--r-- 1 2867378923 Oct 3 20:18 GCA_011078405.1.fa
-rw-r--r-- 1 2960276849 Oct 3 20:18 GCA_016695395.2.fa
-rw-r--r-- 1 6152963590 Oct 6 21:20 GCA_020510985.1.fa
```

This large sequence size:

```
faSize GCA_020510985.1.fa
6032307556 bases (52780824 N's 5979526732 real 3026980659 upper 2952546073 lower) in 512 sequences in 1 files
%48.95 masked total, %49.38 masked real
```

Is crashing out in lastz:

```
[Fri Oct 13 15:57:16 2023]
rule lastz:
    input: results/samples/out.fa, /private/groups/gbrowser/VGP/bigOnes/GCA_020510985.1.fa
    output: results/alignments/GCA_020510985.1.maf
    jobid: 7
    benchmark: results/benchmarks/GCA_020510985.1.lastz.txt
    reason: Missing output files: results/alignments/GCA_020510985.1.maf; Input files updated by another job: results/samples/out.fa
    wildcards: sample=GCA_020510985.1
    threads: 2
    resources: tmpdir=/data/tmp

FAILURE: in load_fasta_sequence for /private/groups/gbrowser/VGP/bigOnes/GCA_020510985.1.fa, sequence length 4,151,866,407+358,931,967 exceeds maximum (4,294,967,285)
[Fri Oct 13 15:59:45 2023]
Error in rule lastz:
    jobid: 7
    input: results/samples/out.fa, /private/groups/gbrowser/VGP/bigOnes/GCA_020510985.1.fa
    output: results/alignments/GCA_020510985.1.maf
    shell:

            if [[ "/private/groups/gbrowser/VGP/bigOnes/GCA_020510985.1.fa" == *.gz ]]; then
                    lastz_32 <(gunzip -dc /private/groups/gbrowser/VGP/bigOnes/GCA_020510985.1.fa)[multiple] results/samples/out.fa --coverage=85 --continuity=85 --filter=identity:65 --format=maf --output=results/alignments/GCA_020510985.1.maf --ambiguous=iupac --step=1 --notransition --queryhspbest=20
            else
                    lastz_32 /private/groups/gbrowser/VGP/bigOnes/GCA_020510985.1.fa[multiple] results/samples/out.fa --coverage=85 --continuity=85 --filter=identity:65 --format=maf --output=results/alignments/GCA_020510985.1.maf --ambiguous=iupac --step=1 --notransition --queryhspbest=20
            fi
```
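For reference, the numbers in the FAILURE line can be checked by hand. The sketch below assumes the message reads as "running total so far + length of the next sequence"; that is an interpretation of the log, not something confirmed from the lastz source. The names `LASTZ_LIMIT` and `first_overflow` are made up for illustration, and any padding lastz may add between sequences is ignored.

```python
# Maximum quoted in the FAILURE message above.
LASTZ_LIMIT = 4_294_967_285


def first_overflow(lengths, limit=LASTZ_LIMIT):
    """Return (index, running_total, length) for the first sequence whose
    addition would push the running total past `limit`, or None if all fit.

    Hypothetical helper: it only mirrors the arithmetic in the log line.
    """
    total = 0
    for index, length in enumerate(lengths):
        if total + length > limit:
            return index, total, length
        total += length
    return None


# The two figures from the log: bases loaded so far, then the next sequence.
loaded, incoming = 4_151_866_407, 358_931_967

print(loaded + incoming)                   # combined length lastz was asked to hold
print(loaded + incoming - LASTZ_LIMIT)     # how far over the limit that is
print(2**32 - LASTZ_LIMIT)                 # distance of the limit from 2**32
print(first_overflow([loaded, incoming]))  # which sequence trips the check
```

The limit sits just below 2**32, which is consistent with sequence positions being held in an unsigned 32-bit integer, so any concatenated target longer than roughly 4.29 billion bases would be rejected.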

Is this a lastz_32 limitation? There is a 64-bit version.
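A quick way to see in advance whether an assembly would hit this ceiling is to total its sequence lengths and compare against the maximum from the error message. This is only a sketch: `fasta_record_lengths`, `total_bases`, `fits_in_one_pass`, and `LASTZ_32_MAX` are hypothetical names, not part of ROADIES or lastz, and the check assumes the limit applies to the plain sum of bases.

```python
import io

# Ceiling quoted in the lastz FAILURE message in this issue.
LASTZ_32_MAX = 4_294_967_285


def fasta_record_lengths(handle):
    """Yield (name, length) for each record in an uncompressed FASTA text stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            # Sequence line: count every character, including N and soft-masked bases.
            length += len(line)
    if name is not None:
        yield name, length


def total_bases(handle):
    """Sum the lengths of all records in the stream."""
    return sum(length for _, length in fasta_record_lengths(handle))


def fits_in_one_pass(total, limit=LASTZ_32_MAX):
    """True if `total` bases stay within the assumed per-target limit."""
    return total <= limit


# Tiny in-memory example standing in for a real assembly file.
demo = io.StringIO(">chr1 example\nACGT\nacg\n>chr2\nNN\n")
print(list(fasta_record_lengths(demo)))

# Base count reported by faSize for GCA_020510985.1.fa in this issue.
print(fits_in_one_pass(6_032_307_556))
```

For a real file, the same functions could be fed `open(path)` or, for a `.gz` assembly, `gzip.open(path, "rt")`.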

ang037 commented 8 months ago

I have now fixed the script to use the correct lastz_32 version, which supports large genomes of >4GB. We had earlier tested and verified that large mammalian genomes of >4GB work, but the recent script did not use the updated lastz_32 version.