Rust-Wellcome / FasMan

A re-write (+ extras) of Python scripts, used in Tree of Life, into a single Rust script.

SUBWORKFLOW: Split by size #13

Closed DLBPointon closed 5 months ago

DLBPointon commented 6 months ago

Subworkflow to split a FASTA file into chunks of size X, where X is the requested size of each output file.

splitbysize -f {100MB FASTA} -s {5MB} -o {OUTPUTDIR}

Default output naming: original name + "S{ITERATOR}", where the iterator counts the splits 1, 2, 3, ...

There will need to be a check for oversized scaffolds: if a scaffold is 7MB, you can't just split it at 5MB. The whole scaffold needs to be saved to one file. At the same time, the final file may end up very small in the case of very fragmented genomes, e.g. 5 scaffolds of 1000 bp.

DLBPointon commented 6 months ago

Medium as I don't know how you'd check size.

There is a validate_fasta function in the generics

Use noodles to read the FASTA and count the length of each scaffold? If a scaffold is > 5MB then output() it on its own, else add it to a Vec? Then check the Vec's total size with and without the new scaffold?
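The idea above can be sketched roughly as follows. This is only a minimal illustration, not the FasMan implementation: the `Scaffold` struct and the `split_by_size` function are hypothetical names, and the sketch assumes the scaffold names and lengths have already been read (for example with noodles), so it uses the standard library only. Sizes are counted in bases rather than bytes on disk.

```rust
/// Hypothetical summary of one FASTA record: its name and sequence length in bases.
#[derive(Debug, Clone, PartialEq)]
struct Scaffold {
    name: String,
    len: usize,
}

/// Group scaffolds into chunks of at most `max_size` bases.
/// Scaffolds are never cut: one longer than `max_size` becomes a chunk of its own.
/// Within a chunk, scaffolds keep their input order.
fn split_by_size(scaffolds: &[Scaffold], max_size: usize) -> Vec<Vec<Scaffold>> {
    let mut chunks: Vec<Vec<Scaffold>> = Vec::new();
    let mut current: Vec<Scaffold> = Vec::new();
    let mut current_size = 0usize;

    for scaff in scaffolds {
        if scaff.len > max_size {
            // Oversized scaffold: output it whole; the pending chunk stays open.
            chunks.push(vec![scaff.clone()]);
            continue;
        }
        // "With and without" check: would adding this scaffold overflow the chunk?
        if current_size + scaff.len > max_size {
            chunks.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += scaff.len;
        current.push(scaff.clone());
    }
    // Flush whatever is left; this last chunk may be much smaller than `max_size`.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With a 5MB limit, a 7MB scaffold would end up alone in its own file, and a handful of 1000 bp scaffolds at the end of the input would be written together as one small final file.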

DLBPointon commented 6 months ago

As of 748379d

The module is mostly complete: the functions work and generate the appropriate folder structure.

TODO:

NOTES:

TESTS:

$: cargo run splitbysize -f test_data/iyAndFlav1/iyAndFlav1.20231011.decontaminated.fa -m 1000000 -o ./here -d cdna
$: for i in here/iyAndFlav1/cdna/*fasta; do ls -lh $i; grep '>' $i | wc -l ; done

-rw-r--r-- 1 dp24 staff 695K May 29 14:57 here/iyAndFlav1/cdna/iyAndFlav1_f7_cdna.fasta
8
-rw-r--r-- 1 dp24 staff 41M May 29 14:57 here/iyAndFlav1/cdna/iyAndFlav1_f8_cdna.fasta
1
-rw-r--r-- 1 dp24 staff 529K May 29 14:57 here/iyAndFlav1/cdna/iyAndFlav1_f9_cdna.fasta
2

This was with trying to save 1Mb files. However, looking into these, they are legit: the 41M file holds a single scaffold that cannot be split, and for the others it is trying to save as close to 1Mb as possible. I think the algorithm could be made better, but that would require additional sorting logic to join tiny scaffs to medium scaffs so that files get closer to 1Mb on average.
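One possible shape for that extra sorting logic is the first-fit decreasing heuristic from bin packing: sort scaffolds longest first, then drop each one into the first file that still has room. This is a swapped-in standard technique shown as a sketch, not what the module currently does; the function name `pack_first_fit_decreasing` and the `(name, length)` tuples are hypothetical, and the original scaffold order is not preserved.

```rust
/// First-fit decreasing: sort scaffolds longest first, then place each one in the
/// first chunk that still has room for it. A scaffold longer than `max_size`
/// never fits an existing chunk, so it ends up alone in a chunk of its own.
fn pack_first_fit_decreasing(
    scaffolds: &[(String, usize)],
    max_size: usize,
) -> Vec<Vec<(String, usize)>> {
    let mut sorted = scaffolds.to_vec();
    // Longest first; the sort is stable, so equal lengths keep their input order.
    sorted.sort_by(|a, b| b.1.cmp(&a.1));

    let mut chunks: Vec<Vec<(String, usize)>> = Vec::new();
    let mut sizes: Vec<usize> = Vec::new(); // bases already placed in each chunk

    for scaff in sorted {
        // Index of the first chunk with enough remaining space, if any.
        let slot = sizes.iter().position(|&used| used + scaff.1 <= max_size);
        match slot {
            Some(i) => {
                sizes[i] += scaff.1;
                chunks[i].push(scaff);
            }
            None => {
                // Nothing has room: open a new chunk.
                sizes.push(scaff.1);
                chunks.push(vec![scaff]);
            }
        }
    }
    chunks
}
```

The trade-off is that every scaffold length has to be known before any file is written, so this would need either two passes over the FASTA or holding the records in memory.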