SUBWORKFLOW: Split by size

DLBPointon commented 6 months ago

Subworkflow to split a FASTA file into chunks of size X, where X is the requested size of each output file.

splitbysize -f {100MB FASTA} -s {5MB} -o {OUTPUTDIR}

Default naming: original name + "S{ITERATOR}" for splits 1, 2, 3, ...

There will need to be a check for oversized scaffolds: if a scaffold is 7MB then you can't just split it at 5MB. The whole scaffold needs to be saved to one file. At the same time, the final file may end up very small in the case of very fragmented genomes, e.g. 5 scaffolds of 1000 bp?

DLBPointon commented 6 months ago

Medium difficulty, as I don't know how you'd check size.

There is a validate_fasta function in the generics

Use noodles to read the FASTA and count the length of each scaffold? If scaff > 5MB then output(), else add_to_Vec? Then check the Vec size with and without the next scaffold?
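To make the idea above concrete, here is a minimal sketch of the grouping step only. It is not the FasMan implementation: `chunk_by_size` is a hypothetical name, the FASTA reading (noodles or otherwise) is left out, and records are assumed to already be available as (name, sequence length) pairs. It also assumes "size" means summed sequence length, which ignores header and newline bytes in the final file.

```rust
/// Greedily group records, in input order, into chunks whose summed
/// length stays within `max_size`.
/// A record longer than `max_size` is never split: it gets a chunk of its own.
/// (Hypothetical helper; input is assumed to be (name, sequence length) pairs.)
fn chunk_by_size(records: &[(String, usize)], max_size: usize) -> Vec<Vec<String>> {
    let mut chunks: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_size = 0usize;

    for (name, len) in records {
        if *len > max_size {
            // Oversized scaffold: save it whole, on its own.
            chunks.push(vec![name.clone()]);
            continue;
        }
        if current_size + *len > max_size && !current.is_empty() {
            // Adding this record would overflow the open chunk: close it first.
            chunks.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(name.clone());
        current_size += *len;
    }
    // Whatever is left becomes the final (possibly very small) chunk.
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Each inner Vec would then map to one output file. Note that the last chunk can still be tiny, as raised in the first comment.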

DLBPointon commented 6 months ago

As of 748379d

The module is mostly complete: the functions work and generate the appropriate folder structure.

TODO:

Add header sanitisation as an option (for data downloaded from NCBI/Ensembl).
Add memory suffixes. I don't want to write 1000000000 instead of 1Gb.
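For the suffix TODO, a rough sketch of what the parsing could look like. `parse_size` is a hypothetical helper, not something already in FasMan. It assumes decimal multipliers (1K = 1,000), case-insensitive suffixes and an optional trailing "b"; whether the suffixes should be decimal or binary is an open choice.

```rust
/// Parse a size such as "5MB", "1Gb" or "1000000" into a plain number.
/// Assumptions: suffixes are case-insensitive and decimal (1K = 1_000),
/// and a trailing "b"/"B" is optional.
fn parse_size(input: &str) -> Result<u64, String> {
    let lower = input.trim().to_ascii_lowercase();
    // Drop an optional trailing 'b', so "mb" and "m" are treated the same.
    let body = lower.strip_suffix('b').unwrap_or(lower.as_str());

    // Split off a one-letter multiplier if there is one.
    let (digits, multiplier) = match body.chars().last() {
        Some('k') => (&body[..body.len() - 1], 1_000u64),
        Some('m') => (&body[..body.len() - 1], 1_000_000),
        Some('g') => (&body[..body.len() - 1], 1_000_000_000),
        _ => (body, 1),
    };

    let value: u64 = digits
        .trim()
        .parse()
        .map_err(|e| format!("invalid size '{input}': {e}"))?;

    value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("size '{input}' overflows u64"))
}
```

With something like this, the size argument could be taken as a string and converted once at startup, so "1Gb" and "1000000000" end up as the same value.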

NOTES:

There is NO order retention because records are stored in HashMaps. This could be fixed by explicitly saving records in order in a Vec.
- Ultimately, for my use case we don't need the order.
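If order ever does matter, one possible shape for the Vec-based fix is sketched below. `OrderedRecords` is a hypothetical type, not part of the module: records live in a Vec in the order they were read, and a HashMap only maps names to positions so lookups by name stay cheap. It assumes record names are unique.

```rust
use std::collections::HashMap;

/// Records kept in input order, with a name -> position index for lookups.
/// (Hypothetical type; assumes record names are unique.)
struct OrderedRecords {
    records: Vec<(String, String)>, // (name, sequence) in the order read
    index: HashMap<String, usize>,  // name -> position in `records`
}

impl OrderedRecords {
    fn new() -> Self {
        OrderedRecords {
            records: Vec::new(),
            index: HashMap::new(),
        }
    }

    /// Append a record, remembering where it sits in the Vec.
    fn insert(&mut self, name: String, seq: String) {
        self.index.insert(name.clone(), self.records.len());
        self.records.push((name, seq));
    }

    /// Look a sequence up by name via the index.
    fn get(&self, name: &str) -> Option<&str> {
        self.index.get(name).map(|&i| self.records[i].1.as_str())
    }

    /// Names in the order they were inserted.
    fn names(&self) -> Vec<&str> {
        self.records.iter().map(|(n, _)| n.as_str()).collect()
    }
}
```

Iterating over `records` then gives the original file order, whereas iterating a HashMap directly gives no guaranteed order.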

TESTS:

Local primitive testing results in:

$: cargo run splitbysize -f test_data/iyAndFlav1/iyAndFlav1.20231011.decontaminated.fa -m 1000000 -o ./here -d cdna
$: for i in here/iyAndFlav1/cdna/*fasta; do ls -lh $i; grep '>' $i | wc -l ; done

-rw-r--r-- 1 dp24 staff 695K May 29 14:57 here/iyAndFlav1/cdna/iyAndFlav1_f7_cdna.fasta
8
-rw-r--r-- 1 dp24 staff 41M May 29 14:57 here/iyAndFlav1/cdna/iyAndFlav1_f8_cdna.fasta
1
-rw-r--r-- 1 dp24 staff 529K May 29 14:57 here/iyAndFlav1/cdna/iyAndFlav1_f9_cdna.fasta
2

This was when trying to save 1Mb files. However, looking into these, they are legitimate: for example, the 41M file holds a single scaffold, which has to be saved whole. The module is trying to save as close to 1Mb per file as possible. I think the algorithm could be made better, but that would require additional sorting logic to join tiny scaffolds to medium scaffolds to get closer to 1Mb files on average.
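One candidate for that sorting logic is first-fit decreasing bin packing, a standard heuristic rather than anything currently in the module. The sketch below uses a hypothetical `pack_first_fit_decreasing` function and assumes records are available as (name, sequence length) pairs. It sorts scaffolds largest first, then drops each one into the first file that still has room. Like the HashMap approach, it does not retain input order.

```rust
/// First-fit decreasing: sort records largest first, then place each one
/// in the first bin (output file) that still has room for it.
/// Records longer than `max_size` end up alone in their own bin.
/// (Hypothetical helper; input is assumed to be (name, sequence length) pairs.)
fn pack_first_fit_decreasing(records: &[(String, usize)], max_size: usize) -> Vec<Vec<String>> {
    let mut sorted: Vec<&(String, usize)> = records.iter().collect();
    // Largest first; sort_by is stable, so equal lengths keep input order.
    sorted.sort_by(|a, b| b.1.cmp(&a.1));

    // Each bin tracks (space used so far, names of the records in it).
    let mut bins: Vec<(usize, Vec<String>)> = Vec::new();
    for (name, len) in sorted {
        // Look for the first bin this record still fits into.
        let slot = bins.iter().position(|(used, _)| *used + *len <= max_size);
        match slot {
            Some(pos) => {
                bins[pos].0 += *len;
                bins[pos].1.push(name.clone());
            }
            // No bin has room (or the record is oversized): open a new one.
            None => bins.push((*len, vec![name.clone()])),
        }
    }
    bins.into_iter().map(|(_, names)| names).collect()
}
```

Because small scaffolds are placed last, they tend to fill the gaps left by medium ones, which should pull average file size closer to the requested limit. The cost is a sort plus a scan over open bins for each record.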

Rust-Wellcome / FasMan

SUBWORKFLOW: Split by size #13