COMBINE-lab / pufferfish

An efficient index for the colored, compacted, de Bruijn graph
GNU General Public License v3.0

Parallelize fixFasta #39

Open hermidalc opened 1 year ago

hermidalc commented 1 year ago

The initial fixFasta step of Pufferfish indexing is single-threaded, so it takes a long time when the reference contains many sequences. From the outside, it looks like this step could be parallelized: split the input reference FASTA into parts, run fixFasta on each part, and concatenate the fixed parts into one. The split could be done with, for example, the fast SeqKit toolkit and its split2 command, which reads a gzipped or regular reference FASTA and can write gzipped or regular split FASTA files (to save disk space, for example).
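
To make the split / fix / concatenate idea concrete, here is a rough Python sketch. It is not Pufferfish code: `fix_batch` is a hypothetical stand-in for whatever per-record work fixFasta does (here it only uppercases the sequence), and `read_fasta`, `batched`, and `parallel_fix` are names made up for illustration. It also assumes each record can be fixed independently; any step that needs to see all records at once would need an extra merge pass after the concatenation.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batched(records, size):
    """Group records into lists of at most `size` items (the 'split' step)."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fix_batch(batch):
    """Hypothetical stand-in for the per-record work of fixFasta."""
    return [(header, seq.upper()) for header, seq in batch]


def parallel_fix(lines, batch_size=1000, workers=4,
                 executor_cls=ProcessPoolExecutor):
    """Fix batches in parallel, then concatenate them in input order."""
    fixed = []
    with executor_cls(max_workers=workers) as pool:
        # map() returns results in submission order, so the concatenated
        # output keeps the record order of the input reference.
        for out in pool.map(fix_batch, batched(read_fasta(lines), batch_size)):
            fixed.extend(out)
    return fixed
```

Passing `executor_cls=ThreadPoolExecutor` is enough to try the sketch on a small input; separate processes are what would give a real speedup for CPU-bound work. With the default process pool, the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.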