ENH: add `preprocess` subcommand

HuttleyLab / DiverseSeq

Tools for analysis of sequence divergence

BSD 3-Clause "New" or "Revised" License

3 stars 3 forks source link

ENH: add `preprocess` subcommand #8

Closed KatherineCaley closed 7 months ago

KatherineCaley commented 7 months ago

The preprocess subcommand should convert input sequences into an numpy array of indices (using to_indices via the moltype), and write them as a hdf5 file.

Every genome is a top-level database for which the link/name to the database is the species name. For each genome, we will store the following attributes:

molecular type (as a str)
path to the original file

As discussed, we will not compress the hdf5 file for starters

KatherineCaley commented 7 months ago

For a simple implementation, we do see a speed up for loading from hdf5, although writing to file is not very performant.

writing (already loaded sequences) to hdf5 takes 3.3X longer than just loading the sequences
reading from hdf5 and recapitulating the original sequences is 3.3X faster than loading the sequences

GavinHuttley commented 7 months ago

hows performance affected if you turn on compression? both write and read should improve. I think h5py comes with a LZF algorithm which is fast and there's a hd5 plugin for blosc2 (?)

KatherineCaley commented 7 months ago

Hold up, i profiled the code and the writing is trivial, its the conversion to indices that is taking most of the time

GavinHuttley commented 7 months ago

OK! What's the code being used for conversion? Is that standard cogent3? If so, maybe try divergent.util.str2arr

KatherineCaley commented 7 months ago

I'm using to_indices

 alphbt = seqs.moltype.alphabets.degen_gapped
    ...
    with h5py.File(outpath, mode="w") as f:  
        for s in seqs.iter_seqs():
            s_i = alphbt.to_indices(s)

KatherineCaley commented 7 months ago

I see, str2arr is very quick!

GavinHuttley commented 7 months ago

Yes, that needs to be migrated into cogent3 alphabet.

KatherineCaley commented 7 months ago

I'm working on the best way to remove the overlap in max and add handling for whether the data is coming in as a hdf5 of fasta file

As discussed in this morning's meeting we will require preprocess, enforced by checking for a custom suffix