HuttleyLab / DiverseSeq

Tools for analysis of sequence divergence
BSD 3-Clause "New" or "Revised" License
3 stars 3 forks source link

ENH: add `preprocess` subcommand #8

Closed KatherineCaley closed 7 months ago

KatherineCaley commented 7 months ago

The preprocess subcommand should convert input sequences into an numpy array of indices (using to_indices via the moltype), and write them as a hdf5 file.

Every genome is a top-level database for which the link/name to the database is the species name. For each genome, we will store the following attributes:

As discussed, we will not compress the hdf5 file for starters

KatherineCaley commented 7 months ago

For a simple implementation, we do see a speed up for loading from hdf5, although writing to file is not very performant.

GavinHuttley commented 7 months ago

hows performance affected if you turn on compression? both write and read should improve. I think h5py comes with a LZF algorithm which is fast and there's a hd5 plugin for blosc2 (?)

KatherineCaley commented 7 months ago

Hold up, i profiled the code and the writing is trivial, its the conversion to indices that is taking most of the time

GavinHuttley commented 7 months ago

OK! What's the code being used for conversion? Is that standard cogent3? If so, maybe try divergent.util.str2arr

KatherineCaley commented 7 months ago

I'm using to_indices

 alphbt = seqs.moltype.alphabets.degen_gapped
    ...
    with h5py.File(outpath, mode="w") as f:  
        for s in seqs.iter_seqs():
            s_i = alphbt.to_indices(s)
KatherineCaley commented 7 months ago

I see, str2arr is very quick!

GavinHuttley commented 7 months ago

Yes, that needs to be migrated into cogent3 alphabet.

KatherineCaley commented 7 months ago

I'm working on the best way to remove the overlap in max and add handling for whether the data is coming in as a hdf5 of fasta file

As discussed in this morning's meeting we will require preprocess, enforced by checking for a custom suffix