Closed KatherineCaley closed 7 months ago
For a simple implementation, we do see a speed up for loading from hdf5, although writing to file is not very performant.
hows performance affected if you turn on compression? both write and read should improve. I think h5py comes with a LZF algorithm which is fast and there's a hd5 plugin for blosc2 (?)
Hold up, i profiled the code and the writing is trivial, its the conversion to indices that is taking most of the time
OK! What's the code being used for conversion? Is that standard cogent3? If so, maybe try divergent.util.str2arr
I'm using to_indices
alphbt = seqs.moltype.alphabets.degen_gapped
...
with h5py.File(outpath, mode="w") as f:
for s in seqs.iter_seqs():
s_i = alphbt.to_indices(s)
I see, str2arr
is very quick!
Yes, that needs to be migrated into cogent3 alphabet.
I'm working on the best way to remove the overlap in max and add handling for whether the data is coming in as a hdf5 of fasta file
As discussed in this morning's meeting we will require preprocess, enforced by checking for a custom suffix
The
preprocess
subcommand should convert input sequences into an numpy array of indices (usingto_indices
via the moltype), and write them as a hdf5 file.Every genome is a top-level database for which the link/name to the database is the species name. For each genome, we will store the following attributes:
As discussed, we will not compress the hdf5 file for starters