HuttleyLab / DiverseSeq

Tools for analysis of sequence divergence
BSD 3-Clause "New" or "Revised" License

app pipelines to speed up divergent commands #15

Closed KatherineCaley closed 6 months ago

KatherineCaley commented 6 months ago

Known issues with this implementation:

Comments on usage of the dvgt CLI

In prep, the input data store that wraps the sequence data is written to a temp directory, because I didn't want to duplicate the input sequence data. However, this feels like a bad idea: it can violate data provenance via the source attribute of DataMembers. There is probably a better way to handle this.

Usage of the apps


# data store setup
# if the FASTA sequences are in a single file
convert2dstore = dvgt_seq_file_to_data_store()
in_seqs = convert2dstore("path/to/fasta/file")

# if the FASTA sequences are in multiple files in one directory
in_seqs = DirectoryDataStore("path/to/directory/of/fasta/files")

# output data store for the prepped sequences
# (`limit` is not defined in this snippet and must be set beforehand)
out_dstore = HDF5DataStore(source="path/to/write/prepped/seqs.dvgtseqs", limit=limit)

# prep: load the sequences and write them to the output data store
prep_pipeline = dvgt_load_seqs(moltype="dna") + dvgt_write_prepped_seqs(out_dstore)
prepped_seqs = prep_pipeline.apply_to(in_seqs, show_progress=True, parallel=True)

# max: returns a table of the maximally divergent seqs
max_app = dvgt_calc("max")
max_app("path/to/dvgtseqs")

@GavinHuttley, this definitely needs more refinement and work. Happy to keep working on it or hand it over.

GavinHuttley commented 6 months ago

Fails on Windows when trying to call max ... Should it be rewritten so that the file is opened and closed in some other way?

Is the HDF5 file being opened in read-only mode? For now, mark the failing test as an xfail with the reason being "windows specific failure to be resolved". Then create an issue with the traceback.
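A minimal sketch of the xfail marking suggested above, assuming the test suite uses pytest. The test name `test_max_on_prepped_seqs` and its body are hypothetical placeholders, not the actual failing test in this repository.

```python
import sys

import pytest


# Expect failure only on Windows; other platforms run the test normally.
@pytest.mark.xfail(
    sys.platform == "win32",
    reason="windows specific failure to be resolved",
)
def test_max_on_prepped_seqs():
    # Hypothetical placeholder: the real test would call the max app
    # on a prepped .dvgtseqs file and check the resulting table.
    pass
```

Making the marker conditional on the platform means an unexpected failure on Linux or macOS is still reported as a real failure.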

I'll then merge into that specific branch and hand over to @khiron. As @khiron has a Windows box, he can run simple experiments to see if this is HDF5 specific. Perhaps it acquires a file lock even in read-only mode?
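One possible form for such an experiment is sketched below, using only the standard library. The helper names `can_delete_while_open` and `plain_read_only` are hypothetical. The control case holds an ordinary file open in read-only mode and checks whether the OS still allows it to be deleted; running the same check with an HDF5 opener would show whether the behaviour differs.

```python
import os
import tempfile


def can_delete_while_open(path, opener):
    """Return True if `path` can be deleted while `opener(path)` holds it open.

    `opener` is any callable that returns an object with a close() method.
    The file is always removed before returning.
    """
    held = opener(path)
    try:
        try:
            os.remove(path)
            return True
        except PermissionError:
            # The OS refused the delete because the file is still held open.
            return False
    finally:
        held.close()
        if os.path.exists(path):
            os.remove(path)


def plain_read_only(path):
    # Control case: ordinary read-only handle, no HDF5 involved.
    return open(path, "rb")


# Create a throwaway file and run the control experiment.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
print("plain file deletable while open:", can_delete_while_open(tmp_path, plain_read_only))
```

To test HDF5 itself, the same function could be given a valid HDF5 file and an opener such as `lambda p: h5py.File(p, "r")` (assuming h5py is installed). If the plain-file control already fails on Windows, the problem is general OS file-handle behaviour rather than anything HDF5 specific.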

KatherineCaley commented 6 months ago

@GavinHuttley @khiron now passing on Windows