Closed rmcar17 closed 3 weeks ago
This PR optimizes CLI use of the cluster tree functionality by introducing a new class, dvs_cli_par_ctree, that reads sequences directly from HDF5 storage, eliminating unnecessary sequence format conversions. It also speeds up the Mash sketch calculation by optimizing data structures and using NumPy arrays instead of Python lists.
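For readers unfamiliar with the Mash sketch mentioned above, the following is a minimal, standard-library-only illustration of a bottom-s sketch and the Mash distance formula D = -(1/k) ln(2j / (1 + j)). It is not the code in src/diverse_seq/distance.py: the function names, the choice of hash (BLAKE2b), and the use of plain Python sets and lists are all assumptions made for illustration, whereas the PR itself moves this kind of work onto NumPy arrays.

```python
import hashlib
import heapq
import math


def kmer_hashes(seq: str, k: int) -> set[int]:
    """Hash every k-mer of seq to a 64-bit integer (hypothetical helper)."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i : i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def mash_sketch(seq: str, k: int, sketch_size: int) -> list[int]:
    """Bottom-s sketch: the sketch_size smallest distinct k-mer hashes, ascending."""
    return heapq.nsmallest(sketch_size, kmer_hashes(seq, k))


def mash_distance(a: list[int], b: list[int], k: int, sketch_size: int) -> float:
    """Estimate the Mash distance between two sketches."""
    # Sketch of the union: the smallest sketch_size hashes across both inputs.
    union = heapq.nsmallest(sketch_size, set(a) | set(b))
    if not union:
        return 1.0
    # Hashes in the union sketch that occur in both input sketches.
    shared = len(set(a) & set(b) & set(union))
    j = shared / len(union)  # Jaccard estimate
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

Identical sequences give a Jaccard estimate of 1 and therefore a distance of 0, while sequences that share no k-mers are assigned the maximum distance of 1.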
sequenceDiagram
participant User
participant CLI
participant dvs_cli_par_ctree
participant HDF5DataStore
participant ClusterTree
User->>CLI: Run ctree command
CLI->>dvs_cli_par_ctree: Initialize with parameters
dvs_cli_par_ctree->>HDF5DataStore: Read sequences from HDF5
HDF5DataStore-->>dvs_cli_par_ctree: Return sequence data
dvs_cli_par_ctree->>ClusterTree: Construct cluster tree
ClusterTree-->>dvs_cli_par_ctree: Return PhyloNode
dvs_cli_par_ctree-->>CLI: Return cluster tree
CLI-->>User: Output tree to file
classDiagram
class ClusterTreeBase {
<<abstract>>
}
class DvsParCtreeMixin {
+_mash_dist(seq_arrays: Sequence[SeqArray])
}
class dvs_par_ctree {
+__init__(k: int, sketch_size: int | None, moltype: str, distance_mode: Literal["mash", "euclidean"], mash_canonical_kmers: bool | None, show_progress: bool, max_workers: int | None, parallel: bool)
+main(seqs: c3_types.SeqsCollectionType) : PhyloNode
}
class dvs_cli_par_ctree {
+__init__(seq_store: str | Path, limit: int | None, k: int, sketch_size: int | None, moltype: str, distance_mode: Literal["mash", "euclidean"], mash_canonical_kmers: bool | None, show_progress: bool, max_workers: int | None, parallel: bool)
+main(seq_names: list[str]) : PhyloNode
}
ClusterTreeBase <|-- dvs_par_ctree
ClusterTreeBase <|-- dvs_cli_par_ctree
DvsParCtreeMixin <|.. dvs_par_ctree
DvsParCtreeMixin <|.. dvs_cli_par_ctree
note for dvs_cli_par_ctree "New class introduced to optimize CLI usage by reading sequences directly from HDF5 storage."
Change | Details | Files |
---|---|---|
Refactored cluster tree implementation by extracting common functionality into a mixin class | | src/diverse_seq/cluster.py |
Added new CLI-optimized parallel cluster tree implementation | | src/diverse_seq/cluster.py, src/diverse_seq/cli.py |
Optimized Mash sketch calculation performance | | src/diverse_seq/distance.py |
Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
---|---|---|---|
src/diverse_seq/cli.py | 4 | 5 | 80.0% |
src/diverse_seq/cluster.py | 40 | 43 | 93.02% |
src/diverse_seq/distance.py | 1 | 7 | 14.29% |
<!-- | Total: | 45 | 55 | 81.82% | --> |
Files with Coverage Reduction | New Missed Lines | % |
---|---|---|
src/diverse_seq/data_store.py | 2 | 87.82% |
src/diverse_seq/distance.py | 4 | 83.85% |
<!-- | Total: | 6 | --> |
Totals | |
---|---|
Change from base Build 11720616749: | -1.0% |
Covered Lines: | 1188 |
Relevant Lines: | 1308 |
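The percentages in the coverage tables above follow directly from the line counts. As a quick check, this short sketch recomputes them from the reported numbers; the variable and function names are illustrative only.

```python
# (covered lines, changed/added lines) per file, copied from the report.
changed = {
    "src/diverse_seq/cli.py": (4, 5),
    "src/diverse_seq/cluster.py": (40, 43),
    "src/diverse_seq/distance.py": (1, 7),
}


def pct(covered: int, total: int) -> float:
    """Coverage as a percentage, rounded to two decimals."""
    return round(100 * covered / total, 2)


per_file = {path: pct(c, t) for path, (c, t) in changed.items()}

covered_total = sum(c for c, _ in changed.values())
changed_total = sum(t for _, t in changed.values())
patch_pct = pct(covered_total, changed_total)  # coverage of the changed lines

overall_pct = pct(1188, 1308)  # covered lines / relevant lines for the build
```

Most of the shortfall on changed lines comes from src/diverse_seq/distance.py, where only 1 of the 7 changed lines is covered.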
Summary by Sourcery
Enhancements:
Refactor the dvs_par_ctree class to use a mixin for shared functionality, improving code modularity and reusability.