HuttleyLab / DiverseSeq

Tools for analysis of sequence divergence
BSD 3-Clause "New" or "Revised" License

ENH: Optimise CLI Usage of `ctree` #71

Closed rmcar17 closed 3 weeks ago

rmcar17 commented 3 weeks ago

Summary by Sourcery

Enhancements:

sourcery-ai[bot] commented 3 weeks ago

Reviewer's Guide by Sourcery

This PR optimizes the CLI usage of the cluster tree functionality by introducing a new class dvs_cli_par_ctree that directly reads sequences from HDF5 storage, eliminating unnecessary sequence format conversions. It also includes performance improvements in the Mash sketch calculation by optimizing data structures and using NumPy arrays instead of Python lists.

Sequence diagram for optimized CLI usage of cluster tree

sequenceDiagram
    participant User
    participant CLI
    participant dvs_cli_par_ctree
    participant HDF5DataStore
    participant ClusterTree

    User->>CLI: Run ctree command
    CLI->>dvs_cli_par_ctree: Initialize with parameters
    dvs_cli_par_ctree->>HDF5DataStore: Read sequences from HDF5
    HDF5DataStore-->>dvs_cli_par_ctree: Return sequence data
    dvs_cli_par_ctree->>ClusterTree: Construct cluster tree
    ClusterTree-->>dvs_cli_par_ctree: Return PhyloNode
    dvs_cli_par_ctree-->>CLI: Return cluster tree
    CLI-->>User: Output tree to file

Updated class diagram for cluster tree classes

classDiagram
    class ClusterTreeBase {
        <<abstract>>
    }

    class DvsParCtreeMixin {
        +_mash_dist(seq_arrays: Sequence[SeqArray])
    }

    class dvs_par_ctree {
        +__init__(k: int, sketch_size: int | None, moltype: str, distance_mode: Literal["mash", "euclidean"], mash_canonical_kmers: bool | None, show_progress: bool, max_workers: int | None, parallel: bool)
        +main(seqs: c3_types.SeqsCollectionType) : PhyloNode
    }

    class dvs_cli_par_ctree {
        +__init__(seq_store: str | Path, limit: int | None, k: int, sketch_size: int | None, moltype: str, distance_mode: Literal["mash", "euclidean"], mash_canonical_kmers: bool | None, show_progress: bool, max_workers: int | None, parallel: bool)
        +main(seq_names: list[str]) : PhyloNode
    }

    ClusterTreeBase <|-- dvs_par_ctree
    ClusterTreeBase <|-- dvs_cli_par_ctree
    DvsParCtreeMixin <|.. dvs_par_ctree
    DvsParCtreeMixin <|.. dvs_cli_par_ctree

    note for dvs_cli_par_ctree "New class introduced to optimize CLI usage by reading sequences directly from HDF5 storage."

File-Level Changes

Change: Refactored cluster tree implementation by extracting common functionality into a mixin class
Details:
  • Created DvsParCtreeMixin class to share common functionality between cluster tree implementations
  • Moved _mash_dist and related methods to the mixin class
Files:
  • src/diverse_seq/cluster.py

Change: Added new CLI-optimized parallel cluster tree implementation
Details:
  • Created dvs_cli_par_ctree class that directly reads from HDF5 storage
  • Implemented direct sequence array handling without intermediate conversions
  • Added support for limiting the number of sequences processed
Files:
  • src/diverse_seq/cluster.py
  • src/diverse_seq/cli.py

Change: Optimized Mash sketch calculation performance
Details:
  • Replaced Python set with NumPy unique function for kmer hash deduplication
  • Changed kmer hash storage from list to NumPy array
  • Optimized heap initialization in mash_sketch function
  • Added numba.njit decorator to mash_sketch for improved performance
Files:
  • src/diverse_seq/distance.py
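
The deduplication change listed for the Mash sketch can be illustrated with a small comparison. This is a sketch under stated assumptions, not the `mash_sketch` function from the PR: the function names are hypothetical, CRC32 from the standard library stands in for whatever k-mer hash DiverseSeq uses, and neither the heap nor the `numba.njit` step is reproduced. It relies on `np.unique` returning its values sorted, so the smallest `sketch_size` hashes are a plain slice.

```python
import zlib

import numpy as np


def kmer_hashes(seq: bytes, k: int) -> np.ndarray:
    """Hash every k-mer of seq with CRC32 (a stand-in hash function)."""
    n = len(seq) - k + 1
    if n <= 0:
        return np.empty(0, dtype=np.uint32)
    return np.fromiter(
        (zlib.crc32(seq[i : i + k]) for i in range(n)),
        dtype=np.uint32,
        count=n,
    )


def sketch_with_set(seq: bytes, k: int, sketch_size: int) -> list[int]:
    """Bottom-k sketch using a Python set for deduplication."""
    hashes = {zlib.crc32(seq[i : i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:sketch_size]


def sketch_with_unique(seq: bytes, k: int, sketch_size: int) -> np.ndarray:
    """Bottom-k sketch using np.unique, which deduplicates and sorts."""
    return np.unique(kmer_hashes(seq, k))[:sketch_size]


def jaccard_estimate(a: np.ndarray, b: np.ndarray, sketch_size: int) -> float:
    """Estimate Jaccard similarity from two bottom-k sketches."""
    merged = np.union1d(a, b)[:sketch_size]  # bottom-k of the combined sketch
    in_both = np.intersect1d(a, b)
    shared = np.intersect1d(merged, in_both).size
    return shared / merged.size if merged.size else 0.0
```

Both sketch functions select the same hashes. The array version keeps the data in a single typed buffer, which is generally the form that compiled code such as a Numba-jitted function works with most easily.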

coveralls commented 3 weeks ago

Pull Request Test Coverage Report for Build 11734163625

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/diverse_seq/cli.py | 4 | 5 | 80.0% |
| src/diverse_seq/cluster.py | 40 | 43 | 93.02% |
| src/diverse_seq/distance.py | 1 | 7 | 14.29% |
| Total | 45 | 55 | 81.82% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/diverse_seq/data_store.py | 2 | 87.82% |
| src/diverse_seq/distance.py | 4 | 83.85% |
| Total | 6 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 11720616749 | -1.0% |
| Covered Lines | 1188 |
| Relevant Lines | 1308 |
