Loading branch lengths into a PhyloDM object

FinnOD commented 1 year ago

Hi Aaron,

I was intending on teaching myself rust to write something similar but you've done a great job, I'm seeing a 500x speed increase. Thanks!

However for me (~1000 leaves), 95% of that time is spent manipulating python tree objects and loading them into PhyloDM (so 10,000x increase perhaps possible). I'd like to request a python exposed function that loads an array of branch lengths into the rust object. I'd be happy to try make a PR if you could let me know where the branch lengths have to end up, as I don't really understand the internal data structures you're using.

Thanks for any help!

Finn

aaronmussig commented 1 year ago

Hi Finn,

Thanks! I'm glad it's of some use to you. Coincidentally this was my first Rust project, it was very good to learn the language and its nuances. Like many others, I learned to love the language after I was no longer fighting the compiler. I don't want to take away from you learning Rust so I'm happy to guide you as much as you need. If you find you have lost interest or don't have the time I can also implement it.

You're right in that moving from the Python DendroPy object to Rust is pretty slow, it is quicker to load it from a newick string as the light_phylogeny crate supports loading from that format.

Here's as much information as I can think of giving to point you in the right direction, but feel free to ask away if you have any questions.

1. Background on the Rust data structures used

Initially, I designed the tree data structure in the same way you would in Python; a Tree that contains Nodes, where the Nodes have pointers to the parents and children. You'll notice that this isn't what I settled on as simply, it is hell to implement in Rust. I would recommend giving it a go as you will learn a lot about lifetimes in Rust.

Ultimately, I ended up using an Arena Tree. Essentially the Tree struct owns all the nodes, and Node references to parent/children are just indices that need to be mapped to their corresponding Node in the Tree struct. A key thing to note here is that there is no safety in references, if a Node is deleted from the tree you've got to be very careful that all references are removed, hence why there are no methods to prune nodes.

2. Exposing to Python

This is done through PyO3. A key thing to note is that Cargo.toml defines a [features] section with the name python. This is so we don't need to ship the PyO3 and numpy crates when using the package in a Rust environment. However, when it's imported as a Python library, the crate is built with that option in setup.py.

The conditional build of features means that any Rust code with the macro #[cfg(feature = "python")] will be built, otherwise, it's ignored. This is only used in src/lib.rs where the src/python.rs file is conditionally built.

src/python.rs is where we define the Python object that PyO3 will export. After it is built, it can be imported in Python from the .so file using from .pdm import PhyloDM as shown here.

I've done something a little different from what you would normally do here but with good reason. I've wrapped the PyO3 automatically exported class in a Python class in__init__.py, purely to add type hinting, and a @classmethod to instantiate the object from a DendroPy object.

3. The rust data structures

src/pdm.rs contains the main Phylogenetic Distance Matrix (Tree) class, which contains all nodes, and taxa.

src/node.rs contains a single node within the tree, with the index of its parent and children.

Not too much to say about them that isn't already described in the Arena Tree article already linked.

4. Creating a PDM

If this is being done from a newick file then the load_from_newick_path function is called, it's good to check this out to see how it's done.

Otherwise, nodes and branch lengths are manually added using the add_node and add_edge functions. Check out those to see how the internal structures are manipulated.

You will need to call the compute_row_vec function to do some pre-processing of the nodes before creating the distance matrix (automatically called if loading from a newick). This will extract the distances from each Node to it's parent and put them in a non-redundant row vector that is essentially the matrix but each index is mapped using the row_idx_from_mat_coords function.

Finally the matrix function returns the pairwise distance matrix. Alternatively, distances can be obtained through the distance function.

5. Building for testing

If you wish to generate the .so file so you can run the Rust code, you need to run:

python setup.py build_ext --inplace and it will generate the .so file in the phylodm/ directory. Note you will need to run this every time you change the rust code before running it in Python.

To run it in Rust, you will need to build it with all features: cargo build --all-features

To run unit tests in rust: cargo test --all-features.

To run unit tests in Python (note this will also run the Rust unit tests): Note: You will need dendropy, setuptools-rust, setuptools, wheel, and numpy in the environment which you're using for development.

cd /path/to/cloned/directory
export PYTHONPATH='.'
python setup.py build_ext --inplace
python -m unittest discover test/

6. Implementation?

I've left this largely up to you to explore how it could be done, but I could imagine that largely you could get away with manually calling the add_node and add_edge functions by iterating over the distances. Note that these are already exposed to Python, but as you've mentioned there would be a lot less overhead if Rust could do the iterating.

I imagine it would be creating a new method that can be exposed to Python that allows Rust to iterate over the distance vector/matrix you've got, then calling those functions.

I've been thinking about how to formulate this as purely matrix operations, using some form of tensor decomposition but I'm unsure about how to do this, or if it's even a good idea for sparse matrices.

Cheers, Aaron

FinnOD commented 1 year ago

Hi Aaron,

Thank you so much for such a kind write up! Your suggestions and build/test commands were very helpful. I don't know if I learnt much rust as I mostly bashed my way through by copying your code and adding &s and *s whenever the compiler suggested it. But thank you, baby steps right?

Please have a look at my fork here. Not really sure where to go from here re. github and PRs. Also, please note I made a few extraneous modifications like making PDM.n_nodes public and adding a test tree where all the branch lengths are 7. If you're happy with my code style etc. I'd be happy to make another issue and branch for direct newick parsing (from python string not file).

So I would like to submit my PR, but there is something irking me about the implementation. You can see it in the rust tests at ./tests/tree.rs. The total length of the tree can be different even when the DM is the same. I think this is because the root is being given a branch length, but from what I can see in the newick files the root is usually given one anyway.

I hope you can help and that I haven't bungled the project too badly 😌

Cheers, Finn

aaronmussig / PhyloDM