Rust-Wellcome / FasMan

A re-write (+ extras) of Python scripts, used in Tree of Life, into a single Rust script.
3 stars 1 forks source link

Optimisations #32

Open dasunpubudumal opened 3 months ago

dasunpubudumal commented 3 months ago

Now that we have started writing tests in https://github.com/Rust-Wellcome/FasMan/issues/22, we could start adding some optimisations.

We can start optimisations with src/tpf_fasta.rs, as the file is being covered by tests, and is likely to be fully covered fairly soon.

I will update this issue with comments when I discover any other optimisations possible.

dasunpubudumal commented 3 months ago

Here are some optimisations that I have uncovered:

  1. get_uniques function can be optimised by introducing a std::collections::HashSet. It should not invoke to_owned() on new scaffold strings, as it will use String's Clone trait to clone the value, which is undesirable.

It could be better if we can update the function as follows:

pub fn get_uniques(tpf_list: &[Tpf]) -> Vec<&String> {
    let mut uniques = HashSet::new();
    for tpf in tpf_list {
        uniques.insert(&tpf.new_scaffold);
    }
    uniques.into_iter().collect()
 }

This will not copy values in-memory, as it will only use Deref trait when trying to insert values to the set. Note that this changes the function signature, and it demands a refactoring where the function is used.

  1. subset_vec_tpf function could be optimised by using Rust's filter function provided to iterators.
pub fn subset_vec_tpf<'a>(
    tpf: &'a [Tpf],
    fasta: (&std::string::String, &usize),
) -> Vec<&'a Tpf> {
    //
    // Subset the Vec<TPF> based on a search through the fasta
    //
    tpf.iter().filter(|&i| i.ori_scaffold == *fasta.0).collect()
}
dasunpubudumal commented 3 months ago
  1. File logic can be refactored. In src/tpf_fasta.rs:166, both File::Create and OpenOptions are used to create an open a file. Instead, the following could be used (without using File::Create):
let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .open(output);

OpenOptions docs suggests:

This builder exposes the ability to configure how a File is opened and what operations are permitted on the open file. The File::open and File::create methods are aliases for commonly used options using this builder.

dasunpubudumal commented 2 months ago
  1. The usage of stacker::maybe_grow in tpf_fasta needs to be carefully considered.

From maybe_grow function documentation:

Once a program has reached the end of its stack, a temporary stack on the heap is allocated and is switched to for the duration of a closure.

This function is intended to be called at manually instrumented points in a program where recursion is known to happen quite a bit. This function will check to see if we're within red_zone bytes of the end of the stack, and if so it will allocate a new stack of at least stack_size bytes. The closure f is guaranteed to run on a stack with at least red_zone bytes, and it will be run on the current stack if there's space available.

There's always box-types (Rust's equivalent to C++ smart pointers) which are there to allocate memory in the heap. Allocating extra stack space in the heap is a little unconventional. Stack is for allocating types that the compiler can identify the size of at compile time. Allocating something in the heap (e.g. recursive data structures like linked lists) incurs an additional overhead of a system signal to call a memory allocator but that is sort of a requirement to facilitate dynamically sized types such as vectors and strings.

dasunpubudumal commented 2 months ago
  1. Updates to the logic in curate_fasta function.