Open dasunpubudumal opened 3 months ago
Here are some optimisations that I have uncovered:
get_uniques
function can be optimised by introducing a std::collections::HashSet
. It should not invoke to_owned()
on new scaffold strings, as it will use String
's Clone
trait to clone the value, which is undesirable.It could be better if we can update the function as follows:
pub fn get_uniques(tpf_list: &[Tpf]) -> Vec<&String> {
let mut uniques = HashSet::new();
for tpf in tpf_list {
uniques.insert(&tpf.new_scaffold);
}
uniques.into_iter().collect()
}
This will not copy values in-memory, as it will only use Deref
trait when trying to insert values to the set. Note that this changes the function signature, and it demands a refactoring where the function is used.
subset_vec_tpf
function could be optimised by using Rust's filter
function provided to iterators.pub fn subset_vec_tpf<'a>(
tpf: &'a [Tpf],
fasta: (&std::string::String, &usize),
) -> Vec<&'a Tpf> {
//
// Subset the Vec<TPF> based on a search through the fasta
//
tpf.iter().filter(|&i| i.ori_scaffold == *fasta.0).collect()
}
src/tpf_fasta.rs:166
, both File::Create
and OpenOptions
are used to create an open a file. Instead, the following could be used (without using File::Create
):let file = OpenOptions::new()
.read(true)
.write(true)
.create(true)
.open(output);
OpenOptions
docs suggests:
This builder exposes the ability to configure how a File is opened and what operations are permitted on the open file. The
File::open
andFile::create
methods are aliases for commonly used options using this builder.
stacker::maybe_grow
in tpf_fasta
needs to be carefully considered.From maybe_grow
function documentation:
Once a program has reached the end of its stack, a temporary stack on the heap is allocated and is switched to for the duration of a closure.
This function is intended to be called at manually instrumented points in a program where recursion is known to happen quite a bit. This function will check to see if we're within
red_zone
bytes of the end of the stack, and if so it will allocate a new stack of at least stack_size bytes. The closuref
is guaranteed to run on a stack with at least red_zone bytes, and it will be run on the current stack if there's space available.
There's always box-types (Rust's equivalent to C++ smart pointers) which are there to allocate memory in the heap. Allocating extra stack space in the heap is a little unconventional. Stack is for allocating types that the compiler can identify the size of at compile time. Allocating something in the heap (e.g. recursive data structures like linked lists) incurs an additional overhead of a system signal to call a memory allocator but that is sort of a requirement to facilitate dynamically sized types such as vectors and strings.
curate_fasta
function.
Now that we have started writing tests in https://github.com/Rust-Wellcome/FasMan/issues/22, we could start adding some optimisations.
We can start optimisations with
src/tpf_fasta.rs
, as the file is being covered by tests, and is likely to be fully covered fairly soon.I will update this issue with comments when I discover any other optimisations possible.