Sheffield-iGEM / syn-zeug

A modern toolbox for synthetic biology
https://sheffield-igem.github.io/syn-zeug/
GNU Affero General Public License v3.0
6 stars 3 forks source link

Implement New Tool: "GC Content" #3

Closed TheLostLambda closed 2 years ago

TheLostLambda commented 2 years ago

What should this tool do? This tool should calculate the percentage of bases in a nucleotide sequence that are either G or C.

Is there an existing reference implementation? See this Rosalind Implementation. Note that this tool does not need to deal with FASTA!

What are the tool's inputs? A DNA or RNA sequence

What is the tool's output? A single percentage represented as an f64

Other Implementation Details Everything needed to implement this should already be available! The .count_elements() method can be used to calculate the number of C's and G's, then a percent can be obtained by dividing by the total sequence length (using .len())

As an additional complexity, supporting IUPAC means that symbols other than A, C, G, T, and U can be present! You'll likely want to count Strong (S) bases as well, as they are either a C or G. See this page for more information. It's possible we can handle IUPAC better than that, so feel free to chip in with any ideas!

TheLostLambda commented 2 years ago

After pondering on this a bit more, I think we should aim for a correct average GC content with IUPAC sequences. Any IUPAC placeholder that could represent a G or a C should be converted into a GC probability.

C = 1.0 G = 1.0 S = 1.0 M = 0.5 K = 0.5 R = 0.5 Y = 0.5 B = 2/3 D = 1/3 H = 1/3 V = 2/3 N = 0.25

TheLostLambda commented 2 years ago

I've had a chat with some of the Rust Bio people and they seem open to implementing this upstream! We should add a gc_iupac() function that handles these non-standard alphabet characters to this module!

Whoever is keen to work on this, let me know and I'll help with the upstreaming process!

TheLostLambda commented 2 years ago

@adam-spencer This is probably the easiest place to start if you are keen on giving this one a go! :) Doesn't need any new web UI either, so should be able to immediately add it to the toolbox!

adam-spencer commented 2 years ago

I've made this tool on the adam-happytimes branch, though I'm wondering if the return value should be formatted as a percentage or a fraction? That is, 56.2 or 0.562.

I'm yet to implement anything in the way of IUPAC stuff, though I was thinking perhaps a secondary function should be used? Using only an f64 doesn't provide any information regarding how much of the result is a 'probable' G or C result. If this isn't necessary please let me know @TheLostLambda.

TheLostLambda commented 2 years ago

I've made this tool on the adam-happytimes branch, though I'm wondering if the return value should be formatted as a percentage or a fraction? That is, 56.2 or 0.562.

On second thought, I do actually think that a ratio makes the most sense to return. That seems to be what the rust-bio version of this function does: https://docs.rs/bio/latest/bio/seq_analysis/gc/fn.gc_content.html

I'm yet to implement anything in the way of IUPAC stuff, though I was thinking perhaps a secondary function should be used?

I think from a biology perspective it makes the most sense just to have a single function here, wrapping up everything in that f64, but I can also see the case for adding a tool down the line that can give a range of GC percent possibilities (a minimum and maximum, in addition to this average).

The branch is looking pretty good! I think adding some IUPAC support + tests and a bit of benchmarking would be great, then feel free to open a PR when you think it's ready.