Closed TheLostLambda closed 2 years ago
After pondering on this a bit more, I think we should aim for a correct average GC content with IUPAC sequences. Any IUPAC placeholder that could represent a G or a C should be converted into a GC probability.
C = 1.0
G = 1.0
S = 1.0
M = 0.5
K = 0.5
R = 0.5
Y = 0.5
B = 2/3
D = 1/3
H = 1/3
V = 2/3
N = 0.25
I've had a chat with some of the Rust Bio people and they seem open to implementing this upstream! We should add a gc_iupac()
function that handles these non-standard alphabet characters to this module!
Whoever is keen to work on this, let me know and I'll help with the upstreaming process!
@adam-spencer This is probably the easiest place to start if you are keen on giving this one a go! :) Doesn't need any new web UI either, so should be able to immediately add it to the toolbox!
I've made this tool on the adam-happytimes
branch, though I'm wondering if the return value should be formatted as a percentage or a fraction? That is, 56.2
or 0.562
.
I'm yet to implement anything in the way of IUPAC stuff, though I was thinking perhaps a secondary function should be used? Using only an f64
doesn't provide any information regarding how much of the result is a 'probable' G or C result. If this isn't necessary please let me know @TheLostLambda.
I've made this tool on the adam-happytimes branch, though I'm wondering if the return value should be formatted as a percentage or a fraction? That is, 56.2 or 0.562.
On second thought, I do actually think that a ratio makes the most sense to return. That seems to be what the rust-bio version of this function does: https://docs.rs/bio/latest/bio/seq_analysis/gc/fn.gc_content.html
I'm yet to implement anything in the way of IUPAC stuff, though I was thinking perhaps a secondary function should be used?
I think from a biology perspective it makes the most sense just to have a single function here, wrapping up everything in that f64
, but I can also see the case for adding a tool down the line that can give a range of GC percent possibilities (a minimum and maximum, in addition to this average).
The branch is looking pretty good! I think adding some IUPAC support + tests and a bit of benchmarking would be great, then feel free to open a PR when you think it's ready.
What should this tool do? This tool should calculate the percentage of bases in a nucleotide sequence that are either G or C.
Is there an existing reference implementation? See this Rosalind Implementation. Note that this tool does not need to deal with FASTA!
What are the tool's inputs? A DNA or RNA sequence
What is the tool's output? A single percentage represented as an
f64
Other Implementation Details Everything needed to implement this should already be available! The
.count_elements()
method can be used to calculate the number of C's and G's, then a percent can be obtained by dividing by the total sequence length (using.len()
)As an additional complexity, supporting IUPAC means that symbols other than A, C, G, T, and U can be present! You'll likely want to count Strong (S) bases as well, as they are either a C or G. See this page for more information. It's possible we can handle IUPAC better than that, so feel free to chip in with any ideas!