Closed manulera closed 1 month ago
Hi @manulera nice work! Looking good so far!
To get the AA's truncating in a better way, I'd look into modifying the logic around isTruncatedEnd/isTruncatedStart:
I've added a bit of code to start you down the right path with that on this branch https://github.com/TeselaGen/tg-oss/compare/tnwFiddleWithAaGaps
Here's what the example seq looks like rn there:
I think with a bit more trial and error you should be able to get it rendering pretty nicely!
Good luck and lemme know if you need anything else!
Closing as this has been merged and published :)
I found the meaning of the includeEndEdge argument a bit confusing in the function isPositionWithinRange. For instance, with the name of the argument, I would have expected isPositionWithinRange(3,{start:1,end:2}, 4, true, true) to be false, and it's true. But isPositionWithinRange(0,{start:1,end:2}, 4, true, true) is false. I think it would be good to use a more self-explaining argument name or explain more in the docstring what exactly this argument represents.
This is working as expected AFAICT with range indices being 0-based.
G T A C C <-- sequence
0 1 2 3 4 <-- sequence indices
r r <-- range
0 1 2 3 4 5 <-- position indices
Oh, I see, I didn't realise that position and index referred to different things. It makes sense now. Thanks for the explanation.
By the way, I meant to say it before, but it has been quite manageable navigating the codebase and contributing despite it being so big, so congrats for the great repo!
Thanks @manulera ! Glad that you're helping in such a big way!
Hi @tnrich, apologies in advance for the long issue, but there was a bunch of things.
This is the
sequence-utils
part of #69. There are three commits:X
aminoacid corresponding to truncated codons was added separately and differently for forward and reverse translations). I think it makes the code a bit clearer, and I thought doing a single loop was necessary to implement the next feature. I have also splitted out some pre-processing into a separate function to have only the logic ingetAminoAcidDataForEachBaseOfDna
.getAminoAcidDataForEachBaseOfDna
in the protein case. Some are skipped (see below for questions on this).Some tests fail, but I am not sure they are related to the feature I implemented. Some don't seem to (mainMenu fail for instance).
Below the questions I had
Unrelated to the implementation
optionalSubrangeRange
is a feature-like object (withstart
andend
) passed togetAminoAcidDataForEachBaseOfDna
to indicate the subset of the sequence that would be translated. Should this be allowed at all whenisProtein === true
? The reason why I ask is because now thelocations
of that feature are used to describe the exons (joined feature CDS-like) and that would be quite meaningless for a protein sequence. Not sure if it's worth to allowoptionalSubrangeRange
in the first place. The old code ignores the range for protein sequences (I have added some skipped tests togetAminoAcidDataForEachBaseOfDna.test.js
, they are self-explaining). In my implementation, I have kept it that way, but it's probably best to return an error instead, since people may think that passing the range works for protein sequences.I found the meaning of the
includeEndEdge
argument a bit confusing in the functionisPositionWithinRange
. For instance, with the name of the argument, I would have expectedisPositionWithinRange(3,{start:1,end:2}, 4, true, true)
to be false, and it's true. ButisPositionWithinRange(0,{start:1,end:2}, 4, true, true)
is false. I think it would be good to use a more self-explaining argument name or explain more in the docstring what exactly this argument represents.Related to the implementation
Right now, the output of
getAminoAcidDataForEachBaseOfDna
is an array that contains the aminoacid-level info for each basepair. For instance, for the basesATG
of theATG
of a CDS, the corresponding array is (wheregetAA("atg")
returns an object with aminoacid properties):The question is what to do when a basepair is part of an intron. The easiest thing (that gives the right representation in the UI) is to insert some kind of "mock" element in the array. Currently I am inserting something like this:
The function expects to have an array of the same length as the input protein sequence, so I think something in these lines makes the most sense. It also gives good rendering (see below). What do you think?
Representation
As it is, it already gives a better representation that before (before the wrong amino acid sequence was displayed, because it was not taking into account the introns), but there are some awkward things.
You can see all the examples below in example.zip. You can reverse the sequence and rotate it and you can see that even if the features span the origin or inverted, they behave the same way.
If the intron does not split an aminoacid, it looks good:
If the intron splits the aminoacid into two, the aminoacid is shown like this (displaced towards the side with most bases).
There is also the case where the aminoacid is splitted into three exons, because there is an exon with a single basepair.
Improvements to representation?
For amino acids that are splitted into several exons, I think it would be better to use the representation that is used when the aminoacid is splitted between two lines in the sequence map (bottom image), so there would be a fragment of the aminoacid symbol in each bit of the intron. I can give a go at this if you give me some guidance.
I think it would also be good to have a thin line joining the translation bits that correspond to the same CDS, like the thin line that connects the different
positions
in a feature with multiple parts, something like below. I can give a go at this if you give me some advice as well.