Issue 69 2 - Githubissues

manulera commented 1 month ago

Hi @tnrich, apologies in advance for the long issue, but there was a bunch of things.

This is the sequence-utils part of #69. There are three commits:

dcdaf447171cbb0ec0cbfdac7c5eb535314cc0b7: re-factor of getAminoAcidDataForEachBaseOfDna.js. It removes special cases, and does everything in the same loop (before the X aminoacid corresponding to truncated codons was added separately and differently for forward and reverse translations). I think it makes the code a bit clearer, and I thought doing a single loop was necessary to implement the next feature. I have also splitted out some pre-processing into a separate function to have only the logic in getAminoAcidDataForEachBaseOfDna.
878ce24fd3fcd27d9e8eb39a9e26f47a59678817: added a few extra tests for getAminoAcidDataForEachBaseOfDna in the protein case. Some are skipped (see below for questions on this).
6c451353b8cead342ced0ad64d3336075165a3d3: actual implementation of the feature. Tests are still missing because I wanted to double-check with you a few things before writing them (see below).

Some tests fail, but I am not sure they are related to the feature I implemented. Some don't seem to (mainMenu fail for instance).

Below the questions I had

Unrelated to the implementation

optionalSubrangeRange is a feature-like object (with start and end) passed to getAminoAcidDataForEachBaseOfDna to indicate the subset of the sequence that would be translated. Should this be allowed at all when isProtein === true? The reason why I ask is because now the locations of that feature are used to describe the exons (joined feature CDS-like) and that would be quite meaningless for a protein sequence. Not sure if it's worth to allow optionalSubrangeRange in the first place. The old code ignores the range for protein sequences (I have added some skipped tests to getAminoAcidDataForEachBaseOfDna.test.js, they are self-explaining). In my implementation, I have kept it that way, but it's probably best to return an error instead, since people may think that passing the range works for protein sequences.
I found the meaning of the includeEndEdge argument a bit confusing in the function isPositionWithinRange. For instance, with the name of the argument, I would have expected isPositionWithinRange(3,{start:1,end:2}, 4, true, true) to be false, and it's true. But isPositionWithinRange(0,{start:1,end:2}, 4, true, true) is false. I think it would be good to use a more self-explaining argument name or explain more in the docstring what exactly this argument represents.

Related to the implementation

Right now, the output of getAminoAcidDataForEachBaseOfDna is an array that contains the aminoacid-level info for each basepair. For instance, for the bases ATG of the ATG of a CDS, the corresponding array is (where getAA("atg") returns an object with aminoacid properties):

aaData = getAminoAcidDataForEachBaseOfDna("atg", true);
    assert.deepEqual(aaData, [
      {
        aminoAcid: getAA("atg"),
        positionInCodon: 0,
        aminoAcidIndex: 0,
        sequenceIndex: 0,
        codonRange: {
          start: 0,
          end: 2
        },
        fullCodon: true
      },
      {
        aminoAcid: getAA("atg"),
        positionInCodon: 1,
        aminoAcidIndex: 0,
        sequenceIndex: 1,
        codonRange: {
          start: 0,
          end: 2
        },
        fullCodon: true
      },
      {
        aminoAcid: getAA("atg"),
        positionInCodon: 2,
        aminoAcidIndex: 0,
        sequenceIndex: 2,
        codonRange: {
          start: 0,
          end: 2
        },
        fullCodon: true
      }
    ]);

The question is what to do when a basepair is part of an intron. The easiest thing (that gives the right representation in the UI) is to insert some kind of "mock" element in the array. Currently I am inserting something like this:

{
    aminoAcid: null,
    positionInCodon: null,
    aminoAcidIndex: null,
    sequenceIndex: 2,
    codonRange: null,
    fullCodon: null
}

The function expects to have an array of the same length as the input protein sequence, so I think something in these lines makes the most sense. It also gives good rendering (see below). What do you think?

Representation

As it is, it already gives a better representation that before (before the wrong amino acid sequence was displayed, because it was not taking into account the introns), but there are some awkward things.

You can see all the examples below in example.zip. You can reverse the sequence and rotate it and you can see that even if the features span the origin or inverted, they behave the same way.

If the intron does not split an aminoacid, it looks good:

Screenshot 2024-06-04 at 10 54 17

If the intron splits the aminoacid into two, the aminoacid is shown like this (displaced towards the side with most bases).

There is also the case where the aminoacid is splitted into three exons, because there is an exon with a single basepair.

Improvements to representation?

For amino acids that are splitted into several exons, I think it would be better to use the representation that is used when the aminoacid is splitted between two lines in the sequence map (bottom image), so there would be a fragment of the aminoacid symbol in each bit of the intron. I can give a go at this if you give me some guidance.

I think it would also be good to have a thin line joining the translation bits that correspond to the same CDS, like the thin line that connects the different positions in a feature with multiple parts, something like below. I can give a go at this if you give me some advice as well.

Screenshot 2024-06-04 at 10 54 17

tnrich commented 1 month ago

Hi @manulera nice work! Looking good so far!

To get the AA's truncating in a better way, I'd look into modifying the logic around isTruncatedEnd/isTruncatedStart:

I've added a bit of code to start you down the right path with that on this branch https://github.com/TeselaGen/tg-oss/compare/tnwFiddleWithAaGaps

Here's what the example seq looks like rn there:

I think with a bit more trial and error you should be able to get it rendering pretty nicely!

Good luck and lemme know if you need anything else!

tnrich commented 1 month ago

Closing as this has been merged and published :)

tnrich commented 1 month ago

I found the meaning of the includeEndEdge argument a bit confusing in the function isPositionWithinRange. For instance, with the name of the argument, I would have expected isPositionWithinRange(3,{start:1,end:2}, 4, true, true) to be false, and it's true. But isPositionWithinRange(0,{start:1,end:2}, 4, true, true) is false. I think it would be good to use a more self-explaining argument name or explain more in the docstring what exactly this argument represents.

This is working as expected AFAICT with range indices being 0-based.

 G T A C C  <-- sequence
 0 1 2 3 4  <-- sequence indices
   r r      <-- range
0 1 2 3 4 5  <-- position indices

manulera commented 1 month ago

Oh, I see, I didn't realise that position and index referred to different things. It makes sense now. Thanks for the explanation.

By the way, I meant to say it before, but it has been quite manageable navigating the codebase and contributing despite it being so big, so congrats for the great repo!

tnrich commented 1 month ago

Thanks @manulera ! Glad that you're helping in such a big way!

TeselaGen / tg-oss

Issue 69 2 #76

Unrelated to the implementation

Related to the implementation

Representation

Improvements to representation?