PoonLab / sierra-local

Retrieve HIVdb algorithm as XML and apply locally to HIV sequences
GNU General Public License v3.0
6 stars 3 forks source link

Alignment Program #71

Closed Kanyerezi30 closed 1 year ago

Kanyerezi30 commented 1 year ago

Hello here, From the Stanford University HIV Drug Resistance Database (HIVdb) Sierra web service, looks like the alignment program was changed from NucAmino to PostAlign (Version 3.2 update 2022-07-06). Is there a way of implementing this in the sierra-local and does it impact the final results. Below is the link to the sources of the updates

[https://hivdb.stanford.edu/page/program-updates/#version.2.2.0.update.2017-12-18.code.name.nucamino.](sierra program updates)

ArtPoon commented 1 year ago

Hi @Kanyerezi30 - thanks for the heads up. We will investigate. This might explain some other issues we're working on.

ArtPoon commented 1 year ago

Getting postalign working will probably resolve #69 but it is not clear how to get this program to work (need documentation)

ArtPoon commented 1 year ago

@GopiGugan @WilliamZekaiWang - I think we can get some idea of how to run postalign from this excerpt: https://github.com/hivdb/sierra-core/blob/df95732512776ff5269b192a4592f89b97480f71/src/main/java/edu/stanford/hivdb/sequences/PostAlignAligner.java#L152-L185

ArtPoon commented 1 year ago

Running into problems with relative imports when invoking postalign, look into modifying setup.py script in a fork of that repo

WilliamZekaiWang commented 1 year ago

Seems like the issue with running post align might come from not having cython version 0.29.32 and/or python3 version 3.9. I'll look into this and see if this might be the cause why

ArtPoon commented 1 year ago

Safest way to go about this is to build a virtual environment with the specific versions of Python and modules. Let me know if you need a hand with setting this up.

ArtPoon commented 1 year ago

@WilliamZekaiWang reports that he was able to run postalign, we may need to increase our version requirement for Python to 3.9

ArtPoon commented 1 year ago

We'll need to write a wrapper function for postalign in Python, see nucamino wrapper function for reference: https://github.com/PoonLab/sierra-local/blob/a89b555a103ab5cc25e5ae01c9ab5931b14875c5/sierralocal/nucaminohook.py#L148-L180

ArtPoon commented 1 year ago

since postalign seems to be a proper Python module, however, we could just import it and direct call its functions within the script

ArtPoon commented 1 year ago

@GopiGugan has postalign running on the terminal but it is hanging (may be waiting for expected input from stdin?) @WilliamZekaiWang was only able to run postalign within Python but not on terminal - need to compare differences in system configs with Gopi

ArtPoon commented 1 year ago

@WilliamZekaiWang can you please update this issue with the current results from running postalign?

WilliamZekaiWang commented 1 year ago

I did more testing with postalign and found it could solve issue #69. Using the incorrectly aligned sequences with postalign and comparing it to sierrapy, the total aligned nucleotide count matched with both sequences.

In terms of issue #65 postalign doesn't exactly solve it This is sierrapy mutations field:

          {
            "consensus": "V",
            "position": 35,
            "AAs": "M",
            "isInsertion": false,
            "isDeletion": false,
            "isApobecMutation": false,
            "isApobecDRM": false,
            "isUnusual": false,
            "isSDRM": false,
            "hasStop": false,
            "primaryType": "Other",
            "text": "V35M"
          }

and this is postalign's output:

            {
              "Position": 35,
              "RefCodonText": "GTA",
              "CodonText": "ATG",
              "RefAminoAcidText": "V",
              "AminoAcidText": "M",
              "InsertedCodonsText": "",
              "IsInsertion": false,
              "IsDeletion": false
            }

However, given that they provide the position, codon, and amino acid, I can port over the outputs I've created from nucamino to postalign for the missing fields

WilliamZekaiWang commented 1 year ago

I managed to create a basic function to change the output of post-align to match that of nucAmino so we can integrate it into sierra-local

There are a few differences I found within the outputs of nucAmino and Post-align:

          "AlignedSites": [
            {
              "PosAA": 155,
              "PosNAs": [
                1,
                2,
                3
              ],
              "LengthNA": 3
            },

Whereas nucAmino gives

"AlignedSites": [
            {
                "LengthNA": 3,
                "PosAA": 156,
                "PosNA": 1
            },

post-align:

          "Mutations": [
            {
              "Position": 189,
              "RefCodonText": "GTA",
              "CodonText": "ACA",
              "RefAminoAcidText": "V",
              "AminoAcidText": "T",
              "InsertedCodonsText": "",
              "IsInsertion": false,
              "IsDeletion": false
            },

nucAmino:

        "Mutations": [
            {
                "AminoAcidText": "T",
                "CodonText": "ACA",
                "Control": "...",
                "InsertedAminoAcidsText": "",
                "InsertedCodonsText": "",
                "IsDeletion": false,
                "IsInsertion": false,
                "IsPartial": false,
                "NAPosition": 103,
                "Position": 190,
                "ReferenceText": "V"
            },

Post-align doesn't give a few fields. Of these missing fields, they are not in the sierrapy output for mutations.

WilliamZekaiWang commented 1 year ago

ran into another issue where calling postalign using a subprocess and the sequence:

>AF227710.JF02-8w.B.80
GGGATGGATGGCCCAAGAGTTAAACAATGGCCATTAACAGAAGAAAAAATAAAAGCATTAATAGAAATTTGTACAGAATTGGAAGAGGAAGGAAAAATTACAAAAATTGGGMMTGAAAATMMATACAATACTMCAGTATTTGCCATAAAGAAGAAA...AGTGGTAGATGGAGAAAAATARCAGATTTCAGAGAACTTAATAAGAGAACTCAAGACTTCTGGGAAGTCCAATTAGGAATACCACATCCTGGAGGGTTAAAAAAGAAYAAATCAGTAACAGTACTGGATGTGGGTGATGCATATTTCTCAGTTCCCTTAGATGAAGACTTCAGGAAGTACACTGCATTTACCATACCTAGTATGAACAATGAGACACCAGGGATTAGATATCAGTACAATGTGCTWCCACAGGGATGGAAAGGATCACCAGCAATATTCCAAAGTAGCATGACAAAAATCTTAGAGCCTTTTAGAAAACAAAAYCCAGACATAATCATCTATCAATACGTGGATGATTTGTATGTAGGATCTGACTTAGAAATAGAGCAGCATAGAACAAAAATAGAGGAACTGAGACAACATCTGTTGAGGTGGGGATTTTTTACACCAGACCAAAAACATCAGAAAGAACCCCCATTCCATTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTATAGTGCTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGAAGTTAGTGGGAAAATTGAATTGG

Another sequence which has the same error"

>AF009410.92RW026.C.19
CCAATTAGTCCCATTGAAACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAGGTTAAACAATGGCCATTGACAGAAGAAAAAATAAAAGCATTAACAGAAATTTGTAGAGAAATGGAAAAGGAAGGAAAAATTTCAAAAATTGGGCCTGAAAATCCATATAACACTCCAGTATTTGCCATAAAAAAGAAGGACAGTACTAAGTGGAGAAAATTAGTAGACTTCAGGGAACTCAATAAAAGAACTCAAGACTTTTGGGAAGTTCAGTTAGGGATACCGCAC.CAGCAGGTCTAAAAAAGAAGAAATCAGTAACAGTACTAGATGTGGGGGATGCATATTTCTCAGTTCCTTTAGATGAAGG.TTTAGGAA.TATACTGCATTCACCATACCTAGTATAAACAATGAAACACCAGGGATTAGATATCAGTATAATGTGCTTCCACAGGGATGGAAAGGATCACCATCAATATTCCAGAGTAGCATGACAAAAATTTTAGAGCCCTTTAGGGCACAAAACCCAGAAATGGTTATCTATCAATATATGGATGACTTGTATGTAGGATCTGACTTAGAAATAGGGCAACATAGAGCAAAAATAGAGGAGTTAAGAGGACATTTATTGAAGTGGGGATTTACCACACCAGACAAGAAACATCAGAAAGAACCCCCATTTCTTTGGATGGGGTATGAACTCCATCCTGACAAATGGACAGTACAGCCTATACA.CTGCCAGAGAAGGATAGCTGGACTGTCAATGATATACAGAAGTTAGTGGGAAAATTAAACTGGGCAAGTCAGATTTACCCAGGGATTAAGGTAAAGCAACTGTGTAAACTCCTTAGGGGAGCCAAAGCACTAACAGAAATAGTATCACTGACTGAA

This would result in a subprocess error: returned non-exit status 1. Not too sure if the issue lies with post-align or the config file, as other sequences in the automatically generated RT file provided no errors, since this sequence passes through sierrapy fine

WilliamZekaiWang commented 1 year ago

the second sequence (AF009410.92RW026.C.19) is also one of the sequences that nucAmino incorrectly gives the NA sequence for listed in #69

ArtPoon commented 1 year ago

Try re-running these two sequences with all . characters removed, to see if we still get the same error

WilliamZekaiWang commented 1 year ago

The error above is fixed by removing all illegal characters.

WilliamZekaiWang commented 1 year ago

The mutation position for postalign seem to be consistently 1 lower than that of sierrapy

post:

"mutations": [
          {
            "consensus": "V",
            "position": 34,
            "AAs": "T",
            "isInsertion": false,
            "isDeletion": false,
            "isApobecDRM": false
          },
          {
            "consensus": "E",
            "position": 35,
            "AAs": "A",
            "isInsertion": false,
            "isDeletion": false,
            "isApobecDRM": false
          },

sierrapy

        "mutations": [
          {
            "consensus": "V",
            "position": 35,
            "AAs": "T",
            "isInsertion": false,
            "isDeletion": false,
            "isApobecMutation": false,
            "isApobecDRM": false,
            "isUnusual": false,
            "isSDRM": false,
            "hasStop": false,
            "primaryType": "Other",
            "text": "V35T"
          },
          {
            "consensus": "E",
            "position": 36,
            "AAs": "A",
            "isInsertion": false,
            "isDeletion": false,
            "isApobecMutation": false,
            "isApobecDRM": false,
            "isUnusual": false,
            "isSDRM": false,
            "hasStop": false,
            "primaryType": "Other",
            "text": "E36A"
          },

Currently, adding 1 to position causes everything to match sierrapy properly. Will keep looking for a more sound solution

ArtPoon commented 1 year ago

This is probably a Python 0-index issue (coordinates are being reported in 1-index)

WilliamZekaiWang commented 1 year ago

I've ported over the functions I've made in #65. Out of the 1326, different mutations, the functions I've added incorrectly label 83 of them. These incorrectly labelled mutations don't seem to be random, they are consistently being mislabeled at some positions (ie. position 68 amino acid G is in the SDRM_HIV1.csv file and should be labelled as isSDRM: True, but sierrapy reports mutations resulting in 68G as isSDRM: False)

more specifically

perhaps there is a more updated file sierrapy is using that I cannot find?

WilliamZekaiWang commented 1 year ago

seems like I was using an old csv file to detect isSDRM. I found a newer file on Hivdb/facts and it fixed nearly all the issues. The isApobecMutation seemed to be an issue with my function; I fixed it and the majority of the misreports were fixed.

These files were updated in the last 6 months, would it be more appropriate to keep the files locally or use another submodule?

The remaining errors, 14 to be exact, are all from the amino acid being reported as X, as reported in #65

ArtPoon commented 1 year ago

@Kanyerezi30 please try pulling a new version from the updated master branch and see if this resolves your issue. Sorry for the long delay! Closing for now.