PoonLab / sierra-local

Retrieve HIVdb algorithm as XML and apply locally to HIV sequences
GNU General Public License v3.0
6 stars 4 forks source link

Sierra-local output file does not match with Sierrapy #65

Closed GopiGugan closed 1 year ago

GopiGugan commented 2 years ago

Output JSON files from both sierra-local and sierrapy :

output_files.zip

Main differences:

1) Mutation has extra isApobecMutation, isUnusual, isSDRM, hasStop, primaryType and text fields

sierra-local sierrapy
```json "mutations": [ { "consensus": "E", "position": 6, "AAs": "D", "isInsertion": false, "isDeletion": false, "isApobecDRM": false }, ] ``` ```json "mutations": [ { "consensus": "E", "position": 6, "AAs": "D", "isInsertion": false, "isDeletion": false, "isApobecMutation": false, "isApobecDRM": false, "isUnusual": false, "isSDRM": false, "hasStop": false, "primaryType": "Other", "text": "E6D" }, ] ```
  1. Addition of the level field
sierra-local sierrapy
```json "drugScores": [ { "drugClass": { "name": "NRTI" }, "drug": { "name": "ABC", "displayAbbr": "ABC" }, "score": 0.0, "partialScores": [], "text": "Susceptible" }, ] ``` ```json "drugScores": [ { "drugClass": { "name": "NRTI" }, "drug": { "name": "ABC", "displayAbbr": "ABC" }, "score": 0.0, "level": 1, "partialScores": [], "text": "Susceptible" }, ] ```
  1. Addition of the SDRMs, alignedNAs, alignedAAs, prettyPairwise fields
  "SDRMs": [],
  "alignedNAs": "CCAATTAGTCCTATTGACACTGTACCAGTAACATTAAAGCCAGGAATGGATGGACCAAAGGTTAAACAGTGGCCATTGACAGAAGAAAAAATAAAAGCATTAACAGAAATTTGTAAAGAGATGGAAGAGGAAGGAAAAATCTCAAAAATTGGGCCTGAAAATCCATACAATACTCCAGTATTTGCTATAAAGAAAAAGGACAGCACCAAATGGAGGAAATTAGTAGATTTCAGAGAGCTCAATAAAAGAACTCAGGACTTTTGGGAAGTTCAATTAGGAATACCGCATCCAGCAGGTTTAAAAAAGAAAAAATCAGTAACAGTACTAGATGTGGGAGATGCATATTTTTCAGTTCCTTTAGATGAAAGCTTTAGAAAGTATACTGCATTCACCATACCTAGTATAAACAATGAGACACCAGGAATCAGATATCAGTACAATGTGCTGCCACAGGGATGGAAAGGATCACCGGCAATATTCCAGAGTAGCATGACAAAAATCTTAGAGCCCTTTAGAATAAAAAATCCAGAAATGGTTATCTATCAATACAAGGATGACTTGTATGTAGGATCTGATTTAGAAATAGGGCAGCACAGAACAAAAATAGAGGAGCTAAGAGCTCATCTATTGAGCTGGGGATTTACTACACCAGACAAAAAGCATCAGAAGGAACCTCCATTCCTTTGGATGGGATATGAACTCCATCCTGACAGATGGACAGTCCAGCCTATAGAACTGCCAGAAAAAGACAGCTGGACTGTCAATGATATACAGAAATTAGTGGGAAAACTAAATTGGGCAAGTCAAATTTATGCAGGGATTAAGGTAAAGCAACTGTGTAAACTCCTCAGGGGAGCTAAAGCACTAACAGACATAGTACCACTGACTGAAGAAGCAGAATTAGAGTTGGCAGAGAACAGGGAGATTCTAAAAACCCCTGTGCATGGAGTATATTATGACCCATCAAAAGACTTAGTAGCAGAAGTACAGAAACAAGGGCAGGACCAATGGACATATCAAATTTATCAAGAGCCATTTAAAAATCTAAAAACAGGAAAATATGCCAGAAGAGGGTCTGCTCACACTAATGATGTAAGACAATTAACAGAAGTGGTGCAAAAAGTAGCCACAGAAAGCATAGTAATATGGGGAAAGACCCCTAAATTTAGACTACCCATACAAAGAGAAACATGGGAAACATGGTGGATGGAGTATTGGCAGGCTACCTGGATTCCTGAATGGGAGTTTGTTAATACCCCTCCTCTAGTAAAATTATGGTACCAATTAGAAAAAGACCCCATAGTAGGAGCAGAGACTTTCTATGTAGATGGGGCAGCTAGTAGGGAGACTAAGCTAGGAAAAGCAGGGTATGTCACTGACAGAGGAAGACAAAAGGTAGTTTCCCTAACTGAGACAACAAATCAAAAGACTGAATTACATGCGATCCATTTAGCCTTGCAGGATTCAGGATCAGAAGTAAATATAGTAACAGACTCACAATATGCATTAGGAATCATTCAGGCACAACCAGACAGGAGTGAATCAGAAGTAGTCAACCAAATAATAGAGGAGCTAATAAAAAAGGAGAAAGTCTACCTGTCATGGGTACCAGCACACAAGGGGATTGGAGGAAATGAACAAGTAGATAAATTAGTCAGTTCAGGAATCAGGAAGGTGCTA",
  "alignedAAs": "PISPIDTVPVTLKPGMDGPKVKQWPLTEEKIKALTEICKEMEEEGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDESFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRIKNPEMVIYQYKDDLYVGSDLEIGQHRTKIEELRAHLLSWGFTTPDKKHQKEPPFLWMGYELHPDRWTVQPIELPEKDSWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRGAKALTDIVPLTEEAELELAENREILKTPVHGVYYDPSKDLVAEVQKQGQDQWTYQIYQEPFKNLKTGKYARRGSAHTNDVRQLTEVVQKVATESIVIWGKTPKFRLPIQRETWETWWMEYWQATWIPEWEFVNTPPLVKLWYQLEKDPIVGAETFYVDGAASRETKLGKAGYVTDRGRQKVVSLTETTNQKTELHAIHLALQDSGSEVNIVTDSQYALGIIQAQPDRSESEVVNQIIEELIKKEKVYLSWVPAHKGIGGNEQVDKLVSSGIRKVL",
  "prettyPairwise": {
    "positionLine": [
    ],
    "refAALine": [
    ],

    "alignedNAsLine": [
    ],
    "mutationLine": [
    ]
GopiGugan commented 2 years ago

Use timing.py to generate sierrapy output file

WilliamZekaiWang commented 2 years ago

Potential explanations for the missing fields, in regards to mutation output, found through analyzing files under: sierra-core/src/main/java/edu/stanford/hivdb/mutations/AAmutations.java and hivdb/hivfacts/data.

WilliamZekaiWang commented 2 years ago

Comparing results to sierrapy output with the test fasta file with JSON files in hivdb

isApobecMutation when true, always matches with a mutation listed in apobec.json

isSDRM is in the same case with isApobecMutation as it always matched with sdrms_hiv

primaryType seems to hold true too with mutation-type-pairs.json

isUnusual was consistent with the JSON files, and the CSV file. If a singular AA matched with the CSV file for True it would return True.

level for drugs is determined off of the calculated score, where on the HIVdb website it summarizes states that

WilliamZekaiWang commented 2 years ago
The ordering of Amino Acids in mutations for sierrapy and sierra-local isn't always consistent in the order they are written Sierrapy Sierra-Local
{
"consensus": "D",
"position": 185,
"AAs": "DE",
"isInsertion": false,
"isDeletion": false,
"isApobecMutation": false,
"isApobecDRM": false,
"isUnusual": true,
"isSDRM": false,
"hasStop": false,
"primaryType": "Other",
"text": "D185DE"
},
{
"consensus": "D",
"position": 185,
"AAs": "ED",
"isInsertion": false,
"isDeletion": false,
"isApobecDRM": false,
"text": "D185ED"
},

Where in sierra-local it states ED instead of DE

WilliamZekaiWang commented 2 years ago

Sorting the AAs seemed to fix the issue above.

All mutations fields seem to work except isUnusual. Going to try with the CSV file on the hivdb download page

WilliamZekaiWang commented 2 years ago

isUnusual seems to work with the csv file Adding SDRMs, alignedNAs, alignedAAs, prettyPairwise, seems as they are just outputting already analyzed data

ArtPoon commented 2 years ago

Converting CSV into a nested dictionary is always faster than linear search:

import csv
rx = {}
reader = csv.DictReader(open("rx-all_subtype-all.csv", encoding='utf-8-sig'))

for row in reader:
    gene = row['gene']
    pos = row['position']
    aa = row['aa']
    unusual = row['isUnusual']
    if gene not in rx:
        rx.update({gene: {}})
    if pos not in rx[gene]:
        rx[gene].update({pos: {}})
    rx[gene][pos].update({aa: unusual})
WilliamZekaiWang commented 1 year ago

All fields have been added and seem to be working. positionLine was created by taking the first and last AA positions and making a list with all values in between. This is working with the RT.file.

primaryType detects when the mutation isn't just "other" when compared to the associated file in HIVfacts but returns an incorrect response for which drug.

There also seem to be more fields missing

Sierralocal Sierrapy
``` { "inputSequence": { "header": "U54771.CM240.CRF01_AE.0" }, "subtypeText": "", "validationResults": [], ``` ``` [ { "inputSequence": { "header": "U54771.CM240.CRF01_AE.0", "SHA512": "873aa696499f3eaba28a2e39853afbe337373222913d13d4030d2b2cfb8bdf7acb81b98fab1019e48d9daea263e78a7f1a685eac42bc337c38ebc8a92e019325" }, "strain": { "name": "HIV1" }, "subtypeText": "CRF01_AE (0.65%)", "validationResults": [ { "level": "NOTE", "message": "There is one unusual mutation at a drug-resistance position in RT: M184K." } ], ```

Addition of the SHA512 , strain and name fields

Sierralocal also does not seem to output the same validationResults with the appropriate message and level

WilliamZekaiWang commented 1 year ago

I was checking with HIV2A/B mutations rather than HIV1 for validationResults, it seems to be fine

WilliamZekaiWang commented 1 year ago

When there is an validationResults output for sierrapy with the statement "... positions were not sequenced or aligned", sierralocal incorrectly assumes one extra AA

<     "validationResults": [],
---
>     "validationResults": [
>       {
>         "level": "WARNING",
>         "message": "203 positions were not sequenced or aligned: RT 358-560. However, none is at drug-resistance position."
>       }
>     ],
5c10
<         "lastAA": 358,
---
>         "lastAA": 357,
8,9c13,16
<           "length": null
<         },
---
>           "length": 560
>         },
>         "mutations": [
>           {
WilliamZekaiWang commented 1 year ago

sha512 output was the sha512 of the given NA sequence

ArtPoon commented 1 year ago

Please manually compare the raw sequences that are reported to have different lengths. Is the extra character due to an invisible Unicode symbol?

ArtPoon commented 1 year ago

For classifying input sequences into HIV-1 versus HIV-2, we might need to use their pairwise alignment method (nucamino) to align each input sequence to a consensus genome sequence of each "strain" and count differences for a crude distance measure, and classify accordingly. Let's put this off as a separate issue.

ArtPoon commented 1 year ago

Make validation a separate issue.

WilliamZekaiWang commented 1 year ago

an issue with AAs output in mutations from sierralocal and sierrapy. Sequences are the sequences from the test RT file

Sequence: AF090509.260Pt28BC.B.12

Sierra Local {'consensus': 'T', 'position': 27, 'AAs': 'X',

Sierrapy {'consensus': 'T', 'position': 27, 'AAs': 'IKMNRST',

Sequence: AF021275.JIDP16W0.B.15

Sierra local {'consensus': 'K', 'position': 220, 'AAs': 'X'

Sierrapy  {'consensus': 'K', 'position': 220, 'AAs': '*ACDEFGHIKLMNPQRSTVWY'

Sequence: AF021280.JIDP7W0.B.16

Sierra local {'consensus': 'K', 'position': 219, 'AAs': 'X'

Sierrapy  {'consensus': 'K', 'position': 219, 'AAs': 'IKMNRST'

There are more like these. However, the text field shows (for AF090509.260Pt28BC.B.12) T27X in both sierralocal and sierrapy.

ArtPoon commented 1 year ago

See if updating the nucamino binaries resolves this

WilliamZekaiWang commented 1 year ago

The X amino acid problem still persists in post-align. Perhaps it's not an issue with the alignment program but the way we are interpreting indels, as we immediately categorize them as X.

Will look deeper into the functions we've already made

ArtPoon commented 1 year ago

@WilliamZekaiWang to merge dev branch into postalign to see if this resolves the above issue, then do a PR from postalign to dev and complete this issue in the dev branch.