Sierra-local output file does not match with Sierrapy

GopiGugan commented 2 years ago

Output JSON files from both sierra-local and sierrapy :

Main differences:

1) Mutation has extra isApobecMutation, isUnusual, isSDRM, hasStop, primaryType and text fields

sierra-local	sierrapy
```json "mutations": [ { "consensus": "E", "position": 6, "AAs": "D", "isInsertion": false, "isDeletion": false, "isApobecDRM": false }, ] ```	```json "mutations": [ { "consensus": "E", "position": 6, "AAs": "D", "isInsertion": false, "isDeletion": false, "isApobecMutation": false, "isApobecDRM": false, "isUnusual": false, "isSDRM": false, "hasStop": false, "primaryType": "Other", "text": "E6D" }, ] ```

Addition of the level field

sierra-local	sierrapy
```json "drugScores": [ { "drugClass": { "name": "NRTI" }, "drug": { "name": "ABC", "displayAbbr": "ABC" }, "score": 0.0, "partialScores": [], "text": "Susceptible" }, ] ```	```json "drugScores": [ { "drugClass": { "name": "NRTI" }, "drug": { "name": "ABC", "displayAbbr": "ABC" }, "score": 0.0, "level": 1, "partialScores": [], "text": "Susceptible" }, ] ```

Addition of the SDRMs, alignedNAs, alignedAAs, prettyPairwise fields

  "SDRMs": [],
  "alignedNAs": "CCAATTAGTCCTATTGACACTGTACCAGTAACATTAAAGCCAGGAATGGATGGACCAAAGGTTAAACAGTGGCCATTGACAGAAGAAAAAATAAAAGCATTAACAGAAATTTGTAAAGAGATGGAAGAGGAAGGAAAAATCTCAAAAATTGGGCCTGAAAATCCATACAATACTCCAGTATTTGCTATAAAGAAAAAGGACAGCACCAAATGGAGGAAATTAGTAGATTTCAGAGAGCTCAATAAAAGAACTCAGGACTTTTGGGAAGTTCAATTAGGAATACCGCATCCAGCAGGTTTAAAAAAGAAAAAATCAGTAACAGTACTAGATGTGGGAGATGCATATTTTTCAGTTCCTTTAGATGAAAGCTTTAGAAAGTATACTGCATTCACCATACCTAGTATAAACAATGAGACACCAGGAATCAGATATCAGTACAATGTGCTGCCACAGGGATGGAAAGGATCACCGGCAATATTCCAGAGTAGCATGACAAAAATCTTAGAGCCCTTTAGAATAAAAAATCCAGAAATGGTTATCTATCAATACAAGGATGACTTGTATGTAGGATCTGATTTAGAAATAGGGCAGCACAGAACAAAAATAGAGGAGCTAAGAGCTCATCTATTGAGCTGGGGATTTACTACACCAGACAAAAAGCATCAGAAGGAACCTCCATTCCTTTGGATGGGATATGAACTCCATCCTGACAGATGGACAGTCCAGCCTATAGAACTGCCAGAAAAAGACAGCTGGACTGTCAATGATATACAGAAATTAGTGGGAAAACTAAATTGGGCAAGTCAAATTTATGCAGGGATTAAGGTAAAGCAACTGTGTAAACTCCTCAGGGGAGCTAAAGCACTAACAGACATAGTACCACTGACTGAAGAAGCAGAATTAGAGTTGGCAGAGAACAGGGAGATTCTAAAAACCCCTGTGCATGGAGTATATTATGACCCATCAAAAGACTTAGTAGCAGAAGTACAGAAACAAGGGCAGGACCAATGGACATATCAAATTTATCAAGAGCCATTTAAAAATCTAAAAACAGGAAAATATGCCAGAAGAGGGTCTGCTCACACTAATGATGTAAGACAATTAACAGAAGTGGTGCAAAAAGTAGCCACAGAAAGCATAGTAATATGGGGAAAGACCCCTAAATTTAGACTACCCATACAAAGAGAAACATGGGAAACATGGTGGATGGAGTATTGGCAGGCTACCTGGATTCCTGAATGGGAGTTTGTTAATACCCCTCCTCTAGTAAAATTATGGTACCAATTAGAAAAAGACCCCATAGTAGGAGCAGAGACTTTCTATGTAGATGGGGCAGCTAGTAGGGAGACTAAGCTAGGAAAAGCAGGGTATGTCACTGACAGAGGAAGACAAAAGGTAGTTTCCCTAACTGAGACAACAAATCAAAAGACTGAATTACATGCGATCCATTTAGCCTTGCAGGATTCAGGATCAGAAGTAAATATAGTAACAGACTCACAATATGCATTAGGAATCATTCAGGCACAACCAGACAGGAGTGAATCAGAAGTAGTCAACCAAATAATAGAGGAGCTAATAAAAAAGGAGAAAGTCTACCTGTCATGGGTACCAGCACACAAGGGGATTGGAGGAAATGAACAAGTAGATAAATTAGTCAGTTCAGGAATCAGGAAGGTGCTA",
  "alignedAAs": "PISPIDTVPVTLKPGMDGPKVKQWPLTEEKIKALTEICKEMEEEGKISKIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDESFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRIKNPEMVIYQYKDDLYVGSDLEIGQHRTKIEELRAHLLSWGFTTPDKKHQKEPPFLWMGYELHPDRWTVQPIELPEKDSWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRGAKALTDIVPLTEEAELELAENREILKTPVHGVYYDPSKDLVAEVQKQGQDQWTYQIYQEPFKNLKTGKYARRGSAHTNDVRQLTEVVQKVATESIVIWGKTPKFRLPIQRETWETWWMEYWQATWIPEWEFVNTPPLVKLWYQLEKDPIVGAETFYVDGAASRETKLGKAGYVTDRGRQKVVSLTETTNQKTELHAIHLALQDSGSEVNIVTDSQYALGIIQAQPDRSESEVVNQIIEELIKKEKVYLSWVPAHKGIGGNEQVDKLVSSGIRKVL",
  "prettyPairwise": {
    "positionLine": [
    ],
    "refAALine": [
    ],

    "alignedNAsLine": [
    ],
    "mutationLine": [
    ]

GopiGugan commented 2 years ago

Use timing.py to generate sierrapy output file

WilliamZekaiWang commented 2 years ago

Potential explanations for the missing fields, in regards to mutation output, found through analyzing files under: sierra-core/src/main/java/edu/stanford/hivdb/mutations/AAmutations.java and hivdb/hivfacts/data.

isApobecMutation: a list of APOBEC Mutations found on the hivdb-database website matches the apobec.json files found in hivfacts.
isUnusual: Under the datafiles section, Unusual mutations, the hivdb-database website shows the following CSV file. The boolean values for isUnusual in the CSV match the JSON file, mutation-type-pairs_hiv1.json and mutation-type-pairs_hiv2.json found in hivfacts.
isSDRM: The compilation of the SDRM resulting mutation on the hivdb-stanford website for surveillance mutations matches with sdrm_hiv1 and sdrm_hiv2 in hivfacts/data.
isPrimaryMutations: in the AAMutation.java file, isPrimaryMutation seems to just call for "mutation type", which is found the same file with the boolean values of isUnusual in mutation-type-pairs_hiv1.json and mutation-type-pairs_hiv2.json
hasStop: checks if the resulting mutation results in a stop codon by comparing it to the amino acid sequence
text: output is the concatenation of consensus position AAs outputs in that order

WilliamZekaiWang commented 2 years ago

Comparing results to sierrapy output with the test fasta file with JSON files in hivdb

isApobecMutation when true, always matches with a mutation listed in apobec.json

isSDRM is in the same case with isApobecMutation as it always matched with sdrms_hiv

primaryType seems to hold true too with mutation-type-pairs.json

isUnusual was consistent with the JSON files, and the CSV file. If a singular AA matched with the CSV file for True it would return True.

level for drugs is determined off of the calculated score, where on the HIVdb website it summarizes states that

a score of <10 results in level 1
a score of <15 but >=10 results in level 2
a score of <30 but >=15 results in level 3
a score of <60 but >= 30 results in level 4
a score of >= 60 results in level 5 this was seen as consistent with their output file from sierrapy

WilliamZekaiWang commented 2 years ago

The ordering of Amino Acids in mutations for sierrapy and sierra-local isn't always consistent in the order they are written	Sierrapy	Sierra-Local
{ "consensus": "D", "position": 185, "AAs": "DE", "isInsertion": false, "isDeletion": false, "isApobecMutation": false, "isApobecDRM": false, "isUnusual": true, "isSDRM": false, "hasStop": false, "primaryType": "Other", "text": "D185DE" },	{ "consensus": "D", "position": 185, "AAs": "ED", "isInsertion": false, "isDeletion": false, "isApobecDRM": false, "text": "D185ED" },

Where in sierra-local it states ED instead of DE

WilliamZekaiWang commented 2 years ago

Sorting the AAs seemed to fix the issue above.

All mutations fields seem to work except isUnusual. Going to try with the CSV file on the hivdb download page

WilliamZekaiWang commented 2 years ago

isUnusual seems to work with the csv file Adding SDRMs, alignedNAs, alignedAAs, prettyPairwise, seems as they are just outputting already analyzed data

ArtPoon commented 2 years ago

Converting CSV into a nested dictionary is always faster than linear search:

import csv
rx = {}
reader = csv.DictReader(open("rx-all_subtype-all.csv", encoding='utf-8-sig'))

for row in reader:
    gene = row['gene']
    pos = row['position']
    aa = row['aa']
    unusual = row['isUnusual']
    if gene not in rx:
        rx.update({gene: {}})
    if pos not in rx[gene]:
        rx[gene].update({pos: {}})
    rx[gene][pos].update({aa: unusual})

WilliamZekaiWang commented 1 year ago

All fields have been added and seem to be working. positionLine was created by taking the first and last AA positions and making a list with all values in between. This is working with the RT.file.

primaryType detects when the mutation isn't just "other" when compared to the associated file in HIVfacts but returns an incorrect response for which drug.

There also seem to be more fields missing

Sierralocal

Sierrapy

``` { "inputSequence": { "header": "U54771.CM240.CRF01_AE.0" }, "subtypeText": "", "validationResults": [], ```

``` [ { "inputSequence": { "header": "U54771.CM240.CRF01_AE.0", "SHA512": "873aa696499f3eaba28a2e39853afbe337373222913d13d4030d2b2cfb8bdf7acb81b98fab1019e48d9daea263e78a7f1a685eac42bc337c38ebc8a92e019325" }, "strain": { "name": "HIV1" }, "subtypeText": "CRF01_AE (0.65%)", "validationResults": [ { "level": "NOTE", "message": "There is one unusual mutation at a drug-resistance position in RT: M184K." } ], ```

Addition of the SHA512 , strain and name fields

Sierralocal also does not seem to output the same validationResults with the appropriate message and level

WilliamZekaiWang commented 1 year ago

I was checking with HIV2A/B mutations rather than HIV1 for validationResults, it seems to be fine

WilliamZekaiWang commented 1 year ago

When there is an validationResults output for sierrapy with the statement "... positions were not sequenced or aligned", sierralocal incorrectly assumes one extra AA

<     "validationResults": [],
---
>     "validationResults": [
>       {
>         "level": "WARNING",
>         "message": "203 positions were not sequenced or aligned: RT 358-560. However, none is at drug-resistance position."
>       }
>     ],
5c10
<         "lastAA": 358,
---
>         "lastAA": 357,
8,9c13,16
<           "length": null
<         },
---
>           "length": 560
>         },
>         "mutations": [
>           {

WilliamZekaiWang commented 1 year ago

sha512 output was the sha512 of the given NA sequence

ArtPoon commented 1 year ago

Please manually compare the raw sequences that are reported to have different lengths. Is the extra character due to an invisible Unicode symbol?

ArtPoon commented 1 year ago

For classifying input sequences into HIV-1 versus HIV-2, we might need to use their pairwise alignment method (nucamino) to align each input sequence to a consensus genome sequence of each "strain" and count differences for a crude distance measure, and classify accordingly. Let's put this off as a separate issue.

ArtPoon commented 1 year ago

Make validation a separate issue.

WilliamZekaiWang commented 1 year ago

an issue with AAs output in mutations from sierralocal and sierrapy. Sequences are the sequences from the test RT file

Sequence: AF090509.260Pt28BC.B.12

Sierra Local {'consensus': 'T', 'position': 27, 'AAs': 'X',

Sierrapy {'consensus': 'T', 'position': 27, 'AAs': 'IKMNRST',

Sequence: AF021275.JIDP16W0.B.15

Sierra local {'consensus': 'K', 'position': 220, 'AAs': 'X'

Sierrapy  {'consensus': 'K', 'position': 220, 'AAs': '*ACDEFGHIKLMNPQRSTVWY'

Sequence: AF021280.JIDP7W0.B.16

Sierra local {'consensus': 'K', 'position': 219, 'AAs': 'X'

Sierrapy  {'consensus': 'K', 'position': 219, 'AAs': 'IKMNRST'

There are more like these. However, the text field shows (for AF090509.260Pt28BC.B.12) T27X in both sierralocal and sierrapy.

ArtPoon commented 1 year ago

See if updating the nucamino binaries resolves this

WilliamZekaiWang commented 1 year ago

The X amino acid problem still persists in post-align. Perhaps it's not an issue with the alignment program but the way we are interpreting indels, as we immediately categorize them as X.

Will look deeper into the functions we've already made

ArtPoon commented 1 year ago

@WilliamZekaiWang to merge dev branch into postalign to see if this resolves the above issue, then do a PR from postalign to dev and complete this issue in the dev branch.

commit your local changes to postalign
checkout dev
pull latest changes in dev from remote to local
checkout postalign
git merge dev to incorporate changes from dev into postalign
after resolving any conflicts, push postalign to remote and issue a PR from postalign to dev

PoonLab / sierra-local

Sierra-local output file does not match with Sierrapy #65