Closed GopiGugan closed 1 year ago
Potential explanations for the missing fields, in regards to mutation output, found through analyzing files under: sierra-core/src/main/java/edu/stanford/hivdb/mutations/AAmutations.java and hivdb/hivfacts/data.
isApobecMutation
: a list of APOBEC Mutations found on the hivdb-database website matches the apobec.json files found in hivfacts.
isUnusual
: Under the datafiles section, Unusual mutations, the hivdb-database website shows the following CSV file. The boolean values for isUnusual in the CSV match the JSON file, mutation-type-pairs_hiv1.json and mutation-type-pairs_hiv2.json found in hivfacts.
isSDRM
: The compilation of the SDRM resulting mutation on the hivdb-stanford website for surveillance mutations matches with sdrm_hiv1 and sdrm_hiv2 in hivfacts/data.
isPrimaryMutations
: in the AAMutation.java file, isPrimaryMutation seems to just call for "mutation type", which is found the same file with the boolean values of isUnusual in mutation-type-pairs_hiv1.json and mutation-type-pairs_hiv2.json
hasStop
: checks if the resulting mutation results in a stop codon by comparing it to the amino acid sequence
text
: output is the concatenation of consensus
position
AAs
outputs in that order
Comparing results to sierrapy output with the test fasta file with JSON files in hivdb
isApobecMutation
when true, always matches with a mutation listed in apobec.json
isSDRM
is in the same case with isApobecMutation as it always matched with sdrms_hiv
primaryType
seems to hold true too with mutation-type-pairs.json
isUnusual
was consistent with the JSON files, and the CSV file. If a singular AA matched with the CSV file for True it would return True.
level
for drugs is determined off of the calculated score, where on the HIVdb website it summarizes states that
The ordering of Amino Acids in mutations for sierrapy and sierra-local isn't always consistent in the order they are written | Sierrapy | Sierra-Local |
---|---|---|
{ "consensus": "D", "position": 185, "AAs": "DE", "isInsertion": false, "isDeletion": false, "isApobecMutation": false, "isApobecDRM": false, "isUnusual": true, "isSDRM": false, "hasStop": false, "primaryType": "Other", "text": "D185DE" }, |
{ "consensus": "D", "position": 185, "AAs": "ED", "isInsertion": false, "isDeletion": false, "isApobecDRM": false, "text": "D185ED" }, |
Where in sierra-local it states ED instead of DE
Sorting the AAs seemed to fix the issue above.
All mutations fields seem to work except isUnusual
. Going to try with the CSV file on the hivdb download page
isUnusual
seems to work with the csv file
Adding SDRMs
, alignedNAs
, alignedAAs
, prettyPairwise
, seems as they are just outputting already analyzed data
Converting CSV into a nested dictionary is always faster than linear search:
import csv
rx = {}
reader = csv.DictReader(open("rx-all_subtype-all.csv", encoding='utf-8-sig'))
for row in reader:
gene = row['gene']
pos = row['position']
aa = row['aa']
unusual = row['isUnusual']
if gene not in rx:
rx.update({gene: {}})
if pos not in rx[gene]:
rx[gene].update({pos: {}})
rx[gene][pos].update({aa: unusual})
All fields have been added and seem to be working. positionLine
was created by taking the first and last AA positions and making a list with all values in between. This is working with the RT.file.
primaryType
detects when the mutation isn't just "other" when compared to the associated file in HIVfacts but returns an incorrect response for which drug.
There also seem to be more fields missing
Sierralocal | Sierrapy |
``` { "inputSequence": { "header": "U54771.CM240.CRF01_AE.0" }, "subtypeText": "", "validationResults": [], ``` | ``` [ { "inputSequence": { "header": "U54771.CM240.CRF01_AE.0", "SHA512": "873aa696499f3eaba28a2e39853afbe337373222913d13d4030d2b2cfb8bdf7acb81b98fab1019e48d9daea263e78a7f1a685eac42bc337c38ebc8a92e019325" }, "strain": { "name": "HIV1" }, "subtypeText": "CRF01_AE (0.65%)", "validationResults": [ { "level": "NOTE", "message": "There is one unusual mutation at a drug-resistance position in RT: M184K." } ], ``` |
Addition of the SHA512
, strain
and name
fields
Sierralocal also does not seem to output the same validationResults
with the appropriate message
and level
I was checking with HIV2A/B mutations rather than HIV1 for validationResults
, it seems to be fine
When there is an validationResults
output for sierrapy with the statement "... positions were not sequenced or aligned", sierralocal incorrectly assumes one extra AA
< "validationResults": [],
---
> "validationResults": [
> {
> "level": "WARNING",
> "message": "203 positions were not sequenced or aligned: RT 358-560. However, none is at drug-resistance position."
> }
> ],
5c10
< "lastAA": 358,
---
> "lastAA": 357,
8,9c13,16
< "length": null
< },
---
> "length": 560
> },
> "mutations": [
> {
sha512 output was the sha512 of the given NA sequence
Please manually compare the raw sequences that are reported to have different lengths. Is the extra character due to an invisible Unicode symbol?
For classifying input sequences into HIV-1 versus HIV-2, we might need to use their pairwise alignment method (nucamino) to align each input sequence to a consensus genome sequence of each "strain" and count differences for a crude distance measure, and classify accordingly. Let's put this off as a separate issue.
Make validation a separate issue.
an issue with AAs output in mutations from sierralocal and sierrapy. Sequences are the sequences from the test RT file
Sequence: AF090509.260Pt28BC.B.12
Sierra Local {'consensus': 'T', 'position': 27, 'AAs': 'X',
Sierrapy {'consensus': 'T', 'position': 27, 'AAs': 'IKMNRST',
Sequence: AF021275.JIDP16W0.B.15
Sierra local {'consensus': 'K', 'position': 220, 'AAs': 'X'
Sierrapy {'consensus': 'K', 'position': 220, 'AAs': '*ACDEFGHIKLMNPQRSTVWY'
Sequence: AF021280.JIDP7W0.B.16
Sierra local {'consensus': 'K', 'position': 219, 'AAs': 'X'
Sierrapy {'consensus': 'K', 'position': 219, 'AAs': 'IKMNRST'
There are more like these. However, the text field shows (for AF090509.260Pt28BC.B.12) T27X in both sierralocal and sierrapy.
See if updating the nucamino
binaries resolves this
The X amino acid problem still persists in post-align. Perhaps it's not an issue with the alignment program but the way we are interpreting indels, as we immediately categorize them as X.
Will look deeper into the functions we've already made
@WilliamZekaiWang to merge dev
branch into postalign
to see if this resolves the above issue, then do a PR from postalign
to dev
and complete this issue in the dev
branch.
postalign
dev
dev
from remote to localpostalign
git merge dev
to incorporate changes from dev
into postalign
postalign
to remote and issue a PR from postalign
to dev
Output JSON files from both
sierra-local
andsierrapy
:output_files.zip
Main differences:
1) Mutation has extra
isApobecMutation
,isUnusual
,isSDRM
,hasStop
,primaryType
andtext
fieldslevel
fieldSDRMs
,alignedNAs
,alignedAAs
,prettyPairwise
fields