Closed erikyao closed 8 months ago
@erikyao thanks for the detailed info.
For the first issue, yes, I agree, it's a bug that should be fixed.
For the second issue, I think these issues are related to the same variant on dna sequence, but multiple protein sequence it can be translated to. Yes, we should merge them ideally, but we need to decide how we want to construct the merged fields. Previously, we skipped that part except for the aa
field. I think you can proceed with a merging strategy and we can then go over the merged doc to see if anything may cause confusion.
Current Merging Strategy
Current merging strategy is like:
_id
, they are mergeable.aa
field will be merged, other fields are carried over from the very first document in the merging process.A bug w.r.t. Step 1
Current code (dbnsfp_parser.py#L648) has a bug that, if the
previous_row
is valid butcurrent_row
is not (when missingpos_hg19
field forhg19
, see dbnsfp_parser.py#L335), theprevious_row
is discarded from generation.E.g. the 2779th and 2780th row of
dbNSFP4.4a_variant.chrM.gz
are like:The 2779th row will be discarded when running the current parser.
Correct behavior should be keeping
previous_row
, proceeding to nextcurrent_row
, and checking if they are mergeable. If not, yieldprevious_row
.This bug is to be fixed in along with the implementation of https://github.com/biothings/myvariant.info/issues/172.
Plus, the validity check should also apply to
pos(1-based)
field forhg38
. Previously we only checkpos_hg19
because we assumed that whenpos_hg19
is missing,pos(1-based)
would also be missing because it's generated with LiftOver. It's better to check the real data instead of holding such an assumption. This validity check will also be implemented with https://github.com/biothings/myvariant.info/issues/172.Question towards Step 2
Currently only the
aa
field will be merged (see dbnsfp_parser.py#L629); however other fields can be different but we just ignored them. E.g. below are two real documents generated fromdbNSFP4.4a_variant.chrM.gz
, before merging:Click to expand doc No.1
```python { "_id": "chrMT:g.8528A>C", "dbnsfp": { "aa": { "alt": "N", "codon_degeneracy": 2, "codonpos": 3, "pos": 54, "ref": "K", "refcodon": "AAA" }, "alt": "C", "ancestral_allele": "A", "appris": "principal1", "bayesdel": { "add_af": { "pred": "T", "rankscore": 0.39171, "score": -0.0833382 }, "no_af": { "pred": "T", "rankscore": 0.38343, "score": -0.357486 } }, "cds_strand": "+", "chrom": "MT", "deogen2": { "pred": "T", "rankscore": 0.75038, "score": 0.390127 }, "ensembl": { "geneid": "ENSG00000228253", "proteinid": "ENSP00000355265", "transcriptid": "ENST00000361851" }, "fathmm": { "converted": { "rankscore": 0.27032 }, "pred": "T", "score": 1.69 }, "genecode_basic": "Y", "genename": "MT-ATP8", "gerp++": { "nr": 5.07, "rs": 3.69, "rs_rankscore": 0.41483 }, "hg18": { "end": 8528, "start": 8528 }, "hg19": { "end": 8528, "start": 8528 }, "hg38": { "end": 8527, "start": 8527 }, "hgvsc": "c.162A>C", "hgvsp": "p.Lys54Asn", "list-s2": { "pred": "D", "rankscore": 0.9202, "score": 0.977202 }, "mutationtaster": { "aae": "K54N", "converted": { "rankscore": 0.37323 }, "model": "simple_aae", "pred": "D", "score": 0.941143 }, "phastcons": { "100way_vertebrate": { "rankscore": 0.71638, "score": 1 }, "17way_primate": { "rankscore": 0.14061, "score": 0.038 }, "470way_mammalian": { "rankscore": 0.28948, "score": 0.948 } }, "phylop": { "100way_vertebrate": { "rankscore": 0.6969, "score": 5.913 }, "17way_primate": { "rankscore": 0.7064, "score": 0.673 }, "470way_mammalian": { "rankscore": 0.22764, "score": 0.929 } }, "provean": { "converted": { "rankscore": 0.79143 }, "pred": "D", "score": -4.62 }, "ref": "A", "sift": { "converted": { "rankscore": 0.91255 }, "pred": "D", "score": 0 }, "sift4g": { "converted": { "rankscore": 0.83351 }, "pred": "D", "score": 0.001 }, "uniprot": { "acc": "P03928", "entry": "ATP8_HUMAN" }, "varity_er": { "rankscore": 0.96288, "score": 0.9184425 }, "varity_er_loo": { "rankscore": 0.96289, "score": 0.9184425 }, "varity_r": { "rankscore": 0.96524, "score": 0.9525343 }, "varity_r_loo": { "rankscore": 0.96524, "score": 0.9525343 }, "vep_canonical": "YES" } } ```Click to expand doc No.2
```python { "_id": "chrMT:g.8528A>C", "dbnsfp": { "aa": { "alt": "L", "codon_degeneracy": 0, "codonpos": 1, "pos": 1, "ref": "M", "refcodon": "ATG" }, "alt": "C", "ancestral_allele": "A", "appris": "principal1", "bayesdel": { "add_af": { "pred": "T", "rankscore": 0.39171, "score": -0.0833382 }, "no_af": { "pred": "T", "rankscore": 0.38343, "score": -0.357486 } }, "cds_strand": "+", "chrom": "MT", "deogen2": { "pred": "T", "rankscore": 0.60073, "score": 0.233842 }, "ensembl": { "geneid": "ENSG00000198899", "proteinid": "ENSP00000354632", "transcriptid": "ENST00000361899" }, "fathmm": { "converted": { "rankscore": 0.01516 }, "pred": "T", "score": 4.81 }, "genecode_basic": "Y", "genename": "MT-ATP6", "gerp++": { "nr": 5.07, "rs": 3.69, "rs_rankscore": 0.41483 }, "hg18": { "end": 8528, "start": 8528 }, "hg19": { "end": 8528, "start": 8528 }, "hg38": { "end": 8527, "start": 8527 }, "hgvsc": "c.1A>C", "hgvsp": "p.Met1?", "list-s2": { "pred": "D", "rankscore": 0.65058, "score": 0.9 }, "mutationtaster": { "aae": "K54N", "converted": { "rankscore": 0.37323 }, "model": "simple_aae", "pred": "D", "score": 0.941143 }, "phastcons": { "100way_vertebrate": { "rankscore": 0.71638, "score": 1 }, "17way_primate": { "rankscore": 0.14061, "score": 0.038 }, "470way_mammalian": { "rankscore": 0.28948, "score": 0.948 } }, "phylop": { "100way_vertebrate": { "rankscore": 0.6969, "score": 5.913 }, "17way_primate": { "rankscore": 0.7064, "score": 0.673 }, "470way_mammalian": { "rankscore": 0.22764, "score": 0.929 } }, "provean": { "converted": { "rankscore": 0.54059 }, "pred": "N", "score": -2.48 }, "ref": "A", "sift4g": { "converted": { "rankscore": 0.6789 }, "pred": "D", "score": 0.008 }, "uniprot": { "acc": "P00846", "entry": "ATP6_HUMAN" }, "varity_er": { "rankscore": 0.85876, "score": 0.7609408 }, "varity_er_loo": { "rankscore": 0.85877, "score": 0.7609408 }, "varity_r": { "rankscore": 0.84131, "score": 0.8035125 }, "varity_r_loo": { "rankscore": 0.84133, "score": 0.8035125 }, "vep_canonical": "YES" } } ```Their diff is:![www jsondiff com_](https://github.com/biothings/myvariant.info/assets/4813680/eefd76d7-8e29-4bfd-acd4-623f6ac3f784)
Shall we merge these two documents into one like this?