biothings / myvariant.info

MyVariant.info: A BioThings API for human variant annotations
http://myvariant.info
Other
86 stars 32 forks source link

dbNSFP document-merging problems #173

Closed erikyao closed 8 months ago

erikyao commented 10 months ago

Current Merging Strategy

Current merging strategy is like:

  1. During document generation, if adjacent documents (can be more than 2 in some cases) have the same _id, they are mergeable.
  2. Only the aa field will be merged, other fields are carried over from the very first document in the merging process.

A bug w.r.t. Step 1

Current code (dbnsfp_parser.py#L648) has a bug that, if the previous_row is valid but current_row is not (when missing pos_hg19 field for hg19, see dbnsfp_parser.py#L335), the previous_row is discarded from generation.

E.g. the 2779th and 2780th row of dbNSFP4.4a_variant.chrM.gz are like:

#chr pos(1-based) ref alt aaref aaalt rs_dbSNP hg19_chr hg19_pos(1-based)
M 4768 T G I R . M 4769
M 4769 A G I M rs3021086 . .

The 2779th row will be discarded when running the current parser.

Correct behavior should be keeping previous_row, proceeding to next current_row, and checking if they are mergeable. If not, yield previous_row.

This bug is to be fixed in along with the implementation of https://github.com/biothings/myvariant.info/issues/172.

Plus, the validity check should also apply to pos(1-based) field for hg38. Previously we only check pos_hg19 because we assumed that when pos_hg19 is missing, pos(1-based) would also be missing because it's generated with LiftOver. It's better to check the real data instead of holding such an assumption. This validity check will also be implemented with https://github.com/biothings/myvariant.info/issues/172.

Question towards Step 2

Currently only the aa field will be merged (see dbnsfp_parser.py#L629); however other fields can be different but we just ignored them. E.g. below are two real documents generated from dbNSFP4.4a_variant.chrM.gz, before merging:

Click to expand doc No.1 ```python { "_id": "chrMT:g.8528A>C", "dbnsfp": { "aa": { "alt": "N", "codon_degeneracy": 2, "codonpos": 3, "pos": 54, "ref": "K", "refcodon": "AAA" }, "alt": "C", "ancestral_allele": "A", "appris": "principal1", "bayesdel": { "add_af": { "pred": "T", "rankscore": 0.39171, "score": -0.0833382 }, "no_af": { "pred": "T", "rankscore": 0.38343, "score": -0.357486 } }, "cds_strand": "+", "chrom": "MT", "deogen2": { "pred": "T", "rankscore": 0.75038, "score": 0.390127 }, "ensembl": { "geneid": "ENSG00000228253", "proteinid": "ENSP00000355265", "transcriptid": "ENST00000361851" }, "fathmm": { "converted": { "rankscore": 0.27032 }, "pred": "T", "score": 1.69 }, "genecode_basic": "Y", "genename": "MT-ATP8", "gerp++": { "nr": 5.07, "rs": 3.69, "rs_rankscore": 0.41483 }, "hg18": { "end": 8528, "start": 8528 }, "hg19": { "end": 8528, "start": 8528 }, "hg38": { "end": 8527, "start": 8527 }, "hgvsc": "c.162A>C", "hgvsp": "p.Lys54Asn", "list-s2": { "pred": "D", "rankscore": 0.9202, "score": 0.977202 }, "mutationtaster": { "aae": "K54N", "converted": { "rankscore": 0.37323 }, "model": "simple_aae", "pred": "D", "score": 0.941143 }, "phastcons": { "100way_vertebrate": { "rankscore": 0.71638, "score": 1 }, "17way_primate": { "rankscore": 0.14061, "score": 0.038 }, "470way_mammalian": { "rankscore": 0.28948, "score": 0.948 } }, "phylop": { "100way_vertebrate": { "rankscore": 0.6969, "score": 5.913 }, "17way_primate": { "rankscore": 0.7064, "score": 0.673 }, "470way_mammalian": { "rankscore": 0.22764, "score": 0.929 } }, "provean": { "converted": { "rankscore": 0.79143 }, "pred": "D", "score": -4.62 }, "ref": "A", "sift": { "converted": { "rankscore": 0.91255 }, "pred": "D", "score": 0 }, "sift4g": { "converted": { "rankscore": 0.83351 }, "pred": "D", "score": 0.001 }, "uniprot": { "acc": "P03928", "entry": "ATP8_HUMAN" }, "varity_er": { "rankscore": 0.96288, "score": 0.9184425 }, "varity_er_loo": { "rankscore": 0.96289, "score": 0.9184425 }, "varity_r": { "rankscore": 0.96524, "score": 0.9525343 }, "varity_r_loo": { "rankscore": 0.96524, "score": 0.9525343 }, "vep_canonical": "YES" } } ```
Click to expand doc No.2 ```python { "_id": "chrMT:g.8528A>C", "dbnsfp": { "aa": { "alt": "L", "codon_degeneracy": 0, "codonpos": 1, "pos": 1, "ref": "M", "refcodon": "ATG" }, "alt": "C", "ancestral_allele": "A", "appris": "principal1", "bayesdel": { "add_af": { "pred": "T", "rankscore": 0.39171, "score": -0.0833382 }, "no_af": { "pred": "T", "rankscore": 0.38343, "score": -0.357486 } }, "cds_strand": "+", "chrom": "MT", "deogen2": { "pred": "T", "rankscore": 0.60073, "score": 0.233842 }, "ensembl": { "geneid": "ENSG00000198899", "proteinid": "ENSP00000354632", "transcriptid": "ENST00000361899" }, "fathmm": { "converted": { "rankscore": 0.01516 }, "pred": "T", "score": 4.81 }, "genecode_basic": "Y", "genename": "MT-ATP6", "gerp++": { "nr": 5.07, "rs": 3.69, "rs_rankscore": 0.41483 }, "hg18": { "end": 8528, "start": 8528 }, "hg19": { "end": 8528, "start": 8528 }, "hg38": { "end": 8527, "start": 8527 }, "hgvsc": "c.1A>C", "hgvsp": "p.Met1?", "list-s2": { "pred": "D", "rankscore": 0.65058, "score": 0.9 }, "mutationtaster": { "aae": "K54N", "converted": { "rankscore": 0.37323 }, "model": "simple_aae", "pred": "D", "score": 0.941143 }, "phastcons": { "100way_vertebrate": { "rankscore": 0.71638, "score": 1 }, "17way_primate": { "rankscore": 0.14061, "score": 0.038 }, "470way_mammalian": { "rankscore": 0.28948, "score": 0.948 } }, "phylop": { "100way_vertebrate": { "rankscore": 0.6969, "score": 5.913 }, "17way_primate": { "rankscore": 0.7064, "score": 0.673 }, "470way_mammalian": { "rankscore": 0.22764, "score": 0.929 } }, "provean": { "converted": { "rankscore": 0.54059 }, "pred": "N", "score": -2.48 }, "ref": "A", "sift4g": { "converted": { "rankscore": 0.6789 }, "pred": "D", "score": 0.008 }, "uniprot": { "acc": "P00846", "entry": "ATP6_HUMAN" }, "varity_er": { "rankscore": 0.85876, "score": 0.7609408 }, "varity_er_loo": { "rankscore": 0.85877, "score": 0.7609408 }, "varity_r": { "rankscore": 0.84131, "score": 0.8035125 }, "varity_r_loo": { "rankscore": 0.84133, "score": 0.8035125 }, "vep_canonical": "YES" } } ```

Their diff is: www jsondiff com_

Shall we merge these two documents into one like this?

{
    "_id": "chrMT:g.8528A>C",
    "dbnsfp": [
        {
            "aa": { "blahblah" },
            "alt": "C",
            "ancestral_allele": "A",
            "appris": "principal1",
            # omitted
        },
        {
            "aa": { "blahblah" }
            "alt": "C",
            "ancestral_allele": "A",
            "appris": "principal1",
            # omitted
        }
    ]
}
newgene commented 10 months ago

@erikyao thanks for the detailed info.

erikyao commented 8 months ago

Fixed after https://github.com/biothings/myvariant.info/pull/178