genetic-tools / primerbench


Correctly handle MAFs from multiple sources #17

Open GaretJax opened 6 years ago

GaretJax commented 6 years ago

Some variants have different MAFs for the same population coming from different databases (for example https://genetic.tools/labs/test/primers/98/forward/21890486...21890505/).

What's the correct way to handle these? Just consider the maximum one?

/cc @AnCaTjin

AnCaTjin commented 6 years ago

Sorry for the late reply, I had turned off notifications for new posts and was at a conference last week.

That is a really good question. The safest way would obviously be to consider the maximum one, but I think one also has to consider the amount of data in the different databases: I would assume that the more data, the more reliable the predicted frequency. I would suggest using the maximum one for now. It is the safest solution and the information from the other databases is still there, so nothing is lost.
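A minimal sketch of the "take the maximum" rule discussed here. It assumes each frequency arrives as a (population, source, MAF) tuple; that record layout, the function name, and the sample values are illustrative and not primerbench's actual data model.

```python
def max_maf_by_population(records):
    """Return, per population, the highest MAF and the source reporting it.

    `records` is an iterable of (population, source, maf) tuples. Entries
    whose MAF is None (no frequency reported) are skipped.
    """
    best = {}
    for population, source, maf in records:
        if maf is None:
            continue
        if population not in best or maf > best[population][0]:
            best[population] = (maf, source)
    return best


# Hypothetical values, for illustration only.
records = [
    ("EUR", "1000Genomes", 0.012),
    ("EUR", "gnomAD", 0.015),
    ("AFR", "1000Genomes", None),
    ("AFR", "gnomAD", 0.002),
]
result = max_maf_by_population(records)
```

Keeping the source next to the chosen value means the per-database figures can still be displayed, so nothing is lost by reporting the maximum.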

GaretJax commented 6 years ago

No problem. Beat told me so. ;-)

I am working on custom checks on lab/primer basis; would an option to select which databases to use (or the order of preference) be something which might solve this issue?
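As a rough sketch of what an "order of preference" option could look like: the function below walks a user-supplied list of databases and returns the first one that reports a value. The function name, the dictionary layout, and the sample numbers are assumptions made for illustration.

```python
def pick_maf(mafs_by_source, preference):
    """Return (source, maf) from the first preferred source that reports a MAF.

    Returns (None, None) when none of the preferred sources has a value.
    """
    for source in preference:
        maf = mafs_by_source.get(source)
        if maf is not None:
            return source, maf
    return None, None


# Hypothetical values, for illustration only.
mafs = {"1000Genomes": 0.012, "gnomAD": 0.015, "NHLBI": None}
choice = pick_maf(mafs, ["NHLBI", "gnomAD", "1000Genomes"])
```

Restricting the preference list to a single database gives the "select which databases to use" behaviour with the same function.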

AnCaTjin commented 6 years ago

I like the idea, so everyone can choose their own preferred database, as long as it is not too complicated to program and does not overwhelm the user with too much choice.

AnCaTjin commented 6 years ago

[Images: maf_2, maf_1]

AnCaTjin commented 6 years ago

Are you using only 1000Genomes as a MAF source now? The example above is a new primer. The lower picture shows no MAFs, but when I followed the link to gnomAD (upper picture) there is a MAF (only a very low one, but still).

GaretJax commented 6 years ago

No, we're still using everything that Ensembl sends us. Sadly their database is not up to date with regard to this primer.

The strange thing is that the page at http://grch37.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=17:78185858-78186858;v=rs775329488;vdb=variation;vf=141108533 shows the population frequencies, but that data is not returned by their API:

[
    {
        "allele_string": "C/T",
        "assembly_name": "GRCh37",
        "colocated_variants": [
            {
                "allele_string": "C/T",
                "end": 78186358,
                "id": "rs775329488",
                "seq_region_name": 17,
                "start": 78186358,
                "strand": 1
            }
        ],
        "end": 78186358,
        "id": "rs775329488",
        "input": "rs775329488",
        "most_severe_consequence": "non_coding_transcript_exon_variant",
        "seq_region_name": "17",
        "start": 78186358,
        "strand": 1,
        "transcript_consequences": [
            {
                "biotype": "protein_coding",
                "consequence_terms": [
                    "intron_variant"
                ],
                "gene_id": "ENSG00000181523",
                "gene_symbol": "SGSH",
                "gene_symbol_source": "HGNC",
                "hgnc_id": 10818,
                "impact": "MODIFIER",
                "strand": -1,
                "transcript_id": "ENST00000326317",
                "variant_allele": "T"
            },
            ...
        ]
    }
]

I'll have to investigate whether they changed their API somehow, but so far it looks like that information is still available for other variants.
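One way to spot affected variants is to scan a parsed response for frequency-like fields in `colocated_variants`. This is only a sketch: the key names it looks for ("frequencies", "minor_allele_freq", or anything ending in "_af" / "_maf") are assumptions about how frequency data is labelled and should be checked against the response format actually in use. The sample input is a trimmed copy of the response above.

```python
def frequency_fields(vep_response):
    """Map each colocated variant id to the frequency-like keys it carries.

    Assumption: frequency data, when present, sits under keys named
    "frequencies" or "minor_allele_freq", or ending in "_af" / "_maf".
    """
    found = {}
    for entry in vep_response:
        for variant in entry.get("colocated_variants", []):
            keys = sorted(
                key for key in variant
                if key in ("frequencies", "minor_allele_freq")
                or key.endswith(("_af", "_maf"))
            )
            found[variant.get("id")] = keys
    return found


# Trimmed copy of the response shown above.
response = [{
    "id": "rs775329488",
    "colocated_variants": [{
        "allele_string": "C/T",
        "end": 78186358,
        "id": "rs775329488",
        "seq_region_name": 17,
        "start": 78186358,
        "strand": 1,
    }],
}]
missing = [rsid for rsid, keys in frequency_fields(response).items() if not keys]
```

Here `missing` lists the variants for which no frequency-like field came back.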

GaretJax commented 6 years ago

It looks like the MAF is very low for that primer; that might be the reason the data is not included in the API response...

AnCaTjin commented 6 years ago

I know the MAF is very low; it just came up while I was comparing results from the tool with our current makeshift workflow (searching for the primer in Alamut). There, the SNP was displayed in gnomAD with a very low frequency. It is strange that the frequency appears on the Ensembl site but not in the API response. I hope this is not because the archive sites are unsupported...

GaretJax commented 6 years ago

The first part of this (support and display exact information for all available databases) has been implemented. The second part comes next (letting the user choose which databases he/she would like to query).

AnCaTjin commented 6 years ago

[Image: maf_gnomad]

I am still experiencing some problems with the MAFs (example above). Red arrow: the MAF of this variant is shown as n/a in genetic tools and as 0.0003578 in gnomAD. Is this only a display issue, for example because only a certain number of decimal places are displayed? Black arrow: gnomAD seems to have a problem with the rs numbers; most of the links result in an error.

GaretJax commented 6 years ago

That's what is returned by the Ensembl API (no frequencies).

I still use GRCh37 by default on all primers. I'll work on supporting both (user-selectable, GRCh38 by default) soon; it should be really easy. Maybe we will get the frequencies for GRCh38.
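A small sketch of how a user-selectable assembly might map to a request URL, assuming the Ensembl REST servers at rest.ensembl.org (GRCh38) and grch37.rest.ensembl.org (GRCh37) and their `vep/human/id/` endpoint. The function and constant names are hypothetical, and the code only builds the URL; it does not send a request.

```python
REST_SERVERS = {
    "GRCh38": "https://rest.ensembl.org",
    "GRCh37": "https://grch37.rest.ensembl.org",
}


def vep_url(rsid, assembly="GRCh38"):
    """Build the VEP lookup URL for a variant id on the chosen assembly."""
    try:
        server = REST_SERVERS[assembly]
    except KeyError:
        raise ValueError(f"unsupported assembly: {assembly!r}") from None
    return f"{server}/vep/human/id/{rsid}"


url_37 = vep_url("rs775329488", assembly="GRCh37")
url_38 = vep_url("rs775329488")
```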

For gnomAD, it's a known issue: the current implementation is overly simplistic and external references will be reworked.

AnCaTjin commented 6 years ago

I just find it confusing, as gnomAD, NHLBI, and 1000Genomes are listed separately, yet no MAF is shown for any of them.

GaretJax commented 6 years ago

Hi Ann-Kathrin, I just released an improvement over the previous way we were linking to gnomAD. There are still issues, but I think it's mostly because the variants are really not known to gnomAD.

AnCaTjin commented 6 years ago

Hi Jonathan, I see the link is now working, but there are still no MAFs displayed. What do you mean by "the variants are really not known to gnomAD"? I can see the variant in the gnomAD browser (see picture above).

GaretJax commented 6 years ago

There are two issues here:

  1. The links not working. That is fixed for variants known to gnomAD (this is what I was referring to above).
  2. The missing MAFs. This will not be fixed, as we rely on Ensembl to provide those, and they are not doing so, presumably because gnomAD does not have the dbSNP rs* identifier on file (as you can see in your screenshot above).

I'll reach out to Ensembl asking if this is intended or something they can fix on their side, but more than that we will not be able to do.

AnCaTjin commented 6 years ago

I see. But could this be a problem only with Ensembl GRCh37? Because in the GRCh38 version the variant is known in Ensembl. Unfortunately I cannot reach the GRCh37 version right now.

GaretJax commented 6 years ago

It's a problem in both GRCh37 and GRCh38, but interestingly, only for the API. The web view shows the expected results.

I wrote to them, let's see what they answer.

AnCaTjin commented 6 years ago

Hi Jonathan! What did Ensembl say?