arvkevi / ezancestry

Easy genetic ancestry predictions in Python
https://ezancestry.streamlit.app
MIT License

Using latest 1k Genome dataset for building model #54

Open dbrami opened 2 years ago

dbrami commented 2 years ago

Hi, I'm exploring adding an ethnic background check to a sample QC pipeline I'm working on, and this tool seems to check all the boxes. I was involved in making a re-analysis of the latest addition of samples to the 1000 Genomes Project, produced with Illumina DRAGEN, available on AWS:

DRAGEN reanalysis of the 1000 Genomes Dataset now available on the Registry of Open Data

Although I can't find a complete aggregate bed file similar to the ones pulled by your fetch script, do you think your pipeline could easily be modified to create a model using this updated dataset? I'd love to hear your thoughts on the cost/benefit of this approach.

Thanks, Daniel Brami

arvkevi commented 2 years ago

Hi Daniel, Thanks for checking out the repo. There are a few things to consider when training a new model from this Dragen dataset.

  1. I wrote parsing functionality to process an integrated callset from bcf files. We'd have to write some functionality to generate an integrated callset for the Dragen-1kg data (unless it exists somewhere, but I couldn't quickly find it). The HG* vcf files stored in S3 are too large to store locally on my machine all at once, so if there were a way to filter them to the AISNP loci in a stream, that could work as a way to start building an integrated callset (see the sketch after this list).
  2. The two AISNP sets are currently mapped to hg19, so we'd need to lift them over to GRCh38.
  3. Hmm, thinking more about number 2: ezancestry just assumes hg19 across the training data, the AISNPs, and any predictions made against the ezancestry models. I should call this out more explicitly in the documentation.
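
A minimal sketch of the streaming filter from item 1, using pysam (my choice here; any htslib-based tool would do). It assumes the VCFs are bgzip-compressed, tabix-indexed, and reachable over HTTPS; the URL and loci below are placeholders:

import pysam

# Illustrative subset of AISNP loci as (chrom, 1-based POS) tuples
AISNP_LOCI = [("chr1", 101709563), ("chr2", 7968275)]

# Hypothetical URL; htslib issues HTTP range requests, so only the index
# and the requested slices are downloaded, never the whole file
URL = "https://example-bucket.s3.amazonaws.com/HG01082.hard-filtered.vcf.gz"

vcf_in = pysam.VariantFile(URL)
vcf_out = pysam.VariantFile("aisnp_subset.vcf", "w", header=vcf_in.header)

for chrom, pos in AISNP_LOCI:
    # fetch() takes 0-based, half-open coordinates; VCF POS is 1-based
    for record in vcf_in.fetch(chrom, pos - 1, pos):
        vcf_out.write(record)

vcf_out.close()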

Assuming 1 and 2 are possible, I see a path forward: I'd need to rewrite this function to parse the Dragen-1kg data. Which raises the question: should it be even more flexible and handle any training data?

Let me know what you think. I'm curious to hear your thoughts on how to solve number 1 above; number 2 should be straightforward enough.
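
For number 2, a liftover sketch using pyliftover (an arbitrary choice; CrossMap or UCSC liftOver would work just as well). The input position is one of the hg19 AISNP coordinates:

from pyliftover import LiftOver

# Downloads the UCSC hg19 -> hg38 chain file on first use
lo = LiftOver("hg19", "hg38")

def lift(chrom, pos):
    """Lift a 1-based VCF position; pyliftover uses 0-based coordinates."""
    hits = lo.convert_coordinate(chrom, pos - 1)
    if not hits:
        return None  # locus deleted or unmapped in GRCh38
    new_chrom, new_pos, _strand, _score = hits[0]
    return new_chrom, new_pos + 1

print(lift("chr1", 101709563))  # expect ('chr1', 101244007) in hg38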

Kevin

dbrami commented 2 years ago

Hi Kevin, Thanks for getting back to me so promptly! Before you rack your brain too much, I am waiting to hear back from my former colleagues at Illumina. I'm hoping they will be able to point me to the right files needed for this in the S3 bucket or perhaps to some other repo. Will get back to you as soon as possible. Cheers, Daniel

dbrami commented 2 years ago

Hi Kevin, I just heard back from them. They're asking what kind of files we would want for our analysis. I'm at a loss here. My naive response would be a joint genotyping file inclusive of all 3,202 genomes, perhaps broken down by chromosome, in BED format, but I'm not sure that's correct. Can you guide me?

arvkevi commented 2 years ago

The best file type would be a multi-sample VCF: each record would list genotypes for each of the 3,202 genomes at that locus. Breaking these files out by chromosome would work well. Do you know if they will host the files?

However, going back to this comment:

Would love to hear your thoughts on cost/benefit of this approach.

If I had the files above, before going through the effort of training a model on these reanalyzed genomes, I would compare the genotypes at the 128 AISNP locations between the original callset and the reanalyzed callset. If there is little difference between the original calls and the reanalyzed calls at those AISNP loci, then I don't see any benefit to training a model on the new data.
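
A sketch of that comparison (the DataFrame names and genotype encoding are illustrative, not existing ezancestry code), with each callset as a samples x AISNPs table of genotype strings:

import pandas as pd

def per_snp_concordance(original: pd.DataFrame, reanalyzed: pd.DataFrame) -> pd.Series:
    """Fraction of samples with identical genotype calls, per AISNP."""
    samples = original.index.intersection(reanalyzed.index)
    snps = original.columns.intersection(reanalyzed.columns)
    a = original.loc[samples, snps]
    b = reanalyzed.loc[samples, snps]
    # Note: missing calls (NaN) count as mismatches in this comparison
    return (a == b).mean(axis=0)

# per_snp = per_snp_concordance(kgp_genotypes, dragen_genotypes)
# per_snp.sort_values().head()  # the AISNPs where the callsets disagree most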

An interesting quasi-experiment would be to run the ezancestry predict command on (at least a subset of) the reanalyzed VCFs and check the performance of the ancestry labels: ezancestry predict /path/to/some/HG0000*.vcf.gz

If the performance is good, then you might not need to retrain a model at all.

dbrami commented 2 years ago

Hi Kevin, I'm thinking about your last comment. Given that the new data can easily be queried using AWS Athena for a few bucks, as per this article: Perform interactive queries on your genomics data using Amazon Athena or Amazon Redshift (Part 2), what fields should I be retrieving with a query, and how would I compare the results with the values used to build the model? (I'm assuming I'm comparing allele frequencies, not running a test VCF through the pipeline.) It should be pretty straightforward to find the corresponding rsIDs for the reference genome used in the analysis and use those in the AISNP query. Cheers, Daniel
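
To make the allele-frequency comparison concrete, a small sketch (hypothetical names; it assumes genotypes arrive as per-sample allele-index lists like the gts arrays in the Athena results below):

def alt_allele_freq(genotypes):
    """Alternate-allele frequency at one locus from per-sample gts lists."""
    alleles = [a for gt in genotypes for a in gt if a is not None]
    return sum(1 for a in alleles if a != 0) / len(alleles)

# dragen_af = alt_allele_freq(dragen_gts)      # from an Athena query result
# original_af = alt_allele_freq(original_gts)  # from the 1kGP integrated callset
# A small |dragen_af - original_af| at every AISNP would argue against retraining.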

arvkevi commented 2 years ago

I tried a couple of queries with what I thought would be the minimum fields needed for comparison. However, I had limited success. Here's what I did:

  1. Lifted over the AISNPs in this file from hg19 to hg38.
  2. Built this stack on AWS (a prerequisite for querying Dragen via Athena).
  3. Executed the queries below:
SELECT 
  * 
FROM 
  var_nested 
WHERE 
  (chrom, pos) IN (
    ('chr1', 101709563), 
    ('chr1', 151122489), 
    ('chr1', 159174683), 
    ('chr2', 7968275), 
    ('chr2', 17362568), 
    ('chr2', 17901485), 
    ('chr2', 109513601), 
    ('chr2', 109579738), 
    ('chr2', 136707982), 
    ('chr2', 158667217), 
    ('chr3', 121459589), 
    ('chr4', 38815502), 
    ('chr4', 100239319), 
    ('chr4', 100244319), 
    ('chr4', 105375423), 
    ('chr5', 33951693), 
    ('chr5', 170202984), 
    ('chr5', 6845035), 
    ('chr6', 136482727), 
    ('chr6', 90518278), 
    ('chr7', 28172586), 
    ('chr8', 31896592), 
    ('chr8', 110602317), 
    ('chr8', 122124302), 
    ('chr8', 145639681), 
    ('chr9', 127267689), 
    ('chr10', 94921065), 
    ('chr11', 61597212), 
    ('chr11', 113296286), 
    ('chr12', 112211833), 
    ('chr12', 112241766), 
    ('chr13', 34847737), 
    ('chr13', 41715282), 
    ('chr13', 42579985), 
    ('chr13', 49070512), 
    ('chr13', 111827167), 
    ('chr14', 99375321), 
    ('chr15', 28197037), 
    ('chr15', 28365618), 
    ('chr15', 36220035), 
    ('chr15', 45152371), 
    ('chr15', 48426484), 
    ('chr16', 89730827), 
    ('chr17', 40658533), 
    ('chr17', 41056245), 
    ('chr17', 48726132), 
    ('chr17', 53568884), 
    ('chr17', 62987151), 
    ('chr18', 35277622), 
    ('chr18', 40488279), 
    ('chr18', 67578931), 
    ('chr18', 67867663), 
    ('chr19', 4077096), 
    ('chr20', 62159504), 
    ('chr22', 41697338)
  )

This returned two records (note that I've truncated the gts field here):

#   variant_id          pos        ref  alt  samples                          chrom
1   chr5:G:A:170202984  170202984  G    A    [{id=HG02490, gts=[0, 1]}, ...]  chr5
2   chr8:G:C:31896592   31896592   G    C    [{id=NA19374, gts=[0, 1]}, ...]  chr8

Then I thought to try querying by rsid, which didn't need a liftover...

SELECT 
  chrom, 
  pos, 
  "gt.alleles" 
FROM 
  var_partby_samples 
WHERE 
  rsid IN (
    'rs3737576', 'rs7554936', 'rs2814778', 
    'rs798443', 'rs1876482', 'rs1834619', 
    'rs3827760', 'rs260690', 'rs6754311', 
    'rs10497191', 'rs12498138', 'rs4833103', 
    'rs1229984', 'rs3811801', 'rs7657799', 
    'rs16891982', 'rs7722456', 'rs870347', 
    'rs3823159', 'rs192655', 'rs917115', 
    'rs1462906', 'rs6990312', 'rs2196051', 
    'rs1871534', 'rs3814134', 'rs4918664', 
    'rs174570', 'rs1079597', 'rs2238151', 
    'rs671', 'rs7997709', 'rs1572018', 
    'rs2166624', 'rs7326934', 'rs9522149', 
    'rs200354', 'rs1800414', 'rs12913832', 
    'rs12439433', 'rs735480', 'rs1426654', 
    'rs459920', 'rs4411548', 'rs2593595', 
    'rs17642714', 'rs4471745', 'rs11652805', 
    'rs2042762', 'rs7226659', 'rs3916235', 
    'rs4891825', 'rs7251928', 'rs310644', 
    'rs2024566' 
  )

But this query returned 0 records...

Can you think of any way to modify these queries so they return more results? Depending on how you want to implement ezancestry, I still think you could probably use the pretrained models as-is.

dbrami commented 2 years ago

Hi Kevin, I always look forward to our exchanges :) I have not yet tried what you described above, but I doubt I'll be able to do any better. Your conclusion to use the pre-trained models is sounding more and more like the path forward. That being said, I got a response from my Illumina colleagues this morning:

_"s3://1000genomes-dragen-3.7.6/data/cohorts/gvcf-genotyper-dragen-3.7.6/hg38/3202-samples-cohort/

The folder contains the results of gVCF Genotyping (joint calling) on the entire cohort of 3,202 samples in the NYGC 1KGP data set. The analysis was performed using DRAGEN v3.7.6 and our hg38-graph-based reference hash table. The results are broken up into per-chromosome multisample VCFs (e.g., 3202_samples_cohort_ggchr1.vcf.gz)."

Does this make accessing the needed data for training updated models easier?

arvkevi commented 2 years ago

Hey Daniel, I was able to query Athena; I'll compare the genotypes this weekend!

SELECT * FROM var_nested
WHERE (chrom, pos) IN 
(('chr1', 101244007),
 ('chr1', 151150013),
 ('chr1', 159204893),
 ('chr2', 7828144),
 ('chr2', 17181301),
 ('chr2', 17720218),
 ('chr2', 108897145),
 ('chr2', 108963282),
 ('chr2', 135950412),
 ('chr2', 157810705),
 ('chr3', 121740742),
 ('chr4', 38813881),
 ('chr4', 99318162),
 ('chr4', 99323162),
 ('chr4', 104454266),
 ('chr5', 33951588),
 ('chr5', 170775980),
 ('chr5', 6844922),
 ('chr6', 136161589),
 ('chr6', 89808559),
 ('chr7', 28132967),
 ('chr8', 32039076),
 ('chr8', 109590088),
 ('chr8', 121112062),
 ('chr8', 144414297),
 ('chr9', 124505410),
 ('chr10', 93161308),
 ('chr11', 61829740),
 ('chr11', 113425564),
 ('chr12', 111774029),
 ('chr12', 111803962),
 ('chr13', 34273600),
 ('chr13', 41141146),
 ('chr13', 42005849),
 ('chr13', 48496376),
 ('chr13', 111174820),
 ('chr14', 98908984),
 ('chr15', 27951891),
 ('chr15', 28120472),
 ('chr15', 35927834),
 ('chr15', 44860173),
 ('chr15', 48134287),
 ('chr16', 89664419),
 ('chr17', 42506515),
 ('chr17', 42904228),
 ('chr17', 50648771),
 ('chr17', 55491523),
 ('chr17', 64991033),
 ('chr18', 37697659),
 ('chr18', 42908314),
 ('chr18', 69911695),
 ('chr18', 70200427),
 ('chr19', 4077098),
 ('chr20', 63528151),
 ('chr22', 41301334))

dbrami commented 2 years ago

Thanks for the update! I hope this exploration opens up some exciting options, for either this project or others you are working on :) I was going to say it looks like a simpler query, but in fact it is not, as you had to go from using rsIDs to the hg38 coordinates.

dbrami commented 2 years ago

Hi Kevin, Did you have a chance to compare the genotypes? Is there a difference in the frequencies?

arvkevi commented 2 years ago

Hi @dbrami -- sorry for the delayed response, I came back to look at this and am finally getting somewhere. See this notebook on the kevin/dragen branch for initial results (and some questions at the end).

dbrami commented 2 years ago

Hi Kevin, I'm still waiting to hear back from my former Illumina colleagues to confirm, but here's what I believe regarding the questions at the end of your notebook analysis:

  1. Yes, the genomes in the 3.7.6 analysis are indeed phased. See the DRAGEN manual for the phasing (PS) tag, and the end of this response for an excerpt from a random VCF file.
  2. I did not check, but I believe 0|0 genotypes are simply not stored in the VCF (hence the NaNs), though this is probably different in the gVCF files.
  3. It is interesting that some variant calls are genuinely different between the original and DRAGEN callsets. That said, I would rely on the DRAGEN 3.7 ones only...

Let me know if there's a next step, i.e., if updating the models is warranted at some point :)

(dnafinger) (base) ➜ VCFs head -n 2000 HG01082.hard-filtered.vcf | grep ':PS' | head

chrM 146 . T C . PASS DP=9298;MQ=219.97;LOD=32665.90;FractionInformativeReads=0.991 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:32665.90:0,9212:1.000:0,3975:0,5237:9212:0,0,3775,5437:0,0,4154,5058:146
chrM 150 . T C . PASS DP=9106;MQ=221.57;LOD=31053.49;FractionInformativeReads=0.991 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:31053.49:0,9024:1.000:0,3886:0,5138:9024:0,0,3660,5364:0,0,4030,4994:146
chrM 153 . A G . PASS DP=8984;MQ=223.60;LOD=31644.45;FractionInformativeReads=0.990 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:31644.45:0,8897:1.000:0,3816:0,5081:8897:0,0,3640,5257:0,0,3925,4972:146
chrM 195 . C T . PASS DP=8701;MQ=239.17;LOD=29563.56;FractionInformativeReads=0.988 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:29563.56:1,8596:1.000:1,3485:0,5111:8597:0,1,3390,5206:0,1,3403,5193:146
chrM 410 . A T . PASS DP=5423;MQ=245.24;LOD=20678.27;FractionInformativeReads=0.991 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:20678.27:0,5375:1.000:0,975:0,4400:5375:0,0,1990,3385:0,0,2632,2743:410
chrM 501 . G A . PASS DP=6514;MQ=248.01;LOD=105.66;FractionInformativeReads=0.980 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:105.66:6276,108:0.017:1137,21:5139,87:6384:3538,2738,59,49:3513,2763,64,44:410
chrM 515 . GCA G . PASS DP=6949;MQ=243.03;LOD=21011.27;FractionInformativeReads=0.943 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:21011.27:63,6488:0.990:10,1398:53,5090:6551:52,11,3760,2728:7,56,3698,2790:410
chrM 541 . T A . base_quality DP=7509;MQ=232.66;LOD=44.62;FractionInformativeReads=0.961 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:44.62:6828,390:0.054:1748,138:5080,252:7218:3905,2923,390,0:4087,2741,138,252:410
chrM 659 . G A . PASS DP=9200;MQ=179.98;LOD=10.74;FractionInformativeReads=0.977 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:10.74:8966,22:0.002:3544,9:5422,13:8988:5605,3361,13,9:5027,3939,14,8:659
chrM 665 . A G . PASS DP=9260;MQ=179.73;LOD=31240.23;FractionInformativeReads=0.979 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:31240.23:0,9061:1.000:0,3628:0,5433:9061:0,0,5652,3409:0,0,5077,3984:659
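
As a quick programmatic cross-check of the phasing question (a sketch, not from the thread; pysam exposes the phase separator directly):

import pysam

vcf = pysam.VariantFile("HG01082.hard-filtered.vcf")  # the file excerpted above
record = next(vcf)
for sample, call in record.samples.items():
    # call.phased is True when the genotype was written with '|' (e.g. 0|1)
    print(sample, call["GT"], "phased:", call.phased)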

arvkevi commented 2 years ago

Hi @dbrami, I will assume that the NaNs are 0|0's, train a model on the Dragen data, and then compare the performance of the two models with 5-fold CV. I'll try to do that this weekend.
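
A sketch of that plan (names and encoding are illustrative; RandomForestClassifier stands in for whatever classifier ezancestry actually trains):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_log_loss(genotypes: pd.DataFrame, labels: pd.Series) -> float:
    """Mean 5-fold cross-validated log loss for one callset."""
    X = genotypes.fillna(0)  # treat NaN as an unrecorded 0|0 (zero alt alleles)
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X, labels, cv=5, scoring="neg_log_loss",
    )
    return -scores.mean()

# print(cv_log_loss(kgp_aisnps, superpop_labels))     # original 1kGP callset
# print(cv_log_loss(dragen_aisnps, superpop_labels))  # Dragen reanalysis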

arvkevi commented 2 years ago

Hey @dbrami I've updated the analysis here. The 5-fold CV performance (log loss) is better when using the 1kG data than the Dragen data. It looks like the Dragen SNPs had less variation at several sites.

You should be able to run the notebook; I added the annotated Dragen AISNPs data file. I'll merge the analysis: https://github.com/arvkevi/ezancestry/pull/59