dbrami opened this issue 2 years ago
Hi Daniel, Thanks for checking out the repo. There are two things to consider when training a new model from this Dragen dataset.
ezancestry just assumes hg19 across training data, AISNPs, and any predictions made against the ezancestry models. I should call this out more explicitly in the documentation.

Assuming 1 and 2 are possible, I see a path forward: I'd need to rewrite this function to parse the Dragen 1kG data. Which begs the question: should it be even more flexible and handle any training data?
Let me know what you think; I'm curious to hear your thoughts on how to solve number 1 above. Number 2 should be straightforward enough.
Kevin
Hi Kevin, Thanks for getting back to me so promptly! Before you rack your brain too much, I am waiting to hear back from my former colleagues at Illumina. I'm hoping they will be able to point me to the right files needed for this in the S3 bucket or perhaps to some other repo. Will get back to you as soon as possible. Cheers, Daniel
Hi Kevin, I just heard back from them. They are asking what kind of files we would want for our analysis. I'm at a loss here. My naive response would be a joint genotyping file inclusive of all 3,202 genomes, perhaps broken down by chromosome, in BED format, but I'm not sure that's correct. Can you guide me?
The best file type would be a multi-sample VCF. Each record in the VCF would have genotypes listed for each of the 3,202 genomes at that locus. Breaking these files out by chromosome would work well. Do you know if they will host the files?
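For reference, a record in a multi-sample VCF carries one genotype column per sample, in the order the samples appear in the `#CHROM` header line. A minimal, dependency-free sketch of pulling per-sample genotypes out of one such record (the record and sample IDs here are invented for illustration; the locus is one of the AISNPs discussed later in this thread):

```python
# Parse one multi-sample VCF data line into {sample: genotype}.
# The header and record below are made up for illustration.

def parse_record(header_line, data_line):
    """Map each sample named in the #CHROM header to its GT value."""
    samples = header_line.rstrip("\n").split("\t")[9:]   # columns after FORMAT
    fields = data_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")                      # e.g. ["GT", "DP"]
    gt_index = fmt_keys.index("GT")
    return {sample: value.split(":")[gt_index]
            for sample, value in zip(samples, fields[9:])}

header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097"
record = "chr1\t101709563\trs3737576\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:30\t1/1:28"
print(parse_record(header, record))  # {'HG00096': '0/1', 'HG00097': '1/1'}
```

In a real per-chromosome cohort file there would be 3,202 such sample columns per record rather than two.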
However, going back to this comment:

> Would love to hear your thoughts on cost/benefit of this approach.
If I had the files above, before going through the effort of training a model on these reanalyzed genomes, I would compare the genotypes at the 128 AISNP locations between the original call set and the reanalyzed call set. If there is little variance between calls in the original and the calls in the reanalyzed calls at those AISNP loci, then I don't see any benefit to training a model on the new data.
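The comparison described here boils down to counting matching genotype calls per (sample, locus) pair across the two call sets. A hypothetical sketch, with small invented dicts standing in for the original and reanalyzed calls:

```python
# Genotype concordance between two call sets at the same AISNP loci.
# Keys are (sample_id, rsID); values are genotype strings. Data is invented.

def concordance(calls_a, calls_b):
    """Fraction of shared (sample, locus) keys with identical genotypes."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return 0.0
    matches = sum(calls_a[k] == calls_b[k] for k in shared)
    return matches / len(shared)

original = {("HG00096", "rs3737576"): "0/1", ("HG00096", "rs2814778"): "1/1"}
reanalyzed = {("HG00096", "rs3737576"): "0/1", ("HG00096", "rs2814778"): "0/1"}
print(concordance(original, reanalyzed))  # 0.5
```

If concordance across the 128 AISNP loci were near 1.0, retraining would add little, which is the point being made above.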
An interesting quasi-experiment would be to run the ezancestry predict command on (at least a subset of) the reanalyzed VCFs and check the performance of the ancestry labels:
ezancestry predict /path/to/some/HG0000*.vcf.gz
If the performance is good, then you might not need to retrain a model at all.
Hi Kevin, I'm thinking about your last comment. The new data can easily be queried using AWS Athena for a few bucks, as per this article: Perform interactive queries on your genomics data using Amazon Athena or Amazon Redshift (Part 2). What fields should I be retrieving with a query, and how would I compare the results with the values used to build the model? (I'm assuming I'm comparing allele frequencies, not running a test VCF through the pipeline.) It should be pretty straightforward to find the corresponding rsIDs for the reference genome used in the analysis, and use these in the AISNP query. Cheers, Daniel
I tried a couple of queries with what I thought would be the minimum fields needed for comparison. However, I had limited success. Here's what I did:
SELECT
*
FROM
var_nested
WHERE
(chrom, pos) IN (
('chr1', 101709563),
('chr1', 151122489),
('chr1', 159174683),
('chr2', 7968275),
('chr2', 17362568),
('chr2', 17901485),
('chr2', 109513601),
('chr2', 109579738),
('chr2', 136707982),
('chr2', 158667217),
('chr3', 121459589),
('chr4', 38815502),
('chr4', 100239319),
('chr4', 100244319),
('chr4', 105375423),
('chr5', 33951693),
('chr5', 170202984),
('chr5', 6845035),
('chr6', 136482727),
('chr6', 90518278),
('chr7', 28172586),
('chr8', 31896592),
('chr8', 110602317),
('chr8', 122124302),
('chr8', 145639681),
('chr9', 127267689),
('chr10', 94921065),
('chr11', 61597212),
('chr11', 113296286),
('chr12', 112211833),
('chr12', 112241766),
('chr13', 34847737),
('chr13', 41715282),
('chr13', 42579985),
('chr13', 49070512),
('chr13', 111827167),
('chr14', 99375321),
('chr15', 28197037),
('chr15', 28365618),
('chr15', 36220035),
('chr15', 45152371),
('chr15', 48426484),
('chr16', 89730827),
('chr17', 40658533),
('chr17', 41056245),
('chr17', 48726132),
('chr17', 53568884),
('chr17', 62987151),
('chr18', 35277622),
('chr18', 40488279),
('chr18', 67578931),
('chr18', 67867663),
('chr19', 4077096),
('chr20', 62159504),
('chr22', 41697338)
)
This returned two records (note that I've truncated the gts field here):
# variant_id pos ref alt samples chrom
1 chr5:G:A:170202984 170202984 G A [{id=HG02490, gts=[0, 1]}, ...] chr5
2 chr8:G:C:31896592 31896592 G C [{id=NA19374, gts=[0, 1]}, ...] chr8
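As a sanity check, the nested samples field in records like these can be reduced to an alternate-allele frequency. A sketch assuming each gts list holds allele codes as in the truncated output above (0 = ref, 1 = alt); the sample entries are invented:

```python
# Alternate-allele frequency from Athena's nested samples structure.
# Entries are invented; gts lists code 0 = ref allele, 1 = alt allele.

def alt_allele_freq(samples):
    alleles = [a for s in samples for a in s["gts"]]
    return sum(a == 1 for a in alleles) / len(alleles)

samples = [
    {"id": "HG02490", "gts": [0, 1]},
    {"id": "NA19374", "gts": [0, 0]},
    {"id": "HG00096", "gts": [1, 1]},
]
print(alt_allele_freq(samples))  # 0.5
```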
Then I thought to try querying by rsID, which didn't need a liftover...
SELECT
chrom,
pos,
"gt.alleles"
FROM
var_partby_samples
WHERE
rsid IN (
'rs3737576', 'rs7554936', 'rs2814778',
'rs798443', 'rs1876482', 'rs1834619',
'rs3827760', 'rs260690', 'rs6754311',
'rs10497191', 'rs12498138', 'rs4833103',
'rs1229984', 'rs3811801', 'rs7657799',
'rs16891982', 'rs7722456', 'rs870347',
'rs3823159', 'rs192655', 'rs917115',
'rs1462906', 'rs6990312', 'rs2196051',
'rs1871534', 'rs3814134', 'rs4918664',
'rs174570', 'rs1079597', 'rs2238151',
'rs671', 'rs7997709', 'rs1572018',
'rs2166624', 'rs7326934', 'rs9522149',
'rs200354', 'rs1800414', 'rs12913832',
'rs12439433', 'rs735480', 'rs1426654',
'rs459920', 'rs4411548', 'rs2593595',
'rs17642714', 'rs4471745', 'rs11652805',
'rs2042762', 'rs7226659', 'rs3916235',
'rs4891825', 'rs7251928', 'rs310644',
'rs2024566'
)
But this query returned 0 records...
Anything you can think of to modify these queries so they return more results?

Depending on how you want to implement ezancestry, I still think you could probably use the pretrained models as-is.
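As an aside, IN-lists like the ones in these queries are easier to generate programmatically than by hand. A small sketch that builds the Athena query text from a list of (chrom, pos) pairs; the three coordinates are the first few hg19 AISNP positions from the query above, and the table name is the one used in this thread:

```python
# Build an Athena-style (chrom, pos) IN-list query from coordinate pairs.
# Coordinates are the first few hg19 AISNP positions quoted in this thread.

def in_list_query(table, coords):
    pairs = ",\n    ".join(f"('{c}', {p})" for c, p in coords)
    return f"SELECT * FROM {table}\nWHERE (chrom, pos) IN (\n    {pairs}\n)"

aisnps = [("chr1", 101709563), ("chr1", 151122489), ("chr2", 7968275)]
query = in_list_query("var_nested", aisnps)
print(query)
```

Keeping the rsID → coordinate mapping in one place also makes it easy to regenerate the query after a liftover to hg38 coordinates.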
Hi Kevin, I always look forward to our exchanges :) I have not tried what you have described above yet but I doubt that I will be able to do any better. Your conclusion of using the pre-trained models is sounding more and more like the path forward. That being said, I got a response from my Illumina colleagues this morning:
> "s3://1000genomes-dragen-3.7.6/data/cohorts/gvcf-genotyper-dragen-3.7.6/hg38/3202-samples-cohort/
>
> The folder contains the results of gVCF Genotyping (joint calling) on the entire cohort of 3,202 samples in the NYGC 1KGP data set. The analysis was performed using DRAGEN v3.7.6 and our hg38-graph-based reference hash table. The results are broken up into per-chromosome multisample VCFs (e.g., 3202_samples_cohort_ggchr1.vcf.gz)."
Does this make accessing the needed data for training updated models easier?
Hey Daniel, I was able to query Athena, will compare the genotypes this weekend!
SELECT * FROM var_nested
WHERE (chrom, pos) IN
(('chr1', 101244007),
('chr1', 151150013),
('chr1', 159204893),
('chr2', 7828144),
('chr2', 17181301),
('chr2', 17720218),
('chr2', 108897145),
('chr2', 108963282),
('chr2', 135950412),
('chr2', 157810705),
('chr3', 121740742),
('chr4', 38813881),
('chr4', 99318162),
('chr4', 99323162),
('chr4', 104454266),
('chr5', 33951588),
('chr5', 170775980),
('chr5', 6844922),
('chr6', 136161589),
('chr6', 89808559),
('chr7', 28132967),
('chr8', 32039076),
('chr8', 109590088),
('chr8', 121112062),
('chr8', 144414297),
('chr9', 124505410),
('chr10', 93161308),
('chr11', 61829740),
('chr11', 113425564),
('chr12', 111774029),
('chr12', 111803962),
('chr13', 34273600),
('chr13', 41141146),
('chr13', 42005849),
('chr13', 48496376),
('chr13', 111174820),
('chr14', 98908984),
('chr15', 27951891),
('chr15', 28120472),
('chr15', 35927834),
('chr15', 44860173),
('chr15', 48134287),
('chr16', 89664419),
('chr17', 42506515),
('chr17', 42904228),
('chr17', 50648771),
('chr17', 55491523),
('chr17', 64991033),
('chr18', 37697659),
('chr18', 42908314),
('chr18', 69911695),
('chr18', 70200427),
('chr19', 4077098),
('chr20', 63528151),
('chr22', 41301334))
Thanks for the update! I hope this exploration opens up some exciting options, either for this project or for others you are working on :) I was going to say it looks like a simpler query, but in fact it is not, as you had to go from using rsIDs to the hg38 coordinates...
Hi Kevin, Did you have a chance to compare the genotypes? Is there a difference in the frequencies?
Hi @dbrami -- sorry for the delayed response, I came back to look at this and am finally getting somewhere.
See this notebook on the `kevin/dragen` branch for initial results (and some questions at the end).
Hi Kevin, I'm still waiting to hear back from my former Illumina colleagues to confirm but here's what I believe regarding your two last questions in your notebook analysis:
Let me know if there's a next step, i.e., whether updating the models is warranted at some point :)
```
(dnafinger) (base) ➜ VCFs head -n 2000 HG01082.hard-filtered.vcf | grep ':PS' | head
chrM 146 . T C . PASS DP=9298;MQ=219.97;LOD=32665.90;FractionInformativeReads=0.991 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:32665.90:0,9212:1.000:0,3975:0,5237:9212:0,0,3775,5437:0,0,4154,5058:146
chrM 150 . T C . PASS DP=9106;MQ=221.57;LOD=31053.49;FractionInformativeReads=0.991 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:31053.49:0,9024:1.000:0,3886:0,5138:9024:0,0,3660,5364:0,0,4030,4994:146
chrM 153 . A G . PASS DP=8984;MQ=223.60;LOD=31644.45;FractionInformativeReads=0.990 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:31644.45:0,8897:1.000:0,3816:0,5081:8897:0,0,3640,5257:0,0,3925,4972:146
chrM 195 . C T . PASS DP=8701;MQ=239.17;LOD=29563.56;FractionInformativeReads=0.988 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:29563.56:1,8596:1.000:1,3485:0,5111:8597:0,1,3390,5206:0,1,3403,5193:146
chrM 410 . A T . PASS DP=5423;MQ=245.24;LOD=20678.27;FractionInformativeReads=0.991 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:20678.27:0,5375:1.000:0,975:0,4400:5375:0,0,1990,3385:0,0,2632,2743:410
chrM 501 . G A . PASS DP=6514;MQ=248.01;LOD=105.66;FractionInformativeReads=0.980 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:105.66:6276,108:0.017:1137,21:5139,87:6384:3538,2738,59,49:3513,2763,64,44:410
chrM 515 . GCA G . PASS DP=6949;MQ=243.03;LOD=21011.27;FractionInformativeReads=0.943 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:21011.27:63,6488:0.990:10,1398:53,5090:6551:52,11,3760,2728:7,56,3698,2790:410
chrM 541 . T A . base_quality DP=7509;MQ=232.66;LOD=44.62;FractionInformativeReads=0.961 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:44.62:6828,390:0.054:1748,138:5080,252:7218:3905,2923,390,0:4087,2741,138,252:410
chrM 659 . G A . PASS DP=9200;MQ=179.98;LOD=10.74;FractionInformativeReads=0.977 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:10.74:8966,22:0.002:3544,9:5422,13:8988:5605,3361,13,9:5027,3939,14,8:659
chrM 665 . A G . PASS DP=9260;MQ=179.73;LOD=31240.23;FractionInformativeReads=0.979 GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB:PS 0|1:31240.23:0,9061:1.000:0,3628:0,5433:9061:0,0,5652,3409:0,0,5077,3984:659
```
Hi @dbrami I will assume that NaNs are 0|0's, train a model on the Dragen data, and then compare the performance of the two models with 5-fold CV. I'll try to do that this weekend.
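The NaN-as-0|0 assumption amounts to imputing homozygous reference for any sample that has no call at a locus. A sketch of that fill step in plain Python (a pandas `fillna` on a genotype matrix would do the same thing; the sample IDs and calls here are invented):

```python
# Treat missing genotypes as homozygous reference (0|0) before training.
# Sample IDs and the observed calls are invented for illustration.

def fill_missing(samples, calls, ref_call="0|0"):
    """Return a complete genotype dict, imputing ref_call for absent samples."""
    return {s: calls.get(s, ref_call) for s in samples}

cohort = ["HG00096", "HG00097", "HG00099"]
observed = {"HG00096": "0|1"}  # only one sample had a call at this locus
print(fill_missing(cohort, observed))
# {'HG00096': '0|1', 'HG00097': '0|0', 'HG00099': '0|0'}
```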
Hey @dbrami I've updated the analysis here. The 5-fold CV performance (log loss) is better when using the 1kG data compared to Dragen. It looks like the Dragen SNPs had less variation at several sites.
You should be able to run the notebook, I added the annotated Dragen AISNPs data file. I'll merge the analysis. https://github.com/arvkevi/ezancestry/pull/59
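For context, the log loss used to compare the two models penalizes confident wrong ancestry calls heavily, so a model whose probabilities are better calibrated scores lower. A self-contained sketch of the metric on invented predicted probabilities (scikit-learn's `log_loss` computes the same quantity):

```python
import math

# Multiclass log loss on invented ancestry predictions (lower is better).

def log_loss(y_true, probs, eps=1e-15):
    """y_true: true class indices; probs: per-sample probability lists."""
    total = 0.0
    for label, p in zip(y_true, probs):
        total -= math.log(max(p[label], eps))  # clip to avoid log(0)
    return total / len(y_true)

y_true = [0, 1, 2]                   # e.g. three superpopulation labels
probs = [[0.9, 0.05, 0.05],          # confident and correct
         [0.2, 0.7, 0.1],           # correct
         [0.4, 0.5, 0.1]]           # wrong: true class only got 0.1
print(round(log_loss(y_true, probs), 3))  # 0.922
```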
Hi, I'm exploring adding an ethnic-background check to a sample-QC pipeline I'm working on, and this tool seems to check all the boxes. I was involved in making the re-analysis of the latest addition of samples for the 1000 Genomes Project, using Illumina DRAGEN, available on AWS:
DRAGEN reanalysis of the 1000 Genomes Dataset now available on the Registry of Open Data
Although I can't find a complete aggregate BED file similar to the ones pulled by your fetch script, do you think your pipeline could easily be modified to create a model using this updated data set? Would love to hear your thoughts on the cost/benefit of this approach.
Thanks, Daniel Brami