Closed matren395 closed 1 year ago
Also logging this here: Alex has clarified that the test dataset should consist primarily of common variants (the bulk of the dataset should be common variants, and our edge cases can be a smaller proportion of it). Since he wants mostly common variants, we should include rare variants among our edge cases.
https://github.com/broadinstitute/tgg_methods/pull/63#issuecomment-1380798407
Do they have an AF threshold for what they define as common?
apologies for the confusion, I clarified with Alex, and he meant common variant types, not common based on allele frequency -- our current approach should be exactly what his team needs
slack thread: https://the-tgg.slack.com/archives/C0479HSSVQU/p1673619549525149?thread_ts=1673588983.192039&cid=C0479HSSVQU
Also, the v3 HT has been repartitioned! https://github.com/broadinstitute/gnomad_production/issues/568#issuecomment-1382002057
Just a general note for the future: it would be helpful to use more descriptive names for both branches and scripts, so that someone looking at the branch or script has a general idea of its content. For branch names, it's useful to include your username or initials to help organize the branches and make it easier for multiple users of the same repo to find their own (so something like "matren395/vrs_test" would be good), but personal info typically won't be in the script name itself (something like "create_test_dataset.py" would be good).
Hopefully this round of commits is the last one and this is ready to merge?
Added the script used to construct the test dataset for testing the VRS script. The dataset includes 100k variants: approximately 50k randomly sampled variants, 10k additional indels, 10k additional variants with long reference alleles (>3 bp), 10k additional variants with long variant alleles (>3 bp), 10k additional sex chromosome variants (9k X, 1k Y), and 10k very multiallelic variants. NOTE: Additional erroneous/intentionally wrong variants to test the VRS script are not included yet, but are hopefully coming soon!
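For anyone reading along, here is a rough sketch of how the composition above could be assembled. It is not the script in this PR: it uses plain `(contig, pos, ref, alts)` tuples instead of the Hail Table, and the names `TARGETS`, `categories`, `build_test_set`, and `summarize` are made up for illustration. It also assumes that "long variant" means an alternate allele longer than 3 bp and that "very multiallelic" means at least 6 alternate alleles; that threshold is a placeholder, not something stated in this thread.

```python
import random
from collections import Counter

# Target composition taken from the PR description (variant counts).
TARGETS = {
    "random": 50_000,
    "indel": 10_000,
    "long_ref": 10_000,
    "long_alt": 10_000,
    "chrX": 9_000,
    "chrY": 1_000,
    "multiallelic": 10_000,
}


def categories(variant, multiallelic_min_alts=6):
    """Return the edge-case categories a (contig, pos, ref, alts) record falls into.

    multiallelic_min_alts is a hypothetical cutoff for "very multiallelic".
    """
    contig, _pos, ref, alts = variant
    cats = set()
    if any(len(alt) != len(ref) for alt in alts):
        cats.add("indel")
    if len(ref) > 3:
        cats.add("long_ref")
    if any(len(alt) > 3 for alt in alts):
        cats.add("long_alt")
    if contig in ("chrX", "chrY"):
        cats.add(contig)
    if len(alts) >= multiallelic_min_alts:
        cats.add("multiallelic")
    return cats


def build_test_set(variants, targets, seed=0):
    """Draw the random stratum first, then top up each edge-case stratum
    from variants that have not been selected yet ("additional" variants)."""
    rng = random.Random(seed)
    pool = list(variants)
    chosen = {}  # index into pool -> stratum label

    n_random = min(targets.get("random", 0), len(pool))
    for i in rng.sample(range(len(pool)), n_random):
        chosen[i] = "random"

    for name, target in targets.items():
        if name == "random":
            continue
        candidates = [
            i for i, v in enumerate(pool)
            if i not in chosen and name in categories(v)
        ]
        # If a stratum has too few candidates, take all of them.
        for i in rng.sample(candidates, min(target, len(candidates))):
            chosen[i] = name

    return [(pool[i], label) for i, label in sorted(chosen.items())]


def summarize(selection):
    """Count how many selected variants landed in each stratum."""
    return Counter(label for _variant, label in selection)
```

Calling `summarize(build_test_set(variants, TARGETS))` on a large enough pool should give back the counts in `TARGETS`. Drawing the random stratum first and excluding already-selected variants from later strata is one way to read "additional" in the description; a variant that matches several categories is only counted under the first stratum that picks it.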