Added Dataset Construction Script and Dataset

broadinstitute / tgg_methods

Repo for miscellaneous methods developed by the methods group that don't fit anywhere else

MIT License

Added Dataset Construction Script and Dataset #63

Closed matren395 closed 1 year ago

matren395 commented 1 year ago

Added the script used to construct the test dataset for the VRS script. The dataset includes 100k variants: approximately 50k randomly sampled variants, 10k additional indels, 10k additional variants with a long reference allele (>3 bp), 10k additional variants with a long variant allele (>3 bp), 10k additional sex-chromosome variants (9k X, 1k Y), and 10k additional highly multiallelic variants. NOTE: Intentionally erroneous variants for testing the VRS script are not included yet, but are hopefully coming soon!
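
For readers skimming the thread, the composition described above can be sketched roughly as follows. This is a plain-Python illustration only, not the script in this PR. The `Variant` record, the `categories` and `build_test_set` helpers, the `chrX`/`chrY` contig names, the reading of "long variant" as a long alternate allele, and the `multiallelic_min` threshold for "very multiallelic" are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not the PR's data model)."""
    contig: str
    pos: int
    ref: str
    alt: str
    n_alt_alleles: int = 1  # alt alleles observed at the original site


# Quotas taken from the PR description; the category names are made up here.
QUOTAS = {
    "indel": 10_000,
    "long_ref": 10_000,
    "long_alt": 10_000,
    "chrX": 9_000,
    "chrY": 1_000,
    "multiallelic": 10_000,
}
RANDOM_QUOTA = 50_000


def categories(v, multiallelic_min=6):
    """Return the edge-case categories a variant falls into.

    multiallelic_min is an assumed cutoff; the thread does not give one.
    """
    cats = set()
    if len(v.ref) != len(v.alt):
        cats.add("indel")
    if len(v.ref) > 3:
        cats.add("long_ref")
    if len(v.alt) > 3:  # assumes "long variant" means a long alt allele
        cats.add("long_alt")
    if v.contig in ("chrX", "chrY"):
        cats.add(v.contig)
    if v.n_alt_alleles >= multiallelic_min:
        cats.add("multiallelic")
    return cats


def build_test_set(variants, quotas=QUOTAS, random_quota=RANDOM_QUOTA, seed=0):
    """Draw a random base sample, then top up each edge-case category."""
    rng = random.Random(seed)
    pool = list(variants)
    chosen = set(rng.sample(pool, min(random_quota, len(pool))))
    for cat, quota in quotas.items():
        # "Additional" variants: only those not already selected.
        candidates = [v for v in pool if v not in chosen and cat in categories(v)]
        chosen.update(rng.sample(candidates, min(quota, len(candidates))))
    return chosen
```

Because each top-up draws only from variants not yet selected, the edge-case quotas add to the random base sample instead of overlapping with it, which is how the counts above sum to roughly 100k.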

ch-kr commented 1 year ago

also logging this here -- Alex has clarified that the test dataset should include primarily common variants (the bulk of the dataset should be common variants, and our edge cases can be a smaller proportion of the dataset). also, because he wants mostly common variants, we should include rare variants in our edge cases

klaricch commented 1 year ago

https://github.com/broadinstitute/tgg_methods/pull/63#issuecomment-1380798407

Do they have an AF threshold for what they define as common?

ch-kr commented 1 year ago

apologies for the confusion, I clarified with Alex, and he meant common variant types, not common based on allele frequency -- our current approach should be exactly what his team needs

slack thread: https://the-tgg.slack.com/archives/C0479HSSVQU/p1673619549525149?thread_ts=1673588983.192039&cid=C0479HSSVQU

ch-kr commented 1 year ago

Also, the v3 HT has been repartitioned! https://github.com/broadinstitute/gnomad_production/issues/568#issuecomment-1382002057

klaricch commented 1 year ago

Just a general note for the future: it would be helpful to use more descriptive names for both branches and scripts so that someone looking at the branch/script has a general idea of the content. For branch names, it's useful to include your username or initials to help organize the branches and make it easier for multiple users of the same repo to find their own branches (something like "matren395/vrs_test" would be good). Personal info typically won't be in the script name itself (something like "create_test_dataset.py" would be good).

matren395 commented 1 year ago

Hopefully this round of commits is the last one and the PR is ready to merge?