Detailed plan for Snowflake Chapter

Come up with a plan for the Snowflake Chapter: what should go in it, and which items are necessary versus nice-to-have. An item can be ticked once it has a TODO in the document. Less important TODOs that don't get done can be moved to Future Work when finishing up the draft:

[x] Data (decide where these graphs/examples are coming from and write up):
- [x] ALSPAC data set write up.
  - [x] Background: what is the data set?
  - [x] Raw data (genotype + phenotype), ethics, sequencing type
  - [x] Creating the inputs
  - [x] Missing data
  - [x] EDA:
    - [x] Distribution of the number of SNPs per phenotype (ALSPAC)
    - [x] Distribution of the number of SNP scores within phenotypes (violin plot with some examples).
- [x] 2500 Genomes set write up
- [x] (Optional) 23andMe data set write up/athletes
- [x] (Optional) CAGI data set write up.
[x] Clustering SNPs by phenotype
- [x] Creating the input scores
  - [x] DcGO "Phenotypes" with weird combinations of phenotypes
  - [x] DcGO prediction, where the SNP is in a gene that is not expressed in the tissue.
  - [x] Effect of the number of SNPs per phenotype on the sensitivity of the final score to the FATHMM score.
    - [x] Choose a phenotype with many SNPs, randomly sample various numbers of them, and see how sensitive the results are.
- [x] Sensitivity of clustering score to background cohort
- [x] Dimensionality reduction (When is dimensionality reduction appropriate?)
  - [x] Correlation between SNPs' FATHMM scores
  - [x] Too many SNPs for a phenotype.
[x] Results
- [x] EDA Predictions
  - [x] Number of predictions per phenotype, for:
    - [x] ALSPAC (histogram)
    - [x] (Optional) CAGI (histogram)
    - [x] (Optional) Genetrainer (will just be one number since one phenotype of interest).
- [x] Validation
  - [x] Bootstrapping graph and ROC curve (showing that it doesn’t work overall)
    - [x] For ALSPAC
    - [x] (Optional) For CAGI
    - [x] (Optional) For Genetrainer
- [x] Examples of predictions (ALSPAC), e.g.
  - [x] Re-finding known things
  - [x] Show that single-SNP phenotypes get the "correct" result for people (SNPs).
  - [x] Predictions that are made using information from non-human experiments
  - [x] Predictions where you need a combination of SNPs for a trait.
  - [x] Predictions that find new SNPs in a known gene
[x] Discussion:
- [x] Linkage disequilibrium
- [x] (Optional) Phenotypes where the clustering does not follow haplotype versus phenotypes where it does

Also any setup/admin:

[x] Set up the ipynb paired with MyST Markdown (md) via jupytext.
[x] Repository:
- [x] Check if I have an existing repo or not. (I DO NOT)
- [x] Create repository: BE MINIMAL. This will be private anyway. No license (yet).
  - [x] Very basic README.
  - [x] Directory structure
[x] Check what Jan discussed and add anything relevant to the plan
[x] Update issue #41 with all the to-dos
