Come up with a plan for the Snowflake Chapter: what should be in there, and which items are necessary versus nice to have. Items can be ticked once they have a TODO in the document. The less important TODOs (those that don't get done) can be moved to Future Work when finishing up the draft:
[x] Data (decide where these graphs/examples are coming from and write up):
[x] ALSPAC data set write up.
[x] Background: what is the data set
[x] Raw data (genotype + phenotype), ethics, sequencing type
[x] Creating the inputs
[x] Missing data
[x] EDA:
[x] Distribution of number of SNPs per phenotype (ALSPAC)
[x] Distribution of SNP scores within phenotypes (violin plot with some examples).
[x] 2500 Genomes set write up
[x] (Optional) 23andMe data set write up/athletes
[x] (Optional) CAGI data set write up.
[x] Clustering SNPs by phenotype
[x] Creating the input scores
[x] DcGO "Phenotypes" with weird combinations of phenotypes
[x] DcGO prediction, where SNP is in a gene which is not expressed in the tissue.
[x] Effect of number of SNPs per phenotype on the sensitivity of the final score to the FATHMM score.
[x] Choose a phenotype with many SNPs, randomly sample various numbers of them, and see how sensitive the results are.
[x] Sensitivity of clustering score to background cohort
[x] Dimensionality reduction (When is dimensionality reduction appropriate?)
[x] Correlation between SNPs' FATHMM scores
[x] Too many SNPs for a phenotype.
[x] Results
[x] EDA Predictions
[x] Number of predictions per phenotype, for:
[x] ALSPAC (histogram)
[x] (Optional) CAGI (histogram)
[x] (Optional) Genetrainer (will just be one number since one phenotype of interest).
[x] Validation
[x] Bootstrapping graph and ROC curve (showing that it doesn’t work overall)
[x] For ALSPAC
[x] (Optional) For CAGI
[x] (Optional) For Genetrainer
[x] Examples of predictions (ALSPAC), e.g.
[x] re-finding known things
[x] Show that single-SNP phenotypes get the "correct" result for people (SNPs).
[x] Predictions that are made using information from non-human experiments
[x] Predictions where you need a combination of SNPs for a trait.
[x] Predictions that find new SNPs in a known gene
[x] Discussion:
[x] Linkage disequilibrium
[x] (Optional) Phenotypes where clustering does not follow haplotype versus phenotypes where it does
Also any setup/admin:
[x] Set up the ipynb with jupytext myst md paired.
[x] Repository:
[x] Check if I have an existing repo or not. (I DO NOT)
[x] Create repository: BE MINIMAL. This will be private anyway. No license (yet).
[x] Very basic README.
[x] Directory structure
[x] Check what Jan discussed to add things to plan