Corrections: list

[x] Abstract
- [x] Simplify line “In our cells, proteins are constantly being created and are degrading, and are accumulating or interacting to produce the phenotypes that we see at a larger scale: height, levels of enzymes in blood, diseases”
- [x] Clarify “The versions of the proteins that it is possible for an organism to produce are determined primarily by its protein coding DNA, while the selection of possible proteins that are actively produced in each cell are determined by the environment of the particular cell at each time”
[x] Chapter 2:
- [x] Move Section “2.1.2. The future computational biologists want” to later part of the chapter.
- Moved to 2.5 "Phenotype"
- [x] In the fourth para below fig 2.3, “post‑transcriptional modificationssuch as splicing”: please add the missing space between “post‑transcriptional modifications” and “such”. Also add RNA editing and RNAi as modes of post-transcriptional regulation.
- [x] In the section “2.2.1.3. “RNA makes Proteins”, a.k.a. Translation” it may be worth mentioning the landmark work of Christian Anfinsen on how the amino acid sequence of a protein acts as a “code” that precisely determines the three-dimensional structure of the protein.
- [x] In 2.2.1.4 “a different amino acid in a hormone protein could cause the protein to be expressed differently” could change it to “a different amino acid in a hormone protein could cause the protein to be expressed or function differently”
- [x] In 2.3.2 “(protein‑coding nucleotides)” to “(protein‑coding part of DNA)”; there is repeating word: “in in”, please modify it.
- [x] In 2.3.3.1. “A gene for X”, it is not clear what “the same gene can make multiple different proteins” means. Does it mean different isoforms of the protein?
- [x] In 2.3.6.2 “However, synonymous SNVs could still have an effect on highlevel traits, since different nucleotides are translated at different speeds.” Here, translation at different speeds can affect both the folding and the abundance of the protein (Kimchi-Sarfaty et al., 2007, Science). Please add a few sentences along these lines.
- [x] In many places it is not clear what is meant by the different proteins encoded by the same gene. Does this mean protein isoforms or something else?
- [x] make the motivation for your work stronger
- Did this more in the Chapter 2 and Chapter 3 summaries
- [x] in general, needs further references to relevant works
[x] Chapter 3:
- [x] There seems to be a mix of active and passive voice; please follow one mode, at least within a chapter (3.2.1).
- [x] In 3.2.1.1 “Whole genomes for different organisms can be compared to one another to give us insight about the organisms, or within an organism, individuals can be compared to understand the importance of sections of DNA for that organism.” It is not clear what the plural term “organisms” means here.
- [x] In this section there should be a clear demarcation between whole genome assembly, gene annotation and then variant calling; at present these seem mixed up, which impedes the smooth flow of information.
- Went over the signposting information at the start of each section to clearly describe where this information is, but still wanted to keep the symmetry between Chapter 2 and 3.
- [x] Possibly mention the OMIM and HGMD databases
- [x] 3.2.2.2 “Measures of mRNA abundance (i.e. gene expression data) are generally considered the best measures of translation (compared to protein abundance for example), and therefore the best data to tell us how DNA’s blueprints are being used in different scenarios”. This appears contrary to the general belief that protein abundance is in general a better measure. In situations where protein abundances are not easily measurable or trackable, mRNA expression can be used as a good proxy for protein abundance. In fact, your later aside “Gene Expression and Protein Abundance data” clearly reflects this.
- [x] 3.2.4. Phenotypes: please provide a smooth link between this section and the next, “connecting genotypes and phenotypes”.
- [x] What is the link between 3.3 and 3.4? Is it clear?
- the computational methods (3.4) often use the ontologies and databases described in the previous sections (3.3 and previous). I've added a sentence to signpost this better at the start of 3.4.
- [x] Section 3.5 can be moved towards the end of the chapter before 3.7 summary
- I didn't do this because 3.5 (which mentions the different sources of bias in computational biology) motivates 3.6 (which introduces PQI, a project that I worked on which aims to combat this). But I made this link clearer to the reader.
- [x] Also, in the summary it is of foremost importance to highlight the core of the chapter: genotypes and phenotypes, linking them, and related data sources. The description of bias and potential statistical pitfalls can be mentioned later.
- Reinforced mention of core information and related bias to the work done later.
- [x] be more specific about your contribution
- My contributions are listed in a yellow box at the start of the PQI section and at the start of the Chapter.
- [x] the superfamily section needs to be expanded
- expanded substantially
[ ] Chapter 4:
- [x] #63
  - Yes, this is described in section 4.6.1.3. But I now signpost it earlier (4.1 "Introduction" and 4.2.1 "Approach") by describing that Snowflake is not meant to perfectly predict all phenotypes, but to uncover mechanisms for some phenotypes.
- [ ] 4.2.2.2. Restricted phylogeny could have been better for deleterious variant predictions?
- [ ] 4.2.2.3 and 4.2.2.4 Schematic illustration of detailed steps would have been very helpful.
- [ ] Did you try different clustering methods and check for consistency?
  - [ ] Describe: yes. The clustering methods did give quite different results. But this would be expected.
- [ ] Did you use different distance measures and determine the best performing one? Or was the Euclidean measure the only choice?
  - [ ] Describe: yes. Tried Euclidean as well as what I actually did, which was not Euclidean. But can only motivate choice theoretically because there isn't the data to trial lots of different things.
- [ ] Why did you not use UK biobank data instead of 1000 genomes data?
  - [ ] Describe: wasn't available.
- [ ] It would be excellent if you explained in more detail how your contribution has significantly enhanced Snowflake.
- [ ] What is the typical range of the phenotype score?
- [ ] What are the max and min in your application across datasets? Although this would depend on the dataset, can you provide a flavor for a typical range?
- [ ] insert pseudocode (4.2.2)
- [x] compare with new version of Snowflake (if it’s available)
  - Was not available
- [ ] expand the metrics/clustering part to better justify choice of dataset
- [ ] the actual results section is very short; it needs to be expanded and made more detailed
[ ] Chapter 5
- [ ] Application of snowflake to ALSPAC
- [ ] What is the current state of the application of Snowflake to ALSPAC?
- [ ] In section “5.2.1. Selection of phenotypes”: could the limitations of Snowflake be due to relatively limited data, or to mutations in regions beyond domains? Did you check?
- [ ] “In selecting phenotypes, I considered only (1) whether Snowflake considered these to be phenotypes where it could make a confident prediction and (2) whether the phenotypes in ALSPAC could be used to validate this prediction. I did not consider additional information that might indicate whether these were phenotypes we might expect to be able to predict, for example, whether these phenotypes were heritable, or consider whether they are desirable to predict. Since I chose these purely by looking at the distribution of scores for Snowflake, our lack of promising results could be an indication that the phenotype‑score (finding interesting distributions of phenotypes) is unsuccessful.” This para is not clear to me. Can you explain?
- [ ] In terms of the application of Snowflake to ALSPAC: did you consider randomizing the data, or generating hypothetical random data, applying Snowflake to it, and comparing the phenotype scores with the ones you got for the original application of Snowflake to ALSPAC?
- [ ] extremely short, consider merging with Chapter 4
[ ] Chapter 6:
- [ ] Integration of (tissue-specific) gene expression did improve genotype-phenotype prediction, but to what extent?
- [ ] Why were GTEx datasets not considered?
- [ ] How does isoform expression factor into this equation?
- [ ] Why was proteomics data not considered?
- [ ] How did you treat expression supported by multiple different studies (biological replicates), as opposed to expression supported by a limited number of studies or samples?
- [ ] I have an issue with relying completely on Uberon, as gene expression is far more prone to rapid rewiring/reprogramming than protein-coding regions.
- [ ] 6.5.1. CAFA 2 Fmax appears quite low (extrapolating from machine learning studies)
- [ ] it might be worth discussing how much the results depend on the chosen dataset, and/or on the use of dcGO.
[ ] 7. Ontolopy
- [ ] Very interesting package that can be used to glean data from OBO files and manipulate them in a customized form. Can Ontolopy be used to build knowledge graphs? Or can be enhanced in future to do so? (Yes very similar)
- [ ] As with gene ontologies, are there one-to-many mappings? If so, is it possible to glean the most relevant mapping in a context-specific manner?
- [ ] How reliable are Uberon-to-sample mappings in general? Are there some examples to clearly demonstrate this?
[ ] 8. Combining RNA‑seq datasets
- [ ] Are there specific demonstrable examples of combining RNA‑seq datasets at the data level rather than at the primary results level? These data might have been acquired in distinct conditions from slightly distinct samples. Are the data being treated as biological or technical replicates?
- [ ] Does combining gene expression improve correlation with protein abundance?
[ ] Misc
- [ ] Add page numbers

NatalieZelenka / phenotype_from_genotype

Corrections: list #60