Chapter 9 Vocab - Githubissues

Very nice start on this, @serewilliams ! I have some suggestions for edits below. Once you update the list to address these, @baileyfosdick will take a look and make some of her own suggestions.

It's fine to use quotation marks in your definitions (as in the definition of "robust"), but I recommend that you use a single quotation mark on each side. Sometimes, using a double quotation mark in a flat file will cause some problems in reading the file in. I'm not sure if that would create problems this time, but I recommend changing " to ' throughout, just in case. (Also, sometimes the double quote characters are printed using non-ASCII characters, which can occasionally gum things up.)
For "outlier", I suggest adding "potentially" before "dominating". Sometimes an outlier has very little influence on the analysis you do: for example, if you're fitting a regression line and an outlier is far from the other points but still falls along the line, taking it out won't change your estimate of the slope much.
For "taxa / taxon", since you have the definition in the singular, I suggest just using "taxon" for the vocabulary term.
For microarrays, are these used to analyze something specific (e.g., gene expression), or really can they be used for a large number of analyses? If they're typically used for one specific type of outcome, could you edit the definition to specify that?
For your definition of "assay", could you be more specific? The definition that Wikipedia has is "An assay is an investigative (analytic) procedure in laboratory medicine, pharmacology, environmental biology and molecular biology for qualitatively assessing or quantitatively measuring the presence, amount, or functional activity of a target entity (the analyte).", which seems reasonable and could work here. The definition right now is broad enough that it could include scientific procedures done in a lot of different fields, and for a lot of different outcomes, but you usually only hear "assay" used in some specific biological fields.
I think there should be a hyphen in "chi-square" in "chi-square distance". Also, I think this definition might be wrong. I think that "chi-square distance" is a specific way of calculating the distance between two points (in line with some of the other distance measurements that we covered in an earlier chapter, like Euclidean and Manhattan). Here's one source that I found that had a bit more on this: http://www.econ.upf.edu/~michael/stanford/maeb4.pdf. I suggest looking a bit more into this term and editing the definition once you do. I think the current definition is getting a bit confused with a chi-squared statistic.
I think the definition for "contingency table" is a bit too specific. I think it can be used to show the relationship between any pair of categorical variables measured for a set of observations, not just phenotypes. Could you edit that definition to make that point clearer (and perhaps you could still include something like, "for example, two phenotypes").
I think the definition for "nonlinear equations" might be a bit off. Typically, we think of "nonlinear" in a regression model as being one that isn't a simple "linear" combination of the independent parameters. For a "linear" regression equation, you're essentially just weighting each independent variable by a weight (which you estimate as the regression coefficient for that parameter) and then summing those together to estimate your dependent variable. You don't include any transformations of the independent parameters; for example, you won't have any terms that are squared or logged or raised to the power of another variable or anything like that. Here's a link to a blog post that covers this distinction: https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/.
For your definition of "confounded effects", could you add at the beginning something like "a term describing when there is"? I think you wouldn't describe these effects as the uncertainty itself, which is suggested by the current set-up of the definition (rather, the uncertainty in where the variation comes from means that you have a really hard time disentangling the separate influences of two or more variables).
For "co-occurrence matrix", maybe update to "a matrix that captures the extent to which variables are jointly observed in observations". Also, for a fun example, see this example (https://bost.ocks.org/mike/miserables/) of the co-occurrence of different characters in chapters of Les Miserables...
In the definition for "kernel", I think you're close, but it would be nice to include something about how this uses a linear algorithm to try to determine a non-linear decision boundary. Try googling "kernel trick" to see if you can get a bit more on what makes kernels special in statistics / machine learning.
For "penalty", I think we want to make sure that the idea comes through that this is a way to constrain the typical optimization algorithm when running an analysis. See https://en.wikipedia.org/wiki/Penalty_method for a bit more.
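
As a rough illustration of the "chi-square distance" point above, here is a minimal sketch of the distance between two rows of a contingency table, assuming the correspondence-analysis form (row profiles compared, with each column weighted by the inverse of its overall share). The function name `chi_square_distance` and the counts are made up for illustration; they are not from the vocabulary list or the linked source.

```python
import numpy as np

def chi_square_distance(table, i, k):
    """Chi-square distance between rows i and k of a table of counts."""
    table = np.asarray(table, dtype=float)
    # Row profiles: each row rescaled so that it sums to 1.
    row_profiles = table / table.sum(axis=1, keepdims=True)
    # Column masses: each column's share of the grand total.
    col_masses = table.sum(axis=0) / table.sum()
    diff = row_profiles[i] - row_profiles[k]
    # Squared profile differences, weighted by inverse column mass.
    return float(np.sqrt(np.sum(diff ** 2 / col_masses)))

# Hypothetical counts: 3 samples (rows) by 3 categories (columns).
counts = np.array([[10, 20, 30],
                   [20, 40, 60],
                   [30, 20, 10]])

d_same_shape = chi_square_distance(counts, 0, 1)  # rows with the same proportions
d_diff_shape = chi_square_distance(counts, 0, 2)  # rows with reversed proportions
print(d_same_shape, d_diff_shape)
```

Rows 0 and 1 have the same proportions across columns, so their distance is zero even though the raw counts differ, which is one way this measure differs from Euclidean distance on the counts themselves.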

Hi Brooke,

Thank you for these detailed suggestions (and fun links).

The only comment I have is that in researching "nonlinear" (from the excellent link you added) it seems that the regression equation is linear as long as the 'parameters are linear'. Even though it's counter-intuitive, they are linear if the independent variables can be related through multiplying, exponentiating, or transforming. Let me know if I'm misinterpreting this.

I've completed the edits and redone the tabs. I've committed changes. I think it should still be in the same pull request, yes?

Thank you, Seré

Seré Williams, M.S.

PhD Student | Cell & Molecular Biology

Name Pronunciation: sir-EE

303.550.4375

sere.a.williams@gmail.com or sere.williams@colostate.edu

@serewilliams : Thanks for these edits!

For your question about linear--I think that if you relate the parameters through something like exponentiating or transforming, it would make the equation non-linear. I'll let @baileyfosdick chime in on that, though, in terms of the characteristics that distinguish a system of linear equations from a system of non-linear equations.

Also, @baileyfosdick , I think these vocab terms are ready for your suggestions: https://github.com/geanders/csu_msmb/blob/9dc8eccdd17444c7324ef62c5681ca7f2ce9c749/content/post/vocab_lists/chapter_9.tsv

Once you take a pass, I'll integrate into the website and get the post published.

I still need to embed the quizlet and check the tab formatting before it goes up.

Here is the part of the blog that I think answers the linear/nonlinear question:

From: https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/

Statisticians https://statisticsbyjim.com/glossary/statistics/ say that this type of regression equation is linear in the parameters https://statisticsbyjim.com/glossary/parameter/. However, it is possible to model curvature with this type of model. While the function must be linear in the parameters, you can raise an independent variable by an exponent to fit a curve. For example, if you square an independent variable, the model can follow a U-shaped curve.

$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{1}^{2}$

While the independent variable is squared, the model is still linear in the parameters. Linear models can also contain log terms and inverse terms to follow different kinds of curves and yet continue to be linear in the parameters.

The regression example below models the relationship between body mass index (BMI) and body fat percent. In a different blog post, I use this model to show how to make predictions with regression analysis https://statisticsbyjim.com/regression/predictions-regression/. It is a linear model that uses a quadratic (squared) term to model the curved relationship.
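
To see the quoted point in practice, here is a minimal sketch that fits a curve with a squared term using ordinary least squares. The data are simulated and the true coefficients (1.0, 2.0, 0.5) are arbitrary choices for illustration; nothing here comes from the blog's BMI example.

```python
import numpy as np

# Simulated data with a curved relationship (coefficients chosen arbitrarily).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Design matrix: intercept, x, and x squared. The squared column is a
# transformation of the independent variable, but the model is still a
# weighted sum of columns, so it is linear in the parameters.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares solves for the three weights directly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The squared term only changes a column of the design matrix; the estimation step is the same linear least-squares solve that would be used for a straight line.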

Yep, that's definitely consistent with what you were saying. I didn't think that an equation could have a log or inverse of an independent variable and still be linear, though. I will let @baileyfosdick help us with this question. I may have sent you to a bad source, in which case, my apologies!

I completely understand. When Google does not come through, it's always good to have an expert on hand. I've added and embedded the quizlet and checked that the tabs are being read as tabs (I moved it into BBEdit to do this). I'll integrate Bailey's comments when I get them.
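
For anyone who wants to do the same tab check without a text editor, here is a minimal sketch that counts tab-separated fields per line. The column names `term` and `definition` and the sample rows are hypothetical stand-ins, not the actual contents of chapter_9.tsv.

```python
import csv
import io

def field_counts(tsv_text):
    """Return the number of tab-separated fields on each line."""
    # QUOTE_NONE treats quotation marks as ordinary characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return [len(row) for row in reader]

# Hypothetical file contents: one with real tabs, one where the
# second line uses spaces instead of a tab.
good = "term\tdefinition\nassay\tAn 'investigative' procedure\n"
bad = "term\tdefinition\nassay    An investigative procedure\n"

print(field_counts(good))  # [2, 2]
print(field_counts(bad))   # [2, 1]
```

A line whose count differs from the header's count is a sign that a tab was typed or pasted as spaces.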

@serewilliams You are correct in your understanding of a linear equation. A linear equation is one where the dependent variable Y is a linear function of the parameters. Thus the independent variables (X) can be transformed, but not the Y variable.
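
A small pair of equations may make the distinction concrete. These are illustrative examples chosen here, not taken from the vocabulary list: in the first, X is transformed but each parameter only multiplies a term; in the second, a parameter sits inside the exponential, so Y is no longer a linear function of the parameters.

```latex
% Linear in the parameters: log and inverse terms of X are allowed.
Y = \beta_0 + \beta_1 \log(X) + \beta_2 \frac{1}{X} + \varepsilon

% Nonlinear in the parameters: \beta_1 appears inside the exponential.
Y = \beta_0 \, e^{\beta_1 X} + \varepsilon
```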

Ok, great. Thank you.

I think it is ready to go. I believe the current version on GitHub is the final one and the pull request is still open. Let me know if there's anything else I need to do on my end.

Thank you, Seré

geanders / csu_msmb

Chapter 9 Vocab #59