RobertsLab / resources

https://robertslab.github.io/resources/

More efficient ways to do basic stats in R? #241

Closed kaitlynrm closed 6 years ago

kaitlynrm commented 6 years ago

I'm currently using matrixStats (https://www.rdocumentation.org/packages/matrixStats/versions/0.53.1) to do basic stats in R, generating a table with several quantitative variables: median, mean, variance, skewness, kurtosis, etc. I did this in Excel previously but am trying to replicate it in R.

Often matrixStats has me generate a matrix instead of working directly off the df. Is there another package that has more stats options or a package that works directly off the data frame (so the process is faster)?

sr320 commented 6 years ago

Can you provide a data table and a specific task to perform on the table?

kaitlynrm commented 6 years ago

There are a few stats I haven't been able to find that I would like applied to all rows of this table: coefficient of variation, kurtosis, and skewness. Additionally, if there is another package that can generate standard deviation, variance, and quartiles without the extra step of converting the df to a matrix, that would be nice to know as well.

I already know how to apply the functions in matrixStats and those in base R, but was wondering if there is a faster way of doing it.

kubu4 commented 6 years ago

Can you give a brief explanation about your table? Specifically, what the rows and columns indicate?

kaitlynrm commented 6 years ago

The rows are the detected proteins. The columns are Silo_day, so each column holds the NSAF values for one silo and day, with one value per detected protein.
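
To make that layout concrete, here is a minimal sketch of such a table and its transposed form. The protein names, column names, and NSAF values are all made up for illustration; they are not taken from the actual data set.

```r
# Hypothetical toy table: one row per protein, one column per silo/day sample.
nsaf <- data.frame(
  protein    = c("prot_A", "prot_B", "prot_C"),
  silo2_day3 = c(0.010, 0.002, 0.050),
  silo2_day5 = c(0.012, 0.004, 0.040),
  silo3_day3 = c(0.008, 0.006, 0.045),
  silo3_day5 = c(0.014, 0.008, 0.065)
)

# Transposed view: proteins become columns and samples become rows.
values <- as.matrix(nsaf[, -1])        # numeric part only
rownames(values) <- nsaf$protein       # keep protein names as labels
nsaf_t <- as.data.frame(t(values))
```

In `nsaf` each protein's values run across a row; in `nsaf_t` they run down a column, which is the shape most column-oriented R functions expect.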

kubu4 commented 6 years ago

Ok, thanks.

Now, what is one thing you're trying to determine from this data and why? Also, what is the issue you're having in making this determination?

kaitlynrm commented 6 years ago

Well, I'm just adding basic statistical measures in columns for each protein. I have been looking up packages that provide some of the stats, and that has been working fine. However, I often have to switch between matrices and data frames and load several packages to get all of these functions. Steven suggested I post here to see if anyone knew of a more encompassing package or a faster way to perform them.

There is no one specific thing I'm looking for per se.

kubu4 commented 6 years ago

> faster way

Can you expand on this a bit? What aspect of the process do you find to be too slow? E.g.

  • requires too much typing?
  • loading packages takes a long time?
  • computer takes a long time to perform analysis?

Maybe also post a link to your R script(s) and indicate what aspects of the script you wish were faster?

> Additionally, if there is another package that can generate standard deviation, variance, and quartiles without the extra step of making the df a matrix

No packages are needed for these; they are base R functions that can be applied to the columns of a data frame (e.g. with sapply()):

  • sd()
  • var()
  • quantile()

> There are a few stats I haven't been able to find: coefficient of variation, kurtosis, and skewness

  • coefficient of variation is: sd()/mean()
  • kurtosis and skewness both need an external library. library(e1071) will calculate both and can be applied to data frame columns.

> adding basic statistical measures in columns for each protein.

Why? If you're setting up an R script to perform the analysis, there's no real need to append the data to the table.

With all of that said, what are you going to do with these stats? Why calculate these?

> applied to all rows

R is set up to analyze data in columns. I think you'll need to transpose your data so that protein NSAF values are in columns and silo_day are in rows.

sr320 commented 6 years ago

Note - this was just my request to see what I could do with dplyr.
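
The suggestions above (sd(), var(), quantile(), sd()/mean(), plus skewness and kurtosis) can be sketched in base R as follows. This is only an illustration under stated assumptions: the `nsaf` table and the `describe()` helper are hypothetical, and the skewness and kurtosis lines use simple moment-based formulas as a stand-in for e1071 so the sketch needs no extra packages. e1071 offers several estimator types, so its results may differ slightly from these.

```r
# Hypothetical toy data: proteins in rows, silo/day samples in columns.
nsaf <- data.frame(
  protein = c("prot_A", "prot_B"),
  s1 = c(1, 2), s2 = c(2, 2), s3 = c(3, 2), s4 = c(4, 10)
)

# Hypothetical helper: summary statistics for one numeric vector.
describe <- function(x) {
  x  <- x[!is.na(x)]          # drop missing values
  m  <- mean(x)
  d  <- x - m                 # deviations from the mean
  m2 <- mean(d^2)             # second central moment (divides by n)
  c(
    mean     = m,
    median   = median(x),
    sd       = sd(x),
    var      = var(x),
    q25      = unname(quantile(x, 0.25)),
    q75      = unname(quantile(x, 0.75)),
    cv       = sd(x) / m,                 # coefficient of variation
    skewness = mean(d^3) / m2^1.5,        # moment-based, not e1071's default
    kurtosis = mean(d^4) / m2^2 - 3       # excess kurtosis, moment-based
  )
}

# apply() over rows (MARGIN = 1) returns one column per protein,
# so transpose to get one row per protein again.
stats  <- t(apply(nsaf[, -1], 1, describe))
result <- cbind(nsaf["protein"], stats)
```

apply() converts the numeric columns to a matrix internally, so no explicit as.matrix() step is needed. If the table is transposed first so that proteins are columns, sapply(df, describe) gives the same numbers without the MARGIN argument.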

laurahspencer commented 6 years ago

Y'all know about the aggregate() function? Here's a script I used recently to generate a data frame with the count, mean, and standard deviation of oyster lengths by treatment group. Note that the original data frame (DF1) is in long form, melted with the melt() function from the reshape2 package.

DF2 <- cbind(
  aggregate(value ~ GROUP, DF1, length, na.action = na.omit),
  aggregate(value ~ GROUP, DF1, mean, na.action = na.omit)[2],
  aggregate(value ~ GROUP, DF1, sd, na.action = na.omit)[2]
)
colnames(DF2) <- c("Group", "Count", "Mean", "SD")
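
A possible variation on the script above, offered as a sketch and not as the original author's method: a single aggregate() call can return all three statistics if the summary function returns a named vector. The `DF1` below is hypothetical stand-in data with made-up group names and values.

```r
# Hypothetical long-format data, standing in for DF1 above.
DF1 <- data.frame(
  GROUP = c("ambient", "ambient", "ambient", "low_pH", "low_pH", "low_pH"),
  value = c(10, 12, NA, 20, 22, 24)
)

# One aggregate() call; the function returns a named vector of statistics.
# na.action = na.omit drops the row with the missing value first.
DF2 <- aggregate(
  value ~ GROUP, data = DF1,
  FUN = function(x) c(Count = length(x), Mean = mean(x), SD = sd(x)),
  na.action = na.omit
)

# aggregate() stores the three statistics in a single matrix column;
# do.call(data.frame, ...) flattens it into ordinary columns.
DF2 <- do.call(data.frame, DF2)
colnames(DF2) <- c("Group", "Count", "Mean", "SD")
```

This avoids repeating the formula three times, at the cost of the extra flattening step.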