OtagoStudyGroup / studyGroup

http://otagostudygroup.github.io/studyGroup

3:00pm April 14th rm319 Biochem - Large Data Packages in R: data.table and more! #21

Closed TomKellyGenetics closed 7 years ago

TomKellyGenetics commented 8 years ago

The upcoming SYSKA will be a demonstration of some powerful R packages for handling large data, delivered by me! We will focus on the data.table package and how it differs from the data frames most of you will have seen before.

If we have time I'll compare it to other approaches (data frames, matrices, bigmemory, RMySQL) and do some benchmarking on speed. data.table works a bit differently to most of R, but it has some cool features, including fast reading of data files into R, and it is compatible with most functions that take data frames (including Hadley's packages).
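As a rough sketch of how the syntax differs from a plain data frame (the toy values below are invented for illustration and are not the gapminder data; it assumes data.table is already installed):

```r
library(data.table)

# Toy table standing in for the real data (made-up values, for illustration only)
df <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  pop     = c(10, 12, 20, 26)
)

# Base R data frame: mean population per country
df_means <- aggregate(pop ~ country, data = df, FUN = mean)

# data.table: the same summary using the DT[i, j, by] form
dt <- as.data.table(df)
dt_means <- dt[, .(pop = mean(pop)), by = country]

# Add a column by reference (the table is modified in place, not copied)
dt[, pop_k := pop * 1000]

# Filter rows without needing dt$year or a trailing comma
recent <- dt[year == 2007]

# A data.table is still a data.frame, so functions expecting one accept it
is_df <- is.data.frame(dt)

# Fast file I/O: fwrite() and fread() are data.table's CSV writer and reader
tmp <- tempfile(fileext = ".csv")
fwrite(df, tmp)
dt_from_disk <- fread(tmp)
```

On a table this small there is no speed difference to see; the point is only the shape of the calls, which stays the same on much larger files.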

You're welcome to follow along. I'll be using the gapminder dataset from the Software Carpentry workshops, and I'll post a link to the data closer to the time.

We're also discussing bringing back the 5x5 Lightning format in a couple of weeks. It's a great opportunity for any of our newest study-group-ers to have a go at a SYSKA. Let us know! :)

TomKellyGenetics commented 8 years ago

Hadley's new FEATHER data/file type may be discussed too. The main selling point is speed between the disk and R. It's developed in collaboration with Wes McKinney, the "pandas" guy, so Python fans, this is a cross-over event: it is compatible with both systems.

TomKellyGenetics commented 8 years ago

For future reference, the gapminder data is here: https://github.com/resbaz/r-novice-gapminder/tree/gh-pages/data

I'll likely figure out how to pull it directly from GitHub and give the code at the SYSKA session anyway.

TomKellyGenetics commented 8 years ago

Sorry to mess with our regular time. I've got a demonstrators' meeting at 4, so (if possible) I'd like to start half an hour earlier than usual. I'll have to leave at 4, but you're welcome to continue with discussion, tinkering with data.table, or "hackyhour" type questions about code.

As usual you are welcome to bring your laptops to follow along and try things out for yourself. We'll be using a few R packages from CRAN. I think most of you are set up with R/RStudio installed and access to the university/eduroam network, but we can help with this on the day too.

TomKellyGenetics commented 8 years ago

Hi everyone, I've posted the lesson notes and data in the GitHub repository below. See that repo for the latest version; we will include this in our studyGroup lesson too. https://github.com/TomKellyGenetics/syska-R-data-table

TomKellyGenetics commented 8 years ago

We'll get everyone set up on the day, but if it's easier to install the packages in advance (with network access), you can download the lesson materials with the following command in a Mac, Bash, or Git shell:

git clone git@github.com:TomKellyGenetics/syska-R-data-table.git

The following command will install the CRAN packages in R:

install.packages(c("data.table", "ggplot2", "plyr", "bigmemory", "devtools", "gplots"))

This will install the FEATHER package from GitHub:

library("devtools")
devtools::install_github("wesm/feather/R")

Installing FEATHER requires large dependencies, so it is best installed in advance, but it is only used to demonstrate a newer alternative to some of the functions in data.table or bigmemory. You only need to install these packages in advance if you wish to try FEATHER yourself during the meeting.

murraycadzow commented 8 years ago

Due to sickness, Nick B will now be taking this session