STAT545-UBC / Discussion

Automating Data Analysis Pipelines #403

Open sjackman opened 7 years ago

sjackman commented 7 years ago

Please post here your questions related to today's lecture on Automating Data Analysis Pipelines. For questions related to installing Make and command line tools on Windows please see https://github.com/STAT545-UBC/Discussion/issues/397#issuecomment-259244767

I've posted the exact code from today's live demonstration online at https://github.com/sjackman/makefile-example-stat545

This example Makefile uses curl, awk and other common command line utilities. If you prefer, you can follow this example instead: http://stat545.com/automation04_make-activity.html It is a very similar pipeline to the one I demonstrated in class today, but it uses R scripts rather than command line utilities for each step. For example, instead of downloading the data with the command line utility curl, it uses the R function download.file. The only two command line utilities that you need for this activity are make and Rscript.

Finally, for the adventurous, a more fully-featured example of the same analysis is available at https://github.com/sjackman/makefile-example It adds a parameterized RMarkdown report, a comparison plot of the two word length distributions, and a statistical test to show that the difference in the two distributions is in fact statistically significant.

sjackman commented 7 years ago

Amgen was able to reproduce 6 of 53 “landmark” cancer studies.

In 2012, Amgen alarmed the scientific world by revealing that it had been able to reproduce the results of only six out of 53 “landmark” cancer studies. This confirmed similar, worrying findings from German drug company Bayer released the previous year.

https://www.timeshighereducation.com/news/amgen-launches-new-platform-help-fix-scientific-reproducibility-crisis

ksedivyhaley commented 7 years ago

I am getting a bunch of errors when I try to run the class demonstration. First:

make: *** No rule to make target 'en.html', needed by 'all'. Stop.
Exited with status 2.

I guessed that this had to do with using % as a placeholder for the language, so I wrote out a command to explicitly specify en.html :

en.html: en.rmd en.tsv
    Rscript -e 'rmarkdown::render("$*.rmd")'

Then I got a new error:

curl: not found

Is there something I still need to install? The pipeline without using command line utilities worked fine. I'm using Windows and didn't get a chance to execute the live example in class because I had problems installing make.

sjackman commented 7 years ago

Homework 7 is posted and due anytime Monday 2016-11-18.

sjackman commented 7 years ago

@ksedivyhaley For troubles installing the software on Windows, please see https://github.com/STAT545-UBC/Discussion/issues/397#issuecomment-259244767 As an alternative, use Rscript -e "download.file(...)" rather than curl in your Makefile.
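
A minimal sketch of that fallback as a shell helper, assuming a POSIX-style shell such as Git Bash. The function names pick_downloader and fetch are hypothetical and not part of the class example. The idea is to use curl when it is on the PATH and otherwise fall back to R's download.file through Rscript.

```shell
# Print the name of the first download tool found on PATH, or "none".
pick_downloader() {
  for tool in curl Rscript; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      return 0
    fi
  done
  printf 'none\n'
  return 1
}

# Download URL ($1) to DEST ($2) with whichever tool was found.
fetch() {
  url=$1
  dest=$2
  case $(pick_downloader) in
    curl)
      curl -fsSL -o "$dest" "$url" ;;
    Rscript)
      # commandArgs(TRUE) returns the arguments that follow the -e expression.
      Rscript -e 'a <- commandArgs(TRUE); download.file(a[1], a[2], mode = "wb")' "$url" "$dest" ;;
    *)
      echo "no download tool found on PATH" >&2
      return 1 ;;
  esac
}
```

In a Makefile recipe this would probably be simpler to write out directly as either the curl line or the Rscript line. The helper is mainly a way to see which tool the current shell can actually find.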

samhinshaw commented 7 years ago

@ksedivyhaley @sinaneza had the same issue with %.html, but when en.rmd and fr.rmd were added to the directory, %.html was correctly identified and interpreted by Make. However, your workaround is spot on--that's the first thing we tried as well!

As for curl, many other students on Windows had the exact same issue. curl was found when called in Git Bash, but when the Makefile was run in RStudio via Build All, curl could not be found. This is most likely a PATH issue, but fortunately, as @sjackman has pointed out, there are MANY different ways to download a file, and doing it via R is a very simple workaround!

Another solution would include running the makefile via git bash. Installing the ubuntu subsystem for windows may help as well!

sjackman commented 7 years ago

@ksedivyhaley

en.html: en.rmd en.tsv
    Rscript -e 'rmarkdown::render("$*.rmd")'

Note that $* only works inside a pattern rule. You can use either

%.html: %.rmd %.tsv
    Rscript -e 'rmarkdown::render("$*.rmd")'

or

en.html: en.rmd en.tsv
    Rscript -e 'rmarkdown::render("en.rmd")'
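
To see the difference for yourself, here is a small sketch that writes a throwaway Makefile and runs both kinds of rule. It assumes GNU Make is on your PATH. The function name stem_demo and the echo messages are made up for illustration.

```shell
# Write a temporary Makefile with one explicit rule and one pattern rule,
# then build one target through each. printf '\t' supplies the literal tab
# that make requires at the start of a recipe line.
stem_demo() {
  demo_dir=$(mktemp -d) || return 1
  {
    printf 'en.html:\n'
    printf '\t@echo "explicit rule: stem is [$*]"\n'
    printf '%%.html:\n'
    printf '\t@echo "pattern rule: stem is [$*]"\n'
  } > "$demo_dir/Makefile"
  make -s -C "$demo_dir" en.html   # handled by the explicit rule
  make -s -C "$demo_dir" fr.html   # falls through to the pattern rule
  rm -r "$demo_dir"
}

if command -v make >/dev/null 2>&1; then
  stem_demo
else
  echo "make is not on PATH" >&2
fi
```

With GNU Make the explicit rule should report an empty stem and the pattern rule should report fr, which is why render("$*.rmd") in an explicit rule ends up looking for a file named .rmd.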

sjackman commented 7 years ago

Another solution would include running the makefile via git bash.

Rather than running make inside RStudio using the Build All button, open a Git BASH terminal, cd to your R Project directory that contains the Makefile, and run make there.

jennybc commented 7 years ago

Would someone who can run a makefile in the shell but not via, e.g., the RStudio Build All button do this for me?

I'd like to see the PATH reported in a Git Bash shell. Get with echo $PATH (echo %PATH% is the cmd.exe syntax; someone can correct me -- I don't have Windows).

vs

PATH reported in a shell launched from RStudio via Tools > Shell

vs PATH reported from R Console inside RStudio. Get with Sys.getenv("PATH").
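
For the two shell cases, a sketch like the following may make the output easier to compare. It assumes a POSIX-style shell where PATH entries are separated by colons, as in Git Bash. The function names split_path and report_tools are hypothetical.

```shell
# Print each PATH entry on its own line so listings from two shells
# can be compared by eye or with diff.
split_path() {
  printf '%s\n' "$1" | tr ':' '\n'
}

# For each tool name, print where it resolves or report that it is missing.
report_tools() {
  for tool in "$@"; do
    if location=$(command -v "$tool" 2>/dev/null); then
      printf '%s: %s\n' "$tool" "$location"
    else
      printf '%s: not found\n' "$tool"
    fi
  done
}

split_path "$PATH"
report_tools curl make Rscript
```

Running the same two calls in Git Bash and in the shell launched from RStudio should show which directories, and therefore which tools, each one is missing.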

ksedivyhaley commented 7 years ago

@samhinshaw I have en.rmd and fr.rmd in my directory (the same folder as the project file and Makefile). It is still not identifying %.html, either in RStudio or in the shell.

I've tried this in RStudio with Build All, in the shell from RStudio, and the Git Bash shell.

When I use a command to make en.html specifically:

en.html: en.rmd en.tsv
    Rscript -e 'rmarkdown::render("en.rmd")'

Build All gets the Curl error.

Calling make en.html in the shell doesn't produce the curl error, but does produce

Error in loadNamespace(name) : there is no package called 'rmarkdown'
Calls: :: ... tryCatch -> tryCatchList -> tryCatchOne ->
Execution halted
make: *** [en.html] Error 1
rm en.length en.length.count

Git Bash shell fails with:

Rscript -e 'rmarkdown::render("en.rmd")'
make: Rscript: Command not found
make: *** [en.html] Error 127

ksedivyhaley commented 7 years ago

@jennybc

Git Bash

image

RStudio Shell:

$ env|grep PATH
HOMEPATH=\Users\7ks42
MANPATH=/mingw64/share/man:/usr/local/man:/usr/share/man:/usr/man:/share/man:
PATH=/c/Users/7ks42/bin:/mingw64/bin:/usr/local/bin:/usr/bin:/bin:/mingw64/bin:/usr/bin:/c/Users/7ks42/bin:/c/Program Files/R/R-3.3.1/bin/x64:/c/Rtools/bin:/c/Rtools/mingw_32/bin:/c/Program Files/Broadcom/Broadcom 802.11 Network Adapter:/c/WINDOWS/system32:/c/WINDOWS:/c/WINDOWS/System32/Wbem:/c/WINDOWS/System32/WindowsPowerShell/v1.0:/c/Program Files (x86)/ATI Technologies/ATI.ACE/Core-Static:/c/Program Files/Lenovo/Bluetooth Software:/c/Program Files/Lenovo/Bluetooth Software/syswow64:/c/Program Files (x86)/Skype/Phone:/cmd:/c/Users/7ks42/AppData/Local/Microsoft/WindowsApps:/c/Program Files (x86)/GnuWin32/bin:/bin:/c/Program Files/RStudio/bin/msys-ssh-1000-18:/usr/bin/vendor_perl:/usr/bin/core_perl
EXEPATH=C:\Program Files\Git\bin
PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
PKG_CONFIG_PATH=/mingw64/lib/pkgconfig:/mingw64/share/pkgconfig
ACLOCAL_PATH=/mingw64/share/aclocal:/usr/share/aclocal
INFOPATH=/usr/local/info:/usr/share/info:/usr/info:/share/info:

RStudio Console:

C:\Program Files\R\R-3.3.1\bin\x64;c:\Rtools\bin;c:\Rtools\mingw_32\bin;C:\Program Files\Broadcom\Broadcom 802.11 Network Adapter;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\ATI Technologies\ATI.ACE\Core-Static;C:\Program Files\Lenovo\Bluetooth Software\;C:\Program Files\Lenovo\Bluetooth Software\syswow64;C:\Program Files (x86)\Skype\Phone\;C:\Program Files\Git\cmd;C:\Users\7ks42\AppData\Local\Microsoft\WindowsApps;C:\Program Files (x86)\GnuWin32\bin

sjackman commented 7 years ago

Calling make en.html in the shell doesn't produce the curl error, but does produce Error in loadNamespace(name) : there is no package called 'rmarkdown'

To troubleshoot this error, report the output of Rscript -e ".libPaths()" in Git BASH and .libPaths() in RStudio.

sjackman commented 7 years ago

Git Bash shell fails with: make: Rscript: Command not found

To hopefully fix this error, in Git BASH run

which -a Rscript
PATH="/c/Program Files/R/R-3.3.1/bin/x64:$PATH"
which -a Rscript
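
If that works, the change only lasts for the current Git BASH session. A sketch of a reusable version, assuming you would put it in a shell startup file such as ~/.bashrc (an assumption about your setup, not something verified in this thread). The function name prepend_path is hypothetical.

```shell
# Prepend a directory to PATH unless it is already listed.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present: leave PATH unchanged
    *) PATH="$1:$PATH" ;;     # otherwise put it at the front
  esac
}

# R install location taken from the commands above; adjust for your machine.
prepend_path "/c/Program Files/R/R-3.3.1/bin/x64"
export PATH
```

The case check keeps the entry from being added again each time the startup file is read.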

ksedivyhaley commented 7 years ago

@sjackman

Git Bash

image

RStudio Console

.libPaths()
[1] "C:/Users/7ks42/Documents/R/win-library/3.3"
[2] "C:/Program Files/R/R-3.3.1/library"

jennybc commented 7 years ago

@sjackman Will you write this down somewhere? The discrete problems and, when we have them, solutions. We encounter the same Windows problems every year.

  1. Inconsistencies with PATH in Git bash vs RStudio > shell vs system command via R Console inside RStudio. I think there are even further variations depending on where Rtools ended up on the PATH. Manifests as (in)ability to find unix tools or to find the expected version.
  2. Inconsistencies with .libPaths() in R called via Rscript in the various shells vs in an R process owned by RStudio.

sjackman commented 7 years ago

@jennybc Yes, will do, once we figure out the proper solution.

@ksedivyhaley That's progress. Try

PATH="/c/Program Files/R/R-3.3.1/bin/x64:$PATH"
Rscript -e ".libPaths()"
Rscript -e ".libPaths('C:/Users/7ks42/Documents/R/win-library/3.3'); rmarkdown::render('en.rmd')"
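
An alternative to calling .libPaths() inside every Rscript command might be to set the R_LIBS_USER environment variable before running make, so that each Rscript started by the Makefile looks in the same user library as RStudio. This is only a sketch. It assumes the library path is the one reported from the RStudio Console earlier in this thread, and whether it resolves the rmarkdown error on a given machine is untested here.

```shell
# Assumption: this is the user library that .libPaths() reported in RStudio.
R_LIBS_USER='C:/Users/7ks42/Documents/R/win-library/3.3'
export R_LIBS_USER

# Check what Rscript now reports, if Rscript can be found at all.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e '.libPaths()'
else
  echo "Rscript is not on PATH yet" >&2
fi
```

If the first entry printed matches the RStudio library, running make from the same shell should pick it up too, since the variable is exported to child processes.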

ksedivyhaley commented 7 years ago

@sjackman

image

From the RStudio Console:

rmarkdown::pandoc_available()
[1] TRUE
rmarkdown::pandoc_version()
[1] ‘1.17.2’

sjackman commented 7 years ago

@ksedivyhaley Progress… (I'll keep saying that…) In RStudio please report the output of

Sys.which(c("curl", "make", "pandoc", "Rscript"))

jennybc commented 7 years ago

@sjackman (or anyone): is anyone using the new linux bash shell for windows 10? I'm curious how this annual train wreck unfolds there.

sjackman commented 7 years ago

I believe @coatless is running Bash on Ubuntu on Windows. See https://github.com/STAT545-UBC/Discussion/issues/397#issuecomment-259249495 Let's move this Windows discussion over to that issue.

sjackman commented 7 years ago

@jennybc Bash on Ubuntu on Windows is fantastic, I declare. 🎉