datasnakes / beri-isc-proposal

An isc proposal to the R Consortium to elicit their support for beRi.
https://tinyurl.com/y7l7jch8

Some comments #24

Open DannyArends opened 5 years ago

DannyArends commented 5 years ago

Dear all, I found your proposal via the Reddit post and would like to comment on a few things that are, in my opinion, not really correct.

While R has CRAN and Bioconductor, it doesn't have a way to install packages globally or locally via the command-line.

This statement is untrue. It is trivial to install R packages from CRAN from the command line:

Rscript -e "install.packages('qtl', repos='https://cloud.r-project.org')"

This can be combined with the lib option to specify where the package needs to be installed (e.g. in the home folder).

Rscript -e "install.packages('qtl', lib='~/Rpacks/', repos='https://cloud.r-project.org')"
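As a minimal sketch of the per-user install above, the following shell helper wraps the same install.packages() call. The function name install_user_pkg, the default ~/Rpacks location, and the DRY_RUN switch are illustrative assumptions, not part of R or of the proposal.

```shell
# Hypothetical helper: install a CRAN package into a per-user library.
# The default library path and the DRY_RUN switch are assumptions for illustration.
install_user_pkg() {
    pkg="$1"
    lib_dir="${2:-$HOME/Rpacks}"
    expr="install.packages('$pkg', lib='$lib_dir', repos='https://cloud.r-project.org')"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$expr"   # print the R expression instead of running it
        return 0
    fi
    mkdir -p "$lib_dir"         # install.packages() expects an existing, writable directory
    Rscript -e "$expr"
}

# Usage (requires R on the PATH):
#   install_user_pkg qtl
#   DRY_RUN=1 install_user_pkg qtl "$HOME/Rpacks"
```

To load packages from such a library later, R has to be told about it, for example through the R_LIBS_USER environment variable.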

The primary barriers for Docker are its complexity and its requirement for administrative privileges, which stunt its adoption in settings where time is limited for users and/or system administrators (e.g. academia).

It is again trivial to set up Docker so that users can run it without sudo rights (see https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user). Additionally, in your proposal you plan to create 3 additional command-line tools, which have to be installed by the system administrator.

Anyway, I do not really see how this project will help with reproducibility of scientific research compared to encouraging people to use Docker. The isolation provided by your proposed tools is paper thin (everything is in a folder in the home directory) and does not provide any real isolation from the underlying OS (e.g. the admin decides it's time for a new version of gcc, and now the entire pipeline in my home folder is broken). How does this deal with packages that have gone missing (e.g. Bioconductor packages are sometimes discontinued)? How does this solve compilation issues for different compiler versions, etc.?

Also, the choice of Python for the implementation seems sketchy to me: it adds yet another dependency that needs to be installed (Python is not standard on Windows/macOS) and maintained (including Python updates).

There are some more minor things:

(1) renv, a virtual environment manager for R; rinse, an R installation and R version manager; and (3) rut, an R utility tool for installing

The (2) is missing in front of the word rinse.

Kind regards, Danny

sdhutchins commented 5 years ago

@DannyArends Thanks for taking the time to read this and provide critiques.

I think we can certainly home in on some of what you said to make it clear what we want.

If you're installing a specific version of a package, a package from Bioconductor, or one from GitHub, the command becomes increasingly long and complex, and it is less obvious exactly what to type. It's a barrier, imo, for someone just learning the language, and it hurts ease of use compared to pip or npm. We need to make that clearer.
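To illustrate how the command changes with the package source, here is a hedged sketch. The helper name r_install_expr is hypothetical; remotes::install_version(), BiocManager::install(), and remotes::install_github() are existing functions from the remotes and BiocManager packages, which would have to be installed first. The package names and the version string in the usage lines are placeholders.

```shell
# Hypothetical helper: print the R expression needed for each package source.
# It only builds the expression; pass the result to Rscript -e to run it.
r_install_expr() {
    source_type="$1"
    target="$2"
    version="${3:-}"
    case "$source_type" in
        cran)   # a pinned version from CRAN
            printf "remotes::install_version('%s', version='%s', repos='https://cloud.r-project.org')\n" "$target" "$version" ;;
        bioc)   # a Bioconductor package
            printf "BiocManager::install('%s')\n" "$target" ;;
        github) # a GitHub repository given as owner/repo
            printf "remotes::install_github('%s')\n" "$target" ;;
        *)
            echo "unknown source: $source_type" >&2
            return 1 ;;
    esac
}

# Usage (requires R plus the remotes / BiocManager packages):
#   Rscript -e "$(r_install_expr cran qtl 1.46-2)"
#   Rscript -e "$(r_install_expr bioc limma)"
#   Rscript -e "$(r_install_expr github rqtl/qtl)"
```

Each source needs a different function and argument style, which is the complexity described above when compared with a single pip or npm command.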

In our experience, the adoption of Docker on HPC systems has been very slow, and admins aren't sure how it works (plus there are the security vulnerabilities). That's something we definitely need to address.

Almost all of the bioinformaticians I've come in contact with on Reddit and in other communities aren't using Docker to manage their workflows. The reason why may not be what we think. We can certainly do more work to find out.

Thank you so much for your honesty. We need this kind of feedback to make the tool better.

grabear commented 5 years ago

@DannyArends Thank you so much for your feedback! It really is helpful to get criticism like this. Going forward, I will consider your arguments today and use them to change some of the wording in our proposal.

It is again trivial to set up Docker so that users can run it without sudo rights (see https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user).

Could you perhaps elaborate on your experience with docker? Specifically, could you tell us what kind of system you have used docker with, and what type of setting this was in (academic, private, or health care industry)?

In our preface we've actually linked to the Docker Daemon Attack Surface. The link you posted actually shows the warning.

Anyway, I do not really see how this project will help with reproducibility of scientific research compared to encouraging people to use Docker.

This is not meant to replace Docker, or even take away from it. We actually say that in our proposal. Docker is by far the BEST solution for reproducibility, because it does isolate your workflow from the OS. beRi has its use cases, so we'll definitely make that clearer.

The isolation provided by your proposed tools is paper thin (everything is in a folder in the home directory) and does not provide any real isolation from the underlying OS (e.g. the admin decides it's time for a new version of gcc, and now the entire pipeline in my home folder is broken). How does this deal with packages that have gone missing (e.g. Bioconductor packages are sometimes discontinued)? How does this solve compilation issues for different compiler versions, etc.?

Also, the choice of Python for the implementation seems sketchy to me: it adds yet another dependency that needs to be installed (Python is not standard on Windows/macOS) and maintained (including Python updates).

Additionally in your proposal you plan to create 3 additional command line tools, which have to be installed by the system administrator.

Python and python packages do not need any sudo permissions. They can both be installed by the user. If python is already installed on the system, you can install python packages in the user's home directory without any elevated permissions, which is as simple as:

pip install --user beRi

How do you use Python in your workflows?

@DannyArends Again thank you for your feedback. We will certainly be better off with the comments you've made.

DannyArends commented 5 years ago

Could you perhaps elaborate on your experience with docker? Specifically, could you tell us what kind of system you have used docker with, and what type of setting this was in (academic, private, or health care industry)?

I used it to set up containerized web-facing applications, such as GeneNetwork and GeneNetwork2, in an academic setting. I also work with different clusters, and many of these indeed have an issue with setting up Docker. It does take some time to convince people that it is worthwhile, but anyone who has ever faced switching postdocs / PhD students knows how much effort it can be to recoup technical debt, and can be more easily convinced. Sysadmins tend to be 'support' for research, and they'll listen when multiple profs come knocking at their doors wanting to install Docker... leverage comes from above ;-)

The environment we suggest is really no different from the way conda handles the R environment. Other than docker or some other form of containerization, there is really no way to provide isolation from the OS. If there are other ways, then please let us know!

We recently switched our Docker infrastructure to GNU Guix: complete version and dependency control, and reproducible binary builds. See https://www.gnu.org/software/guix/ for details. It abstracts away the OS, but can do so in user mode as well and, more importantly, it is FOSS and GNU!

Python and python packages do not need any sudo permissions.

This is only true for Linux, not for macOS and Windows. Of course most clusters use Linux atm, but it is still something to consider.

Additionally, this statement is also true for R: R can be installed in the home folder, and packages can be installed there as well (even from the command line, as I showed in the first example). Changing from one R version to another is as simple as setting R_HOME to the correct location:

R_HOME=~/R/3.3.0/

or

R_HOME=~/R/2.8.0
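A hedged sketch of such a version switch in shell follows. It assumes each R version was installed under ~/R/<version>/ as in the paths above, and it selects the version by putting that installation's bin directory first on PATH rather than by setting R_HOME, which is a swapped-in technique because the R launcher script typically determines R_HOME on its own. The function name use_r and the R_ROOT variable are hypothetical.

```shell
# Hypothetical helper: make a home-installed R version the default in this shell.
# Assumes the layout $R_ROOT/<version>/bin/R (R_ROOT defaults to ~/R).
use_r() {
    version="$1"
    root="${R_ROOT:-$HOME/R}"
    if [ ! -x "$root/$version/bin/R" ]; then
        echo "no R $version found under $root" >&2
        return 1
    fi
    PATH="$root/$version/bin:$PATH"   # the chosen version now shadows any other R
    export PATH
}

# Usage:
#   use_r 3.3.0 && R --version
```

Because the change only affects the current shell session, different terminals or job scripts can use different R versions side by side.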

I don't really see the need for creating three separate tools, written in another language to manage this.

How do you use Python in your workflows?

I generally don't.

R is my main language for scripting/prototyping and my go-to pipeline language (I've written a paper about how we should be using R as the lingua franca in life sciences).

C/C++ for number crunching,

PHP/Python for web facing server backends.

The D language (https://dlang.org/) for just fun stuff

Anyway, to sum up: I read the proposal because I was interested, but I think you should not make claims about reproducibility at all:

The tools created for beRi will aim to support reproducibility by allowing virtualization and standardization for data analysis

This is more or less a false claim, since you're not dealing with that, and it seems there are more 'false claims', such as not being able to install R packages from the command line and that the primary barrier for Docker is its complexity, which isn't really true. The primary barrier for Docker is lazy sysadmins, since anyone who can write a makefile can make a Docker image. There are more sketchy claims in the proposal, without any peer-reviewed literature to back them up.

Additionally, I don't see who is going to maintain the beRi tools long term, which needs to happen outside of the R community, since it's written in Python.

Also the literature references are a joke (1 citation to Stack Overflow 🤦‍♂️).

Anyway, sorry if I sound a bit harsh... it's a nice initial draft proposal but imho a very long way away from being a good proposal.

Danny

grabear commented 5 years ago

@DannyArends That's ok Danny! We understand it may look a little novice. We have only had about 2 weeks to write the proposal, and I've done about 66% of the writing and managing myself. So we have been short on time.

These comments are seriously amazing, and I think I can speak for all of us (@datasnakes/beri-leads), when I say that we really appreciate your feedback.

While harsh, it's a fair review! If you don't mind, I'm going to post back on this issue after we've made some changes. I think we actually already covered some of your issues in the next draft 👍

grabear commented 5 years ago

We've updated our proposal if you want to read it again @DannyArends 😅

https://github.com/datasnakes/beri-isc-proposal/blob/master/proposal.md

sdhutchins commented 5 years ago

We've updated our proposal if you want to read it again @DannyArends

https://github.com/datasnakes/beri-isc-proposal/blob/master/proposal.md

There are still some grammatical errors, but we'll have that worked out soon.