forestgeo / learn

Links to interesting articles, videos, tutorials, tips, and more

Best Practices for Scientific Computing #17

Open maurolepore opened 7 years ago

maurolepore commented 7 years ago

Recommended by Helene Muller-Landau:

gabrielareto commented 7 years ago

This is a list of suggested basic good practices for real projects like those of CTFS:

version control for data and code

automation

separated directories in self-contained projects

documentation

Good practices for coding or programming are a different story (see issue #17). In general, though, a typical script could look like this:

1) erase everything
2) tell where the stuff is
3) set the seed and maybe other general parameters (e.g. date for version control)
4) load functionality
5) load data
6) review and clean data
7) do interesting things
8) keep results

this script-level structure is much more subjective and project-dependent than the other things, though.
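The eight steps above can be sketched as a small script. This is only an illustration under stated assumptions: it is written in Python rather than R, and the paths, the `dbh` column, and the helper names (`clean`, `mean_dbh`, `run`) are hypothetical and not taken from any CTFS project.

```python
import csv
import io
import random
import statistics

# 1) "Erase everything": keeping all state inside run() stands in for
#    starting from a clean workspace.

# 2) Tell where the stuff is (hypothetical paths, shown for illustration).
DATA_PATH = "data/raw/census.csv"
RESULTS_PATH = "results/mean_dbh.csv"

# 3) Set the seed and other general parameters.
SEED = 42


# 4) Load functionality (hypothetical helpers).
def clean(rows):
    """Keep only rows whose 'dbh' value is present and positive."""
    return [r for r in rows if r["dbh"] and float(r["dbh"]) > 0]


def mean_dbh(rows):
    """Return the mean of the 'dbh' column."""
    return statistics.mean(float(r["dbh"]) for r in rows)


def run(source, sink):
    """Run steps 3 and 5 to 8 on open file-like objects."""
    random.seed(SEED)  # nothing below is random; this marks where the seed goes
    rows = list(csv.DictReader(source))                # 5) load data
    rows = clean(rows)                                 # 6) review and clean data
    result = mean_dbh(rows)                            # 7) do interesting things
    csv.writer(sink).writerow(["mean_dbh", result])    # 8) keep results
    return result


# Small in-memory demo so the sketch runs without any files on disk.
demo_in = io.StringIO("tag,dbh\n1,10\n2,\n3,-5\n4,20\n")
demo_out = io.StringIO()
print(run(demo_in, demo_out))  # prints 15.0
```

In a real script, `run` would be called with `open(DATA_PATH, newline="")` and `open(RESULTS_PATH, "w", newline="")` instead of the in-memory buffers.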

What do you think?

gabrielareto commented 7 years ago

another one: "separate slow code from fast code".

This is linked to the separation between raw and processed data, and it justifies keeping processed data at all. If all the code is fast, processed data should exist only temporarily as objects while the script runs, not as files on disk.
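One possible way to picture this separation is a cached slow step followed by a fast step that is always recomputed. The sketch below is in Python and makes several assumptions: `slow_step`, `fast_step`, `load_or_compute`, and the JSON cache file are hypothetical stand-ins, not part of any project discussed here.

```python
import json
import os


def slow_step(raw):
    """Stand-in for an expensive computation on the raw data (hypothetical)."""
    return [x * x for x in raw]


def load_or_compute(raw, cache_path):
    """Return processed data, running slow_step only if no cached file exists."""
    if os.path.exists(cache_path):
        # The slow code already ran once: reuse the processed data file.
        with open(cache_path) as f:
            return json.load(f)
    processed = slow_step(raw)
    # Save processed data so later runs can skip the slow code.
    with open(cache_path, "w") as f:
        json.dump(processed, f)
    return processed


def fast_step(processed):
    """Cheap analysis: recomputed on every run, so its result is never saved."""
    return sum(processed)
```

With this layout, calling `fast_step(load_or_compute(raw, path))` pays for the slow code only on the first run. The cached file has to be deleted by hand whenever the raw data or the slow code changes, because this sketch does not detect stale files.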

maurolepore commented 7 years ago

In my opinion, the single most useful book for data scientists

R for Data Science, by Hadley Wickham and Garrett Grolemund.


This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. You’ll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You’ll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualising, and exploring data.

The most widely used way to manage and share code and data among R users

File > New Project... > New Directory > R Package

Package writing in RStudio (36-minute webinar)

R packages, by Hadley Wickham.

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. You’ll learn how to turn your code into packages that others can easily download and use.

Classic books that continue to be relevant today

The Pragmatic Programmer, by Andrew Hunt and David Thomas.

...illustrates the best practices and major pitfalls of many different aspects of software development.

Code Complete, by Steve McConnell.

Widely considered one of the best practical guides to programming. (...) Capturing the body of knowledge available from research, academia, and everyday commercial practice, McConnell synthesizes the most effective techniques and must-know principles into clear, pragmatic guidance.

(My ever-growing list of quotes about good practice: https://goo.gl/wQZJQj.)

gabrielareto commented 7 years ago

I think you are referring to this: "Workflow of statistical data analysis" by Oliver Kirchkamp, https://www.kirchkamp.de/oekonometrie/pdf/wf-screen2.pdf

maurolepore commented 7 years ago

Thanks Gabriel for keeping me honest. I take my comment back to avoid confusion (I removed it).