How to write a science module

freeman-lab commented 8 years ago

The goal here is to collect thoughts on principles for writing modular, composable, and reusable code in science. The idea is heavily influenced by @substack's fantastic "How I write modules", and by wanting something like it that's focused on science, and maybe one level more general. @blahah recently pointed us to this bioinformatics document which has a similar spirit.

To kick things off, I've listed a few of my own thoughts. This is just my perspective, others should comment on these and/or add their own ideas! My hope is that we can generate a bunch of ideas here and then turn it into a curated document or maybe an interactive tutorial.

Defining a problem

Before starting anything, it's good to define the problem you want to solve. I think a lot of problems can be categorized along two dimensions: complexity and generality. Complexity reflects the breadth and scope and potential impact of what you're trying to tackle, and generality is the set of domains in which your solution can be applied.

Here's a little picture of the space:

            |                                   
            |    ---         --+  
            |
            |
complexity  |
            |
            |    -++         +++        
            |      
      (0,0)  ------------------------
                  generality

In my experience, a lot of code in science is in the upper left: it tries to do something incredibly complex, but for a very specific problem domain. A classic example is setting out to build "the analysis pipeline for my lab" all at once. This might be tempting because it solves all your problems, and keeps it all in one place. But there are many disadvantages to this approach! It will be hard for any other projects or labs to use your solutions (which they should!), and it will be hard for you to maintain or extend your solutions as your science evolves (which it will!).

Another common pattern, in the upper right, is to take on a project that's big and complex and applies to many domains. These are big "frameworks" or "libraries", and might include things like "general-purpose machine learning" or "image processing". Many such libraries exist in science, and some are useful! But there are some important disadvantages to this approach. A monolithic project can be hard to extend or change, especially if some parts change more frequently than others — which is why big projects are often backed by huge teams. Big projects can also be hard to test, because they have many interrelated pieces. And anyone who wants to use or help work on your project needs to deal with the whole thing, even if they just want to use, or improve, a small piece.

In the other extreme, the lower left, we can design modules that are very simple and have very small domains. This hyper-specificity is sometimes necessary, and if the complexity is low, it'll be easy for you to document and maintain. But without generality, it's unlikely that anyone else will be able to use your module!

To me, the best kind of module for science aims at the lower right: not too complex and narrow in scope, but otherwise as general as possible. By limiting the complexity, you ensure that it stays manageable and maintainable. But by pushing for generality, and avoiding domain-specific jargon, you ensure that other people can use and compose your modules with theirs, including both inside and outside of science!

When making sure my project isn't getting too complex, I like to ask myself: can I solve this problem in days/weeks or will it take months/years? How hard will it be to modify in the future? When making sure my idea is general, I try to ask myself: can another lab use my solution? Can someone outside of my particular field of science use, and understand, what I've done?
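As a rough illustration of aiming for the lower right, here is a minimal Python sketch. The function name `zscore` and the example numbers are made up for this illustration; the point is only that a small function named in general terms (rather than something like "normalize my lab's fluorescence traces") can be reused well outside the field it was written for.

```python
import math
from typing import List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) standard deviation.

    Nothing here mentions a particular experiment or field, so the same
    function works for fluorescence traces, temperatures, or anything else.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("values have zero variance")
    return [(v - mean) / std for v in values]


# Made-up example data, purely for illustration.
scores = zscore([2.0, 4.0, 6.0])
```

A lab-specific pipeline could still call `zscore` internally; the difference is that other people can use the small piece without adopting the whole pipeline.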

Is your problem already solved?

Once you've defined your problem, find out if someone's already solved it! This can be hard, especially if you are new to a language. For most languages, there is a website that lets you directly search existing packages: CRAN (for R), PyPI (for Python), npm (for Node.js), or pkg.julialang (for Julia). But curated lists are even more useful. These awesome lists are a good start, but some are still quite general:

Python [1]
Node.js [1] [2] [3]
R [1]
Julia [1]
Machine learning [1]

_We should assemble a cross-language awesome list of modules for science!_

If someone has kind of solved your problem, but not exactly, and made it available as open source, this is a great opportunity to contribute! Maybe with some small changes their solution would work for your problem. Open an issue on their GitHub repository to start a friendly conversation! If the changes are too large and out of scope, that's ok too! You could fork their project and start from there, or start from scratch. Just like in science, it's ok to have a couple different takes on the same problem — we just don't want every single lab reinventing everything from scratch.

Think like a user

Once you've picked a problem and decided to write some code, don't start implementing your solution quite yet! Instead, write a small example of how someone would use your module, to define how people will interface with it.

Put yourself in the shoes of someone unfamiliar with your approach: what would be an intuitive experience for them? Actually write snippets that use the code you want to develop. If you're writing something for data analysis, get some example data and show what you'd do with it. If you're writing a simulation, show how you'd invoke it and what the result would be. If you get stuck, find examples of modules or libraries you like to use, and draw inspiration! When doing machine learning or statistics, for example, I often look at the design and user experience of scikit-learn.
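To make this concrete, here is a minimal Python sketch of the idea. Everything in it is hypothetical: the class name `ThresholdDetector`, its `n_std` parameter, and the example data are invented for illustration, and the `fit`/`predict` shape is only loosely borrowed from the style of scikit-learn. The three lines at the bottom are the "user" snippet you would write first; the class above them is the smallest thing that makes the snippet run.

```python
import statistics
from typing import List, Sequence


class ThresholdDetector:
    """Hypothetical module: flags values that sit far above a baseline."""

    def __init__(self, n_std: float = 2.0):
        self.n_std = n_std
        self.threshold_ = None

    def fit(self, baseline: Sequence[float]) -> "ThresholdDetector":
        # Learn a cutoff from baseline data: mean plus n_std standard deviations.
        mean = statistics.mean(baseline)
        spread = statistics.pstdev(baseline)
        self.threshold_ = mean + self.n_std * spread
        return self

    def predict(self, values: Sequence[float]) -> List[int]:
        # Return the positions of values above the learned cutoff.
        if self.threshold_ is None:
            raise RuntimeError("call fit() before predict()")
        return [i for i, v in enumerate(values) if v > self.threshold_]


# The usage snippet, drafted before the class above existed.
baseline = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0]  # made-up example data
detector = ThresholdDetector(n_std=2.0).fit(baseline)
events = detector.predict([2.0, 9.0, 1.0, 8.0])
```

Writing the last three lines first forces decisions about names, argument order, and return types before any implementation detail gets in the way.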

What else?

Do these seem useful? What other topics do people feel are important? What other resources could we provide? Let us know in the comments.

olgabot commented 8 years ago

Here's a question: I currently have a project somewhere in the upper-center (flotilla) which is far too large and complex, but still does some useful stuff (for me at least). I've been slowly chipping away at different submodules and have moved some of the code out to smaller, much more manageable packages (anchor, outrigger, and astrolabe), but it's still hard to maintain these smaller ones and build their tests and everything out. It's been a frustrating experience.

Given that many software projects are NOT in the lower right corner, but presumably would like to be there, how do you suggest scientist-developers (I just made that up) work with their existing codebases to successfully break them off into smaller, more manageable projects?

GrantRVD commented 8 years ago

This is just one more resource, but since you mentioned looking at the scikit-learn module for design inspiration, I thought this scikit "template" repository could also serve as a good starting point for some module makers, whether or not they want to officialli

blahah commented 8 years ago

@olgabot flotilla looks very nice! I've faced the same issue before, and have found that most of the fiddly stuff comes from (1) the infrastructure, (2) redesigning the API, and (3) actively developing multiple interdependent packages at once.

For (1), I use a project generator like yeoman to create my personal favourite style of directory structure for the language in question, fill in the required metadata files, create the testing setup etc. Then I just copy over the code/tests from the larger project. In general, automating the tedious tasks becomes more important when you start making many small tools.
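As a rough picture of what that automation can look like, here is a minimal Python sketch. It is not how yeoman works; the `scaffold` function, the directory layout, and the package name `mymodule` are all assumptions chosen for illustration, and a real generator would also fill in metadata files and a fuller test setup.

```python
import tempfile
from pathlib import Path


def scaffold(root: Path, name: str) -> Path:
    """Create a bare-bones package layout (hypothetical; adapt to taste)."""
    project = root / name
    (project / name).mkdir(parents=True)   # source package directory
    (project / "tests").mkdir()            # test directory
    (project / name / "__init__.py").write_text("")
    (project / "tests" / f"test_{name}.py").write_text(
        f"import {name}\n\n\ndef test_import():\n    assert {name} is not None\n"
    )
    (project / "README.md").write_text(f"# {name}\n")
    return project


# Try it in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold(Path(tmp), "mymodule")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
```

Once the skeleton exists, the code and tests copied over from the larger project have an obvious place to go.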

For (2), I don't really have any special guidance - I just use normal UX design principles and practises - user stories, etc.

For (3) I usually symlink in the under-development repos for the dependencies to the place the package manager for the language in question looks for them, or point the manager at a dev branch for the dependencies.
