Revisit package dependency handling

lucamlouzada commented 1 month ago

This issue is part of an effort to implement substantive improvements to the lab template, as discussed in https://github.com/gentzkow/GentzkowLabTemplate/issues/16.

In this issue, the goal is to evaluate different alternatives for managing software and package dependencies and choose the best approach. The main points to be addressed per the decision in plans for next steps are:

Decision:

New users who want to use the template out of the box shouldn't need to know anything about dependency management. Provide clear instructions in the README on how to install the required packages.
People who don't want to use conda need to know which packages are required and then install them themselves (but in that case we definitely cannot guarantee there won't be conflicts).
Investigate whether we still need to use conda or not. The alternative is using renv for managing R packages and venv (or something similar) for Python. Play with the alternatives yourself and evaluate their pros and cons.
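
For the second point, one way to help users who skip conda is a small check that compares a requirements-style list against what is actually installed. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, it handles only bare package names and `==` pins, and it is not part of the template.

```python
from importlib import metadata


def parse_requirements(text):
    """Return {package: pinned_version_or_None} from requirements-style text."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, sep, version = line.partition("==")
        pins[name.strip().lower()] = version.strip() if sep else None
    return pins


def installed_version(name):
    """Look up the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_problems(pins, lookup=installed_version):
    """List human-readable problems: missing packages or version mismatches."""
    problems = []
    for name, wanted in sorted(pins.items()):
        found = lookup(name)
        if found is None:
            problems.append(f"{name}: not installed")
        elif wanted is not None and found != wanted:
            problems.append(f"{name}: installed {found}, expected {wanted}")
    return problems
```

A check like this can report missing or mismatched packages, but it cannot rule out conflicts between them, which is the caveat noted in the second point.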

I am assigning myself to work on this. I will start by researching different alternatives and evaluating pros and cons. After we have settled on the preferred approach, I will implement and test.

lucamlouzada commented 1 month ago

I have reviewed the different alternatives for package dependency handling. It seems the main options are:

Conda:
- Set up a template environment similar to what was done in the old template, using something like a conda_env.yaml or requirements.txt. This would remain optional for outside users who want to install packages directly, but would be the standard in lab projects.
- The main problem with this alternative is that, as discussed, conda may become slow on large projects as it relies on a global registry of environments which can get cluttered.
- Another issue is that support for R is not perfect, and therefore in projects where only R is used it might make more sense to use Renv.
Renv + venv:
- Combining Renv with Python-specific tools such as venv is not ideal, as that creates two environments and you have to switch between them before running a script in each language. This seems to be the way things were done in the initial versions of this template (see template_archive issue #95), but it requires the virtual environments for each language to be activated in the make.sh scripts (or in the run_xx.sh scripts).
Renv only:
- Renv can deal with both Python and R, which is a simpler approach if we want to avoid conda but want to keep the make.sh scripts as simple as possible
- Python support in Renv is not as good as conda's, but I have tested it and it seems decent. Renv creates a specific folder in the root of the module, with a subfolder for Python. You can activate the Python environment and still use R as usual, without having to switch between two environments.
- The main issue with Renv is that there seem to be problems when users collaborate across different operating systems, because it relies on the system to compile packages from source, while conda provides pre-built binaries (see here, here, and here). However, I haven't tested these issues myself.
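
To illustrate the activation burden in the Renv + venv option, here is a rough sketch of how a runner could choose the command for each script without activating anything in the shell. It assumes a project-local venv in a hypothetical `.venv` directory and relies on two behaviors: a venv's interpreter can be called directly by its path, and Renv normally activates itself through the project's `.Rprofile` when R starts in the project directory. The function names are made up for this example and are not part of the template.

```python
import sys
from pathlib import Path


def venv_python(venv_dir, platform=sys.platform):
    """Path to the interpreter inside a venv (the layout differs on Windows)."""
    venv_dir = Path(venv_dir)
    if platform.startswith("win"):
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def command_for(script, venv_dir=".venv", platform=sys.platform):
    """Build the command line for one script based on its file extension."""
    suffix = Path(script).suffix.lower()
    if suffix == ".py":
        # Calling the venv's interpreter by path uses the venv's packages.
        return [str(venv_python(venv_dir, platform)), script]
    if suffix == ".r":
        # Assumes Rscript is started from the project root so .Rprofile is read.
        return ["Rscript", script]
    raise ValueError(f"no runner configured for {script}")
```

The resulting list could be passed to `subprocess.run`. Whether this is simpler than activating both environments in make.sh is a judgment call that would need testing.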

I also reviewed other alternatives, such as Docker and Posit, but these don't seem to suit our purposes well. Docker may still be an alternative in specific projects that become too large for conda to handle, and we could add instructions about it on the wiki. The one option that we could consider is Pixi, which is a new tool built on conda meant to solve many of the problems of conda. It seems to be growing and receiving positive reviews (see here), but it's still a new tool and seems to be a little more complex to learn in the beginning than conda.

My sense is that even though conda may have issues, it is the best alternative. These issues seem to arise only when projects get too large, and conda is still the most widely adopted tool. Renv could be preferred in projects where R is the main language, but I think conda is a more robust solution to recommend as the default template for users outside the lab, who may be working on different operating systems. Renv + venv is an intermediate solution that avoids some problems but makes the template a little more complex for users. At the end of the day, it's a trade-off between which kinds of issues we most want to avoid. We could also write guidelines for more than one of these alternatives and let users choose depending on the kind of project.

If we agree on which is the best option, I can work on setting up a simple template environment. I am also happy to keep investigating and run any more tests if you have questions or suggestions.

@Xingtong-Jiang @linxicindyzeng @ShiqiYang2022

gentzkow commented 1 month ago

Thanks @lucamlouzada.

A few thoughts:

An advantage of moving from the old make.py template to the shell-based template is that many of our projects won't require Python. I think whatever we do we should make using conda optional rather than a requirement.
It seemed to me like a main attraction of Renv is that it can scan dependencies automatically, rather than users having to update the requirements.txt file manually. That seems like a big advantage to me. If we were using Renv, one solution to the operating system issue is to have each user compile the environment themselves when they first clone the repo.
I don't think we need to have a single recipe that we use universally. We just want some good defaults. I like the idea of using Renv for managing R dependencies and conda for managing Python dependencies. But if we end up finding Renv too clunky or not robust we can definitely use conda for everything.
We need to figure out the best approach for Stata. So far what I've found works well is to put necessary .ado files in /lib/stata/ and then have the Stata scripts set so they only look at that directory for .ado files (to prevent users from inadvertently relying on add-on Stata commands that are only installed on their local machines).
We should also think about what is the best solution for "lightweight" repos that only have a few dependencies and where conda and Renv might be overkill. We could go back to having a setup.sh script that installs everything, for example.
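
To make the automatic dependency scanning mentioned above concrete, here is a rough Python illustration of the idea: walk through R source text and collect the package names it refers to. This is not how Renv implements it (Renv has its own scanner, `renv::dependencies()`); the regular expressions and the function name `scan_r_source` are assumptions made for this sketch, and they only catch the most common patterns.

```python
import re

# Patterns for the most common ways an R script names a package.
_CALL = re.compile(
    r"""\b(?:library|require|requireNamespace)\(\s*["']?([A-Za-z][A-Za-z0-9.]*)"""
)
_NAMESPACE = re.compile(r"\b([A-Za-z][A-Za-z0-9.]*):{2,3}[A-Za-z_.]")


def scan_r_source(text):
    """Return the sorted list of package names referenced in R source text."""
    found = set()
    for raw in text.splitlines():
        # Crude comment stripping; it does not handle '#' inside strings.
        line = raw.split("#", 1)[0]
        found.update(_CALL.findall(line))       # library(pkg), require("pkg")
        found.update(_NAMESPACE.findall(line))  # pkg::fun, pkg:::fun
    return sorted(found)
```

A scan like this only finds names. Recording installed versions in a lockfile is a separate step, which Renv handles with `renv::snapshot()`.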

gentzkow / GentzkowLabTemplate

Revisit package dependency handling #21