Open lucamlouzada opened 1 month ago
I have reviewed the different alternatives for package dependency handling. It seems the main options are:
1. `conda`, with a `conda_env.yaml` or `requirements.txt` file. This would remain optional for outside users who want to install packages directly, but would be the standard in lab projects.
   - `conda` may become slow on large projects, as it relies on a global registry of environments which can get cluttered.
2. `Renv` with Python-specific tools such as `venv`. This is not ideal, as it creates two environments and you have to alternate between them before running a script in each language. This seems to be the way things were done in the initial versions of this template (see template_archive issue #95), but it requires the virtual environments for each language to be activated in the `make.sh` scripts (or in the `run_xx.sh` scripts).
3. `Renv` alone, which can deal with both Python and R. This is a simpler approach if we want to avoid `conda` but want to keep the `make.sh` scripts as simple as possible.
   - `Renv` is not as good as `conda`, but I have tested it and it seems decent. `Renv` creates a specific folder in the root of the module, with a subfolder for Python. You can activate the Python environment and still use R as usual, without having to switch between two environments.
   - A drawback of `Renv` is that there seem to be problems when users collaborate across different operating systems, because it relies on the system to compile packages from source, while `conda` provides pre-built binaries (see here, here, and here). However, I haven't tested these issues myself.
I also reviewed other alternatives, such as Docker and Posit, but these don't seem to suit our purposes well. Docker may still be an alternative in specific projects that become too large for `conda` to handle, and we could add instructions about it on the wiki. The one option that we could consider is Pixi, a new tool built on `conda` and meant to solve many of its problems. It seems to be growing and receiving positive reviews (see here), but it's still a new tool and seems a little more complex to learn at the beginning than `conda`.
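As a rough illustration of the extra step that the `Renv` plus `venv` option adds to `make.sh`, the sketch below activates a project-local Python environment if one exists. The helper name `activate_python_env` and the default `.venv` location are assumptions made for this example, not part of the existing template, and the `bin/activate` path applies to Linux and macOS only.

```shell
#!/usr/bin/env bash
# Hypothetical helper for make.sh: activate a project-local Python venv.
# The default location ".venv" is an assumption, not a template convention.
activate_python_env() {
    local venv_dir="${1:-.venv}"
    if [ -f "${venv_dir}/bin/activate" ]; then
        # Source the activation script so it modifies the current shell.
        . "${venv_dir}/bin/activate"
        echo "Activated Python environment at ${venv_dir}"
    else
        echo "No Python environment found at ${venv_dir}" >&2
        return 1
    fi
}
```

A `make.sh` script could call `activate_python_env` before running any Python script, and fall back to the system Python (or stop) when it returns a non-zero status.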
My sense is that even though `conda` may have issues, it is the best alternative. These issues seem to arise only when projects get too large, and `conda` is still the most widely adopted tool. `Renv` could be preferred in projects where R is the main language, but I think `conda` is a more robust solution to recommend as the default template for users outside the lab, who may be working on different operating systems. `Renv` + `venv` is an intermediate solution that avoids some problems but makes the template a little more complex for users. At the end of the day, it's a trade-off between which kinds of issues we most want to avoid. We could also write guidelines for more than one of these alternatives and let users choose depending on the kind of project.
If we agree on which is the best option, I can work on setting up a simple template environment. I am also happy to keep investigating and run any more tests if you have questions or suggestions.
@Xingtong-Jiang @linxicindyzeng @ShiqiYang2022
Thanks @lucamlouzada.
A few thoughts:
An advantage of moving from the old `make.py` template to the shell-based template is that many of our projects won't require Python. I think whatever we do, we should make using `conda` optional rather than a requirement.
It seemed to me that a main attraction of `Renv` is that it can scan dependencies automatically, rather than users having to update the `requirements.txt` file manually. That seems like a big advantage to me. If we were using `Renv`, one solution to the operating system issue is to have each user compile the environment themselves when they first clone the repo.
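A minimal sketch of that first-clone step, assuming the repo commits an `renv.lock` file (the lockfile that `renv::snapshot()` writes) at the root of the module. The helper name `restore_r_env` is hypothetical; `renv::restore()` is the `Renv` call that reinstalls the packages recorded in the lockfile.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rebuild the R package library from renv.lock after cloning.
restore_r_env() {
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found; cannot restore the R environment" >&2
        return 1
    fi
    if [ ! -f renv.lock ]; then
        echo "No renv.lock in $(pwd); nothing to restore" >&2
        return 1
    fi
    # prompt = FALSE skips the interactive confirmation so this can run in a script.
    Rscript -e 'renv::restore(prompt = FALSE)'
}
```

Each user would run this once on their own machine, so packages are built for their operating system rather than shared across systems.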
I don't think we need to have a single recipe that we use universally. We just want some good defaults. I like the idea of using `Renv` for managing R dependencies and `conda` for managing Python dependencies. But if we end up finding `Renv` too clunky or not robust, we can definitely use `conda` for everything.
We need to figure out the best approach for Stata. So far what I've found works well is to put the necessary `.ado` files in `/lib/stata/` and then set up the Stata scripts so they only look at that directory for `.ado` files (to prevent users from inadvertently relying on add-on Stata commands that are only installed on their local machines).
We should also think about what the best solution is for "lightweight" repos that only have a few dependencies and where `conda` and `Renv` might be overkill. We could go back to having a `setup.sh` script that installs everything, for example.
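For a sense of scale, a `setup.sh` for such a repo could be as small as the sketch below. It assumes the Python dependencies are listed in a `requirements.txt` file; the function name `install_python_requirements` and the `DRY_RUN` switch are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical setup.sh fragment for a lightweight repo.
# Set DRY_RUN=1 to print the install command instead of running it.
install_python_requirements() {
    local req_file="${1:-requirements.txt}"
    if [ ! -f "${req_file}" ]; then
        echo "No ${req_file} found; nothing to install" >&2
        return 0
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "python3 -m pip install -r ${req_file}"
    else
        python3 -m pip install -r "${req_file}"
    fi
}
```

Unlike `conda` or `Renv`, this installs into whatever Python is active, so it gives no isolation between projects; that is the trade-off for keeping the setup minimal.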
This issue is part of an effort to implement substantive improvements to the lab template, as discussed in https://github.com/gentzkow/GentzkowLabTemplate/issues/16.
In this issue, the goal is to evaluate different alternatives for managing software and package dependencies and choose the best approach. The main points to be addressed, per the decision in plans for next steps, are:
Decision: `renv` for managing R packages and `venv` (or something similar) for Python. Play with the alternatives yourself and evaluate their pros and cons.
I am assigning myself to work on this. I will start by researching different alternatives and evaluating pros and cons. After we have settled on the preferred approach, I will implement and test.