**Closed**: arisp99 closed this 2 years ago
@arisp99 I tried specifying versions before, but it did not work very well. In almost all cases, if you have such a big software list and try to specify versions, they will not be compatible. If versions are not specified, mamba does a good job of resolving dependencies and installing the packages. The idea that if we specify versions, each build will be exactly the same, hence better reproducibility, is very appealing, but it did not work for me before. What do you think? Do the advantages outweigh the troubles in your opinion?
If this is more about documenting which versions are installed, we can record the installed packages after each build with `conda list` or something similar.
So, with a new build you initially let conda resolve, but when you freeze a version, can you lock the versions of the underlying code (or at least dump a list that we then add so someone could force the versions)? This would provide real reproducibility. The other option is to be able to build proprietary stuff like bcl2fastq outside and then add it to another directory.
That is a very valid point, @aydemiro; I hadn't thought of that. But you are correct: I can totally see instances where you want to update a package and then run into dependency conflicts. To be honest, I am usually more of a fan of updating software, as there are new features and bug fixes that can be useful.
Thinking about what @JeffAndBailey proposed, I see that you can install packages from a `requirements.txt` file using `mamba install --file requirements.txt`. Something we could do is let `mamba` resolve all conflicts on the initial build and, when our build is complete, save a `requirements.txt` somewhere that lists all of the package versions. To do this we can use `mamba list --export`. Users may then be able to rebuild the container using this saved `requirements.txt` file.

We could even have a check in the definition file to see if `requirements.txt` exists. If it does, install using the file; otherwise, install and let `mamba` resolve conflicts.
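A minimal sketch of that check as plain shell. The `mamba` invocations here are echoed rather than executed so the branching can be exercised without a conda installation; in a real `%post` section they would run. The helper name and the `<packages>` placeholder are illustrative, not from the actual definition file:

```shell
#!/bin/sh
# Hypothetical helper: decide how to install based on whether a pinned
# requirements.txt already exists in the given directory. Commands are
# echoed, not run, so this sketch works without mamba installed.
pick_install() {
    if [ -f "$1/requirements.txt" ]; then
        # Reproducible rebuild: use the saved pins.
        echo "mamba install --yes --file $1/requirements.txt"
    else
        # First build: let mamba resolve, then snapshot the versions.
        echo "mamba install --yes <packages>; mamba list --export > $1/requirements.txt"
    fi
}

dir=$(mktemp -d)
pick_install "$dir"    # first build: resolve and export
touch "$dir/requirements.txt"
pick_install "$dir"    # rebuild: install from the saved pins
rm -rf "$dir"
```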
Some quick questions thinking about this more: for copying files in and out of the container, do we use the `%files` section?
Let's see if we can download and build bcl2fastq externally and install it as a working version with any needed libraries or accessory files. If that is possible, then our fixed builds, sans bcl2fastq, will be fine for reproducibility.
I was planning to move the conda installation to an environment-based system where we have an `environment.yml` file for the base environment in the repository, instead of listing all packages without versions in the definition file. We can then employ something like this:

- If `environment_versioned.yml` exists: `mamba env create -f environment_versioned.yml`
- Otherwise:
  ```
  mamba env create -f environment.yml
  conda activate base
  mamba env export > environment_versioned.yml
  ```
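For reference, `mamba env export` writes YAML along these lines (the packages and versions shown here are illustrative, not the actual MIPTools list); this is the file the versioned branch would later consume:

```yaml
name: base
channels:
  - conda-forge
dependencies:
  - python=3.9.7=hf930737_3      # exact versions and build strings are captured
  - numpy=1.21.2=py39hdbf815f_0
prefix: /opt/conda
```

Because the export pins exact builds, re-creating an environment from it should reproduce the original conda packages exactly.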
As for the bcl2fastq issue, I agree that we should explore building the software outside and providing the binary to the container as a binding. However, this is a compiled C++ program, and how to create a portable binary is beyond my capabilities at the moment. Nick is probably the best person to consult on this.
@arisp99 I think we have to use the `%setup` section for copying from the container to the host and `%files` for copying to the container from the host.
> I was planning to move the conda installation to an environment-based system where we have an environment.yml file for the base environment in the repository.

This seems similar to just using a `requirements.txt` file. Do you think an environment-based system would be more beneficial, @aydemiro?
> @arisp99 I think we have to use the `%setup` section for copying from the container to the host and `%files` for copying to the container from the host.
Yes! That looks right. So, hashing this out a bit further, in our `%files` section we would have the following line of code:
```
%files
# could be either requirements or environment
environment* /opt/conda
```
Then, as you write:

- If `environment_versioned.yml` exists: `mamba env create -f /opt/conda/environment_versioned.yml`
- If the versioned file doesn't exist:
  ```
  mamba env create -f /opt/conda/environment.yml
  conda activate base
  mamba env export > /opt/conda/environment_versioned.yml
  ```
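Put together, the two branches above might look like this in the definition file's `%post` section. This is a sketch, not the actual MIPTools definition file; it assumes the environment files were staged to `/opt/conda` by the `%files` line and that `conda activate` works in the build shell:

```
%post
    # Use pinned versions if a snapshot exists, else resolve fresh and snapshot.
    if [ -f /opt/conda/environment_versioned.yml ]; then
        mamba env create -f /opt/conda/environment_versioned.yml
    else
        mamba env create -f /opt/conda/environment.yml
        conda activate base
        mamba env export > /opt/conda/environment_versioned.yml
    fi
```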
Lastly, in the `%setup` section, we have:

```
cp ${SINGULARITY_ROOTFS}/opt/conda/environment_versioned.yml environment_versioned.yml
```
> As for the bcl2fastq issue, I agree that we should explore building the software outside and providing the binary to the container as a binding. However, this is a compiled C++ program, and how to create a portable binary is beyond my capabilities at the moment. Nick is probably the best person to consult on this.
I agree with all this re the bcl2fastq installation. It would be awesome if you could just plop the binary into the container. I think it makes sense to address this as a separate issue for now, as it seems a bit complex... For now, let's try to finalize whether we want a `requirements.txt` or an `environment.yml` file to move ahead, and revisit bcl2fastq in a separate issue.
> I was planning to move the conda installation to an environment-based system where we have an environment.yml file for the base environment in the repository.
>
> This seems similar to just using a requirements.txt file. Do you think an environment-based system would be more beneficial, @aydemiro?
I explored this question a bit more, and it seems that an `environment.yml` file is actually better, as it gives us more options to configure the conda environment. We can specify the channels we want to install packages from and even install pip packages using this framework.
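As a sketch, an `environment.yml` of that shape could look like the following (the channel and package names here are illustrative, not the actual MIPTools dependency list):

```yaml
name: base
channels:
  - conda-forge
  - bioconda
dependencies:
  - mamba               # the package manager itself; pinned in practice
  - samtools            # example conda package, left for mamba to resolve
  - pip
  - pip:
      - example-pip-package   # hypothetical pip dependency
```

A plain `requirements.txt` has no direct place for channels or pip dependencies, which is the main draw of the environment-file approach.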
I have now configured MIPTools to install mamba packages using an environment file. In the definition file, we first check whether an `environment_versioned.yml` file exists. If it does, we use it for installation. Otherwise, we install from our clean `environment.yml` file, which does not pin software versions.
One important thing to note is that we are actually unable to copy files to the host during the build of our container. The `%setup` section is executed before the `%post` section, so our packages will not have been installed yet. Given this, I think the best course of action is to include a note somewhere in the documentation indicating that the user can copy the `environment_versioned.yml` from the container using `singularity exec`:

```
singularity exec miptools.sif cat /opt/environment_versioned.yml > environment_versioned.yml
```

and that if this `environment_versioned.yml` file is present in the directory when building, it will be used to specify package versions for software.
@aydemiro and @JeffAndBailey, if you have no additional comments, I will go ahead and merge this PR early next week.
This PR specifies the version number for most of the installed software in the MIPTools container.
Closes #31.
Checklist for software

- Installed via `apt-get`
- Installed via `git`
  - msa2vcf (through `jvarkit`)
  - vt
  - MIPWrangler
  - elucidator
- Installed via `wget`
  - conda (via `miniconda`)
    - With `miniconda`, the version of `conda` is automatically updated. However, as we use `mamba` as our package manager, it is fine to leave this as is. We do specify the version number for `mamba`.
- Installed via `install.packages()`
  - magrittr
  - McCOILR
  - rehh
- Installed via `conda`
  - mamba
- Installed via `mamba`