thompsonmj closed 2 months ago
BTW, is there a reason we're invoking the Python interpreter through a semi-hardcoded path into the conda environment, as opposed to simply activating the conda environment first (which presumably needs to be done anyway)?
As an aside, conda environments can take up quite a lot of storage, so it's generally not a good idea to place them in one's home directory on a shared HPC. (Home directories on shared HPCs typically have quite a limited storage quota, whereas group or project directories are normally where far more storage is available, or at least where it can be added on demand.)
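One way to combine both suggestions is to create the environment by prefix under project storage and then activate it by that path, so no interpreter path needs to be hardcoded. A minimal sketch, not the repository's current setup: `PROJECT_DIR` and the `run` helper are hypothetical (the helper only prints commands unless `DRY_RUN=0`), and the environment file and name are the ones used in the directions in this thread.

```shell
# Hypothetical location; point this at your group or project allocation.
PROJECT_DIR="${PROJECT_DIR:-/path/to/project}"
ENV_PREFIX="${PROJECT_DIR}/envs/bioclip-train"

# Hypothetical helper: print each command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the environment under project storage instead of $HOME.
run conda env create --prefix "$ENV_PREFIX" -f requirements-training.yml

# Activate by prefix (assumes conda has been initialized for this shell),
# after which plain `python` resolves to the environment's interpreter.
run conda activate "$ENV_PREFIX"
run python --version
```

Running it as-is only echoes the three commands; setting `DRY_RUN=0` in a shell where conda is initialized would execute them.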
@samuelstevens, this should work as described in the updated directions in PR #14. If you encounter any issues or something seems off, please make suggestions there too.
@samuelstevens @thompsonmj @egrace479, just curious, has this bug been resolved by this PR?
Apologies for the delay @penfever. There is still a lingering issue producing the catalog where common names are not reproduced as expected. We hope to resolve it within the next week.
Issue with null common names resolved.
@penfever, I believe this completes the fixes needed to address #13.
Could you please attempt to build the dataset from this branch following these instructions?
Please reach out if there are additional questions and also if you are able to complete the reconstruction. It will be helpful to know that it can work in someone else's hands.
@thompsonmj, following this step in our directions:
conda env create -f requirements-training.yml --solver=libmamba -y
conda activate bioclip-train
pip install -e .
we wind up uninstalling and re-installing a bunch of packages like torch and huggingface_hub to get older versions (those listed in requirements.txt instead of the ones in requirements-training.yml). It seems potentially redundant to do both. Should we instead have a tol environment to do pip install -e . in?
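One possible way to avoid the downgrades, offered only as a sketch and not as the fix the maintainers settled on: install the package in editable mode without letting pip touch dependencies, then ask pip whether the environment is still consistent. This assumes requirements-training.yml already provides everything the package needs; the function name and the `PIP` override are hypothetical.

```shell
# Hypothetical helper: editable install that leaves conda-provided packages alone.
# PIP can be overridden (for example, to point at a specific pip executable).
install_editable_no_deps() {
    pkg_dir="${1:-.}"
    # --no-deps: install only the package itself, skipping its declared dependencies.
    "${PIP:-pip}" install --no-deps -e "$pkg_dir" &&
    # pip check: report any installed packages with missing or incompatible requirements.
    "${PIP:-pip}" check
}

# Usage, from the repository root with bioclip-train active:
#   install_editable_no_deps .
```

If `pip check` reports conflicts, the pins in requirements.txt and requirements-training.yml genuinely disagree and would need reconciling rather than skipping.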
... we wind up uninstalling and re-installing a bunch of packages ...
We may want to consider cleaning up the environment ...
Yes, we can address this in another PR.
Reproduction seems to be working on our end.
Right, based on my latest full dataset reproduction test run, all splits end up as expected.
@thompsonmj please check the version I updated. My copy is still running on OSC, but it does seem to be working now. I also updated the directions in #14 to match these changes.