Debugging TOL-10M dataset reproduction process

Imageomics / bioclip

This is the repository for the BioCLIP model and the TreeOfLife-10M dataset [CVPR'24 Oral, Best Student Paper].

https://imageomics.github.io/bioclip/


Debugging TOL-10M dataset reproduction process #16

Closed thompsonmj closed 2 months ago

egrace479 commented 4 months ago

@thompsonmj please check the version I updated. My copy is still running on OSC, but it does seem to be working now. I also updated the directions in #14 to match these changes.

hlapp commented 4 months ago

BTW, is there a reason we're invoking the Python interpreter through a semi-hardcoded path into the conda environment, as opposed to simply activating the conda environment first (which presumably needs to be done anyway)?

As an aside, conda environments can take up quite a lot of storage, so it's generally not a good idea to place them in one's home directory on a shared HPC. (Home directories on shared HPCs typically have quite a limited storage quota, and group or project directories are normally where a lot more storage is available, or at least where it can be added on demand.)
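A minimal sketch of how a script could confirm that it is running under the activated conda environment, rather than depending on a hardcoded interpreter path. It assumes conda's usual behavior of setting the `CONDA_PREFIX` environment variable on activation. The function name `running_in_active_conda_env` is hypothetical and not part of the bioclip repository.

```python
import os
import sys
from pathlib import Path


def running_in_active_conda_env(environ=None, prefix=None):
    """Return True if the interpreter prefix matches the activated conda env.

    `environ` and `prefix` default to os.environ and sys.prefix; they are
    parameters only so the check can be exercised without conda installed.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    conda_prefix = environ.get("CONDA_PREFIX")
    if not conda_prefix:
        # No conda environment has been activated in this shell.
        return False
    # Resolve both paths so symlinked locations compare equal.
    return Path(conda_prefix).resolve() == Path(prefix).resolve()
```

On the storage point, `conda env create` accepts a `--prefix` option, so the environment could presumably live in a project directory and be activated by path instead of by name.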

egrace479 commented 4 months ago

@samuelstevens, this should work as described in updated directions in PR #14. If you encounter any issues or something seems off, please make suggestions there too.

penfever commented 3 months ago

@samuelstevens @thompsonmj @egrace479, just curious, has this bug been resolved by this PR?

thompsonmj commented 3 months ago

Apologies for the delay @penfever. There is still a lingering issue producing the catalog where common names are not reproduced as expected. We hope to resolve it within the next week.

thompsonmj commented 2 months ago

Issue with null common names resolved.

thompsonmj commented 2 months ago

@penfever, I believe this completes the fixes needed to address #13.

Could you please attempt to build the dataset from this branch following these instructions?

Please reach out if there are additional questions and also if you are able to complete the reconstruction. It will be helpful to know that it can work in someone else's hands.

egrace479 commented 2 months ago

@thompsonmj, following this step in our directions:

conda env create -f requirements-training.yml --solver=libmamba -y
conda activate bioclip-train
pip install -e .

we wind up uninstalling and re-installing a bunch of packages like torch and huggingface_hub to get older versions (those listed in requirements.txt instead of the ones in requirements-training.yml). Seems potentially redundant to do both. Should we instead have a tol environment to do pip install -e . in?
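One way to see which packages would be swapped out is to compare the exact pins in the two files before installing. The sketch below is illustrative only: it handles simple `name==version` lines (including YAML list items under a `pip:` section) and ignores every other specifier form, such as conda's single-`=` pins. The helper names `parse_pins` and `conflicting_pins` are hypothetical.

```python
import re

# Matches simple "name==version" pins, optionally written as a YAML list item.
# Other specifier forms (>=, ~=, conda-style "name=version") are ignored.
PIN = re.compile(r"^\s*-?\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s#]+)")


def parse_pins(lines):
    """Map normalized package name -> pinned version for 'name==version' lines."""
    pins = {}
    for line in lines:
        match = PIN.match(line)
        if match:
            # Treat underscores and hyphens as equivalent in package names.
            name = match.group(1).lower().replace("_", "-")
            pins[name] = match.group(2)
    return pins


def conflicting_pins(a_lines, b_lines):
    """Return {name: (version_in_a, version_in_b)} for pins that disagree."""
    a, b = parse_pins(a_lines), parse_pins(b_lines)
    return {
        name: (a[name], b[name])
        for name in sorted(a.keys() & b.keys())
        if a[name] != b[name]
    }
```

Passing the lines of requirements-training.yml and requirements.txt to `conflicting_pins` would list the packages pinned to different versions in each, which are the ones `pip install -e .` is likely to reinstall.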

thompsonmj commented 2 months ago

> ... we wind up uninstalling and re-installing a bunch of packages ...

> We may want to consider cleaning up the environment ...

Yes, we can address this in another PR.

> Reproduction seems to be working on our end.

Right, based on my latest full dataset reproduction test run, all splits end up as expected.