dhimmel opened 3 weeks ago
I installed the Conda environment and executed the command, but encountered some minor errors.
As a workaround, I downgraded to Python 3.8. The code ran fine initially, but I encountered a kernel error that I couldn’t resolve:
raise DeadKernelError("Kernel died") from None
nbclient.exceptions.DeadKernelError: Kernel died
I would appreciate any guidance on resolving the kernel issue.
It looks like your networkx and python versions are newer than those pinned in environment.yml:
How did you install the conda environment? Did you try:
conda env create --file=environment.yml
It might be good to upgrade the conda environment and code to work with newer versions, but let's see if we can get the old environment to install first.
The dead kernel could be an out of memory error. How much memory do you have available? I was likely running this on a pretty beefy machine.
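One quick way to check how much memory is actually available before launching the notebook is via `os.sysconf` (a Linux-specific sketch; these sysconf names are not available on all platforms):

```python
import os

# Available physical memory, as reported by the kernel (Linux-specific sysconf names)
page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")
print(f"available: {page_size * avail_pages / 1024**3:.1f} GiB")
```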
I didn't use the conda env create --file=environment.yml command to install the Conda environment initially. Instead, I created the environment with:
conda create --name myenv
I have 32 GB of physical memory, but after accounting for memory in use and cached files, only about 6 GB is actually available. This might be contributing to the issue.
I’ll go ahead and install the Conda environment using:
conda env create --file=environment.yml
I’ll also try running it on another machine to see if that resolves the problem.
I tried running:
conda env create --file=environment.yml
However, I encountered the following error:
It appears that Conda couldn't find these specific package versions in the conda-forge channel.
Okay, you could unpin everything to get the latest versions and then re-add the pins with whatever versions resolve. This will require more code updates, but might be the best choice if these old conda package binaries no longer exist.
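As a sketch of the unpinning approach (assuming the same channel layout as the current environment.yml), drop the version constraints, let conda resolve, then re-add pins for whatever versions it selects:

```yaml
# sketch: environment.yml with pins removed; re-pin after `conda env create` resolves
dependencies:
  - conda-forge::python
  - conda-forge::networkx
  - conda-forge::numpy
  - conda-forge::pandas
  - conda-forge::requests
  - conda-forge::notebook
  - conda-forge::nbconvert
  - conda-forge::ipykernel
  - pip:
    - obonet
```

After the environment builds, `conda env export` reports the resolved versions, which can be copied back in as pins.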
Another option would be to switch to poetry for managing the environment (example of what poetry looks like in a repo here). Poetry is nice because it creates a lock file that includes versions of implicit dependencies.
Or instead of conda or poetry, you could try the newest and snazziest option of https://docs.astral.sh/uv/.
The pinned versions in the environment.yml file (gene-ontology/environment.yml, lines 5 to 14 in ae04e74) are:
- conda-forge::networkx=2.6
- conda-forge::numpy=1.24.3
- conda-forge::pandas=2.0.3
- conda-forge::python=3.8.19
- conda-forge::requests=2.32.3
- conda-forge::notebook=7.2.1
- conda-forge::nbconvert=7.16.4
- conda-forge::ipykernel=6.29.5
- pip:
  - obonet==1.1.0
Additionally, minor changes in process.ipynb can be seen here:
commit ae04e74, specifically lines 390, 404, and 409.
Before: graph.node[go_id][key].add(gene)
After:  graph.nodes[go_id][key].add(gene)
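For context, this rename reflects networkx 2.x deprecating and later removing the `Graph.node` attribute in favor of the `Graph.nodes` view. A minimal sketch of the updated attribute access (the `go_id`, `key`, and `gene` values here are hypothetical stand-ins, not taken from process.ipynb):

```python
import networkx as nx

# Hypothetical values standing in for those used in the notebook
go_id, key, gene = "GO:0008150", "genes", "GeneID:12345"

graph = nx.DiGraph()
graph.add_node(go_id, **{key: set()})

# Old (removed) API: graph.node[go_id][key].add(gene)
# Current API uses the .nodes view:
graph.nodes[go_id][key].add(gene)
assert gene in graph.nodes[go_id][key]
```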
These changes should help update the web interface and annotations. I plan to run the command on an HPC to address the memory issue. If that fails, I will consider using poetry or uv.
Nice work @NegarJanani.
Feel free to open a draft pull request if you'd like more feedback while working on these changes.
For development you could do something like the following to limit memory usage:
gene_df = utilities.read_gene_info(download_dir).head(10_000)
Have I mentioned that I'd eventually love if we get this to run on a scheduled basis on CI?
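A scheduled CI run could look something like the following sketch (a hypothetical GitHub Actions workflow file, e.g. .github/workflows/rebuild.yml; the schedule and setup steps are assumptions, not anything currently in the repo):

```yaml
# sketch: monthly scheduled rebuild via GitHub Actions (hypothetical workflow)
name: rebuild
on:
  schedule:
    - cron: "0 0 1 * *"  # first day of each month
  workflow_dispatch:
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: conda-incubator/setup-miniconda@v3
        with:
          environment-file: environment.yml
      - run: bash run.sh
        shell: bash -el {0}
```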
I’ve opened a pull request for the two changes I’ve made so far. I may need to make additional updates to get everything working correctly.
I’ve been running the code on an HPC for two days now. When I checked the results, I noticed a discrepancy: the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids. I also reviewed some files from the last version in 2018 and saw that the numbers have changed. The process is still running, and I’m currently assessing how long it will take and what further changes may happen.
By the way, the idea of running this on a scheduled basis using CI is excellent!
the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids
To clarify, you are rerunning with newer/current data rather than reusing the old data?
The increase from 45 taxids to 1,997 is a lot. For development, I'd limit to a couple of species like human and rat.
I believe the links in run.sh are pointing to the latest version of the data. I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes back to the repository.
I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes
I would limit the taxons somewhere in the code, possibly to the 45 that were already supported. You will want the benefits of filtering taxons as early in the processing pipeline as possible to save computation.
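One way to filter taxons early, as a sketch (the `tax_id` column name follows NCBI gene_info conventions, and the example DataFrame is hypothetical, not the actual utilities.py code):

```python
import pandas as pd

# Hypothetical subset of supported taxids (human and rat, per the discussion above)
SUPPORTED_TAXIDS = {9606, 10116}

# Stand-in for the frame returned by utilities.read_gene_info
gene_df = pd.DataFrame({
    "tax_id": [9606, 10116, 7227, 9606],
    "GeneID": [1, 2, 3, 4],
})

# Filter as early as possible so every downstream step touches fewer rows
gene_df = gene_df[gene_df["tax_id"].isin(SUPPORTED_TAXIDS)]
```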
Motivated by https://github.com/dhimmel/gene-ontology/issues/5
@NegarJanani to attempt to update this repo, you can:
You will likely hit snags, but we can figure it out when you do.