dhimmel / gene-ontology

User-friendly Gene Ontology annotations
https://git.dhimmel.com/gene-ontology/

Update the Gene Ontology Annotations in 2024 #6

Open dhimmel opened 3 weeks ago

dhimmel commented 3 weeks ago

Motivated by https://github.com/dhimmel/gene-ontology/issues/5

@NegarJanani to attempt to update this repo, you can:

  1. install the conda environment
  2. follow the readme execution command

You will likely hit snags, but we can figure it out when you do

NegarJanani commented 2 weeks ago

I installed the Conda environment and executed the command. I encountered some minor errors:

  1. NetworkX Compatibility Issue
  2. Python Version Compatibility Issue

(screenshot of the error messages omitted)

As a workaround, I downgraded to Python 3.8. The code ran fine initially, but I encountered a kernel error that I couldn’t resolve:

    raise DeadKernelError("Kernel died") from None
    nbclient.exceptions.DeadKernelError: Kernel died

I would appreciate any guidance on resolving the kernel issue.

dhimmel commented 2 weeks ago

It looks like your networkx and python versions are newer than those pinned in environment.yml:

https://github.com/dhimmel/gene-ontology/blob/d57fd938f90c79152449f5cd23d3c438a19ac2f5/environment.yml#L5-L12

How did you install the conda environment? Did you try:

conda env create --file=environment.yml

It might be good to upgrade the conda environment and code to work with newer versions, but let's see if we can get the old environment to install first.

The dead kernel could be an out of memory error. How much memory do you have available? I was likely running this on a pretty beefy machine.
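One quick way to check whether memory is the culprit is to query total physical memory from the standard library before launching the notebook. This is a minimal sketch (Linux/macOS; the `sysconf` names are POSIX and may be unavailable on Windows):

```python
import os

def physical_memory_gb() -> float:
    """Rough total physical memory in GB via POSIX sysconf."""
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per page
    num_pages = os.sysconf("SC_PHYS_PAGES")  # total physical pages
    return page_size * num_pages / 1024**3

print(f"~{physical_memory_gb():.1f} GB physical memory")
```

If this number (or the free portion of it) is small, a dead kernel during a large pandas/networkx step is a plausible out-of-memory symptom.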

NegarJanani commented 2 weeks ago

I didn't use the conda env create --file=environment.yml command to install the Conda environment initially. Instead, I created the environment with:

conda create --name myenv

I have 32 GB of physical memory, but after doing some calculations on the memory used and cached files, I’m left with only about 6 GB of available memory. This might be contributing to the issue.

I’ll go ahead and install the Conda environment using:

conda env create --file=environment.yml

I’ll also try running it on another machine to see if that resolves the problem.

NegarJanani commented 2 weeks ago

I tried running:

conda env create --file=environment.yml

However, I encountered the following error:

(screenshot of the conda error omitted)

It appears that Conda couldn't find these specific package versions in the conda-forge channel.

dhimmel commented 2 weeks ago

Okay, you could unpin everything to get the latest versions and then re-add the pins with whatever versions resolve. This will require more code updates, but it might be the best choice if binaries for these old conda packages no longer exist.
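Concretely, unpinning could look like the sketch below (the environment name and package list are assumed from the pinned file; re-add pins once `conda env create` resolves successfully):

```yaml
# environment.yml with version pins removed (a sketch, not the repo's file)
name: gene-ontology
channels:
  - conda-forge
dependencies:
  - networkx
  - numpy
  - pandas
  - python
  - requests
  - notebook
  - nbconvert
  - ipykernel
  - pip
  - pip:
    - obonet
```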

Another option would be to switch to poetry for managing the environment; an example of what poetry looks like in a repo is here. Poetry is nice because it creates a lock file that includes the versions of implicit dependencies.
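For reference, a poetry setup replaces environment.yml with a `pyproject.toml` along these lines (names and version ranges here are illustrative assumptions, not the repo's actual configuration):

```toml
[tool.poetry]
name = "gene-ontology"
version = "0.1.0"
description = "User-friendly Gene Ontology annotations"

[tool.poetry.dependencies]
python = "^3.11"
networkx = "^3.3"
pandas = "^2.2"
requests = "^2.32"
obonet = "^1.1"
```

Running `poetry lock` then records every transitive dependency version in `poetry.lock`, which is what makes the environment reproducible.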

dhimmel commented 2 weeks ago

Or instead of conda or poetry, you could try the newest and snazziest option of https://docs.astral.sh/uv/.

NegarJanani commented 2 weeks ago

These are the pinned versions in the environment.yml file (gene-ontology/environment.yml, lines 5 to 14 in ae04e74):

  - conda-forge::networkx=2.6
  - conda-forge::numpy=1.24.3
  - conda-forge::pandas=2.0.3
  - conda-forge::python=3.8.19
  - conda-forge::requests=2.32.3
  - conda-forge::notebook=7.2.1
  - conda-forge::nbconvert=7.16.4
  - conda-forge::ipykernel=6.29.5
  - pip:
    - obonet==1.1.0

Additionally, minor changes to process.ipynb can be seen in commit ae04e74, specifically lines 390, 404, and 409:

    - "        graph.node[go_id][key].add(gene)\n",
    + "        graph.nodes[go_id][key].add(gene)\n",

These changes should help update the web interface and annotations. I plan to run the command on an HPC to address the memory issue. If that fails, I will consider using poetry or uv.

dhimmel commented 2 weeks ago

Nice work @NegarJanani.

Feel free to open a draft pull request if you'd like more feedback while working on these changes.

For development you could do something like the following to limit memory usage:

gene_df = utilities.read_gene_info(download_dir).head(10_000)

Have I mentioned that I'd eventually love for this to run on a scheduled basis in CI?
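A scheduled CI run could be a small GitHub Actions workflow along these lines (the file name, cron schedule, and action versions are assumptions for the sketch; `run.sh` is the repo's entry point mentioned later in this thread):

```yaml
# .github/workflows/update.yml (hypothetical)
name: Update annotations
on:
  schedule:
    - cron: "0 0 1 * *"  # monthly
  workflow_dispatch:
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: conda-incubator/setup-miniconda@v3
        with:
          environment-file: environment.yml
          activate-environment: gene-ontology
      - name: Run pipeline
        shell: bash -el {0}
        run: bash run.sh
```

Runner memory limits would matter here, which is another argument for filtering taxons early.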

NegarJanani commented 2 weeks ago

I’ve opened a pull request for the two changes I’ve made so far. I may need to make additional updates to get everything working correctly.

I’ve been running the code on an HPC for two days now. When I checked the results, I noticed a discrepancy: the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids. I also reviewed some files from the last version in 2018 and saw that the numbers have changed. The process is still running, and I’m currently assessing how long it will take and what further changes may happen.

By the way, the idea of running this on a scheduled basis using CI is excellent!

dhimmel commented 2 weeks ago

the web version shows 45 taxids, but my file from the HPC contains 1,997 taxids

To clarify, you are rerunning with newer/current data rather than reusing the old data?

The increase from 45 species to 1,997 is a lot. For development, I'd limit to a couple species like human and rat.

NegarJanani commented 1 week ago

I believe the links in run.sh are pointing to the latest version of the data. I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes back to the repository.

dhimmel commented 1 week ago

I can either limit the number of taxons in the utilities.py script and rerun it, or I can keep only the 45 taxons and remove the others before pushing the changes

I would limit the taxons somewhere in the code, possibly to the 45 that were already supported. You will want the benefits of filtering taxons as early in the processing pipeline as possible to save computation.
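The early-filter pattern could look like this sketch with pandas (the column names follow NCBI gene_info conventions and the taxids-to-keep set is a stand-in for the previously supported 45):

```python
import pandas as pd

# Taxids to retain, e.g. human and rat (stand-in for the supported set).
KEEP_TAXIDS = {9606, 10116}

# Stand-in for the full gene table loaded from download_dir.
gene_df = pd.DataFrame({
    "tax_id": [9606, 10116, 10090, 7227],
    "GeneID": [7157, 24842, 22059, 43851],
})

# Filter immediately after loading, before any expensive joins
# or GO graph processing, so downstream steps only see kept taxons.
gene_df = gene_df[gene_df["tax_id"].isin(KEEP_TAXIDS)]
print(sorted(gene_df["tax_id"]))  # [9606, 10116]
```

Filtering at load time keeps both memory use and runtime proportional to the species you actually publish.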