ga4gh / vrs-python

GA4GH Variation Representation Python Implementation
https://github.com/ga4gh/vrs
Apache License 2.0
51 stars 27 forks source link

Eliminate "extra" dependencies #389

Open jsstevenson opened 6 months ago

jsstevenson commented 6 months ago

I'm not sure if a firm decision has been made on this but I've heard a few thoughts expressed on the matter:

edit: had another thought to toss out: move extras to a different module level, a la

src
└── ga4gh
    ├── core
    ├── vrs    
    └── extras. # or "tools" or w/e
        ├── __init__.py
        ├── decorators.py
        ├── object_store.py
        ├── translator.py
        └── vcf_annotation.py
        └── utils
            └── hgvs_tools.py
korikuzma commented 6 months ago

@ga4gh/vrs-python-maintainers @ga4gh/vr-maintainers should provide thoughts

korikuzma commented 6 months ago
  • Rename "extras" to something more expressive ("tools" or something?)

I am pro more meaningful name. Tools is fine to me

  • Merge the extra dependencies into the main dependencies
  • Separate the models/serialization business and the extra tooling into separately installable packages ("core" vs "tools" ?)

I would prefer to keep them separate since in gene-normalizer, we only use the Pydantic models and would like to keep deps lightweight

jsstevenson commented 6 months ago

I would prefer to keep them separate since in gene-normalizer, we only use the Pydantic models and would like to keep deps lightweight

Yeah I agree that simply merging the dependencies would cause some unnecessary difficulties for some niche but not infrequent use cases. I guess the question is

1) if we keep in same installable, is it weird to have the requirements for the most common use cases live under an optional dependency group?

2) if we separate out into different packages, will it suck to maintain a models/core/serialization package separately from a tools (translator, annotator) package?

final note: I noticed a few days ago that NetworkX is doing something like this -- you can theoretically install it without Numpy, etc but will hit import errors very quickly, so they recommend you just install with the optional [default] dependency group: https://networkx.org/documentation/stable/install.html#install-the-released-version

theferrit32 commented 6 months ago

I think it's okay to have a frequently used optional dependencies target. The core library functionality of the models and serialization stuff should be usable without those. Not being able to normalize alleles without it is a bit of a limitation.

If there wasn't such a size difference in the installed size between them I'd say it wouldn't be that big of a deal to merge them. Adding the extras adds a lot (though I would expect most users to add them).

docker run -it --rm python:3.11 bash -c 'du -hs /usr/local/lib/python3.11/site-packages'

13M /usr/local/lib/python3.11/site-packages

docker run -it --rm python:3.11 bash -c 'pip install "ga4gh.vrs @ git+https://github.com/ga4gh/vrs-python" && du -hs /usr/local/lib/python3.11/site-packages'

34M /usr/local/lib/python3.11/site-packages

docker run -it --rm python:3.11 bash -c 'pip install "ga4gh.vrs[extras] @ git+https://github.com/ga4gh/vrs-python" && du -hs /usr/local/lib/python3.11/site-packages'

182M    /usr/local/lib/python3.11/site-packages
jsstevenson commented 5 months ago

@theferrit32 ~40MB of that is PySam (by way of HGVS -> SeqRepo -> PySam), which maybe explains why the DataProxy stuff is reimplemented here instead of being imported from SeqRepo...