MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.2k stars 767 forks source link

ci: speed up by using uv to install python packages #2082

Closed afuetterer closed 4 months ago

afuetterer commented 4 months ago

What does this PR do?

This is a proposal to use the uv for installing python packages in CI. It concurrently downloads the packages from pypi and improves the time of the install step significantly. From 3+ minutes to 30+ seconds.

What do you think?

Before submitting

MaartenGr commented 4 months ago

I love the rustification of all these packages! Haven't used uv myself but this much of a speed up is more than welcome. The only thing I need to be confirmed is that the process of installing BERTopic is technically not the same as through pip. Users are advised using pip directly to install BERTopic which might lead to a different set of dependencies although I doubt this will be the case in practice.

Having said that, when we test using uv but users install using pip are we then sure the use case of users (pip) is properly tested?

afuetterer commented 4 months ago

If you look at a car or a bike, rust is usually a bad thing. But in this case, it is really great with the massive performance improvements.

Actually I am not sure about your question. Your package requires pandas>=1.1.5, which is a really old lower bound. If pip and uv use different resolver strategies, it might indeed lead to different versions of e.g. pandas. But is that a big problem?

I don't know, its your call.

If users use uv pip install bertopic, they will have faster install times, even if you don't use it in CI.

MaartenGr commented 4 months ago

Actually I am not sure about your question. Your package requires pandas>=1.1.5, which is a really old lower bound. If pip and uv use different resolver strategies, it might indeed lead to different versions of e.g. pandas. But is that a big problem?

It might be a problem if one of those versions results in bugs that we haven't tested through the unit tests. If uv gives a different version of pandas than pip, then it is technically possible that BERTopic installed with pip might be broken somewhere that we haven't detected yet. Over the last years, BERTopic has been broken a couple of times with newer versions of dependencies whether it was due to API changes (even when it weren't major new versions) or due to wheels that were not correctly built.

If users use uv pip install bertopic, they will have faster install times, even if you don't use it in CI.

That's true, the speed-up is definitely there. I am just wondering how many users will actually be doing this. In my experience, >90% of users will just run pip install bertopic also because many users are not that familiar with python.

I'm gonna be honest with you here, I'm not sure what the chance actually is that the testing pipeline using uv is different from a bare pip and the extend to which. That makes it also difficult for me to understand the actual impact.

afuetterer commented 4 months ago

Yes, sure, things can break, if the API of a dependency changes, and there is no upper bound in place.

It's your decision. Happy to close this, if not desired. Just wanted to show the difference in installation time.

MaartenGr commented 4 months ago

Let's close it for now and re-open when we run into issues with the speed up of the pipeline or if uv becomes more of a staple amongst most users.

afuetterer commented 4 months ago

Sure, no problem.