flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/
Other
13.89k stars 2.1k forks source link

[Feature]: upgrade urllib3 #3482

Closed david-waterworth closed 3 months ago

david-waterworth commented 3 months ago

Problem statement

flair requires urllib3<2.0.0 which is quite old (latest is 2.2, the 2.0 alpha was released in late 2022)

https://github.com/flairNLP/flair/blob/59bd7053a1a73da03293b0bdb68113cf383d6b9e/requirements.txt#L26C1-L26C75

I have other dependencies that require urllib3 >= 2.0 (i.e. tritonclient[http] (>=2.47.0,<3.0.0) requires urllib3 (>=2.0.7))

Solution

I'm not sure that flair directly uses urllib3, I suspect the require was added as per the comment due to 3rd party libraries not requiring urllib3 < 2 so resulting in a broken installation when urllib3 v2 was released. I would hope this is no longer a constraint and it can be removed?

Additional Context

In general I think flair imports to much by default, i.e. I have to downgrade scipy because even though I'm not using any gensim based features the fact that gensim uses deprecated scipy features that are now removed (triu) I get import errors (gensim has fixed this but not released the fix yet). Also note a recent "fix" added by gensim pins numpy < 2 which flair will inherit which could cause issues down the line.

I think most 3rd party (gensim, spacy, huggingface) functionality should be "optional" - i.e. pip install flair[gensim] to enable gensim features. I'm not a big fan of getting errors due to flair importing gensim when I'm training a tagger with transformer embeddings.

alanakbik commented 3 months ago

Hi @david-waterworth, generally agree that we have too many dependencies.

@helpmefindaname did an analysis of all dependencies and identified a core set of dependencies plus optional ones, also wanting to get rid of gensim. He proposed a solution essentially exactly like you, e.g. using something like pip install flair[gensim] to install bigger libraries.

My main problem is that gensim is used in some of our main taggers (for instance 'ner') and then this starter example would not work anymore: https://flairnlp.github.io/docs/intro#requirements-and-installation (installation plus example 1).

david-waterworth commented 3 months ago

The example appears to work for me without using gensim features? My assumption is gensim is used by the WordEmbeddings class in flair/embeddings/token.py (specifically gensim.models.KeyedVectors to load word2vec serialized files). My testing was a bit ad-hoc though so I could very well have missed something.

But tagger = Classifier.load('ner') appears to load a serialised pytorch state dict so (based on stepping through the code and putting breakpoints on lines that appear to use gensim) it seemed two work.

PS as an aside a transformer encoder fined tuned on my corpus using FLERT works extremely well!

helpmefindaname commented 3 months ago

Hi @david-waterworth

the urllib3<2.0.0 constraint was a hot fix to a dependency resolution issue (see this or this) that came up this year. We can remove that constraint again.

I agree that there are too many standard dependencies, @alanakbik will be discussing this again.

sglbl commented 3 months ago

I had the same error with gradio, I downgraded the versionn of gradio but still it was using urllib >= 2.0.

The conflict is caused by:
gradio 4.37.2 depends on urllib3~=2.0
flair 0.13.1 depends on urllib3<2.0.0 and >=1.0.0

At the end my solution was this:

pip install --use-deprecated=legacy-resolver -r requirements.txt

But it's deprecated and soon pip can remove this feature without notifying it and I'm not sure how can I install both of them together.

david-waterworth commented 3 months ago

@helpmefindaname I don't think those issues are really problems that should be fixed this way - my understanding of the issue is for packages that don't include a manifest as a separate file, poetry has to download and extract the wheel, and it does it once only and for many packages (unless you clear the cache), not just urllib. It's annoying the first time but users could always pin urllib3 in their project. From memory the main issue is boto3 doesn't support the manifest format and there's tons of different wheels which initially triggered poetry to download every boto3 wheel to check dependencies when urllib3 > 2.0.0 was released.