crate / cratedb-examples

A collection of clear and concise examples of how to work with CrateDB.
Apache License 2.0

RAG: Problems resolving dependencies on Google Colab #424

Closed by amotl 5 months ago

amotl commented 5 months ago

Problem

@hammerhead reported a flaw with the cratedb_rag_customer_support_langchain.ipynb Notebook when invoked on Google Colab.

Dependency resolution around Dask fails, or rather takes ages to complete, if it completes at all.

Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.5.2-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 65.4 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.5.1-py3-none-any.whl (871 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 871.6/871.6 kB 54.8 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.5.1-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 57.8 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.5.0-py3-none-any.whl (856 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 856.7/856.7 kB 60.6 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.5.0-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 59.9 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.4.2-py3-none-any.whl (856 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 856.7/856.7 kB 61.6 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.4.2-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 61.3 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.4.1-py3-none-any.whl (855 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 855.5/855.5 kB 59.3 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.4.1-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 60.5 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.4.0-py3-none-any.whl (853 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 853.8/853.8 kB 52.3 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.4.0-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 56.7 MB/s eta 0:00:00
Collecting distributed (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading distributed-2022.3.0-py3-none-any.whl (851 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 851.2/851.2 kB 54.7 MB/s eta 0:00:00
Collecting dask (from fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9))
  Downloading dask-2022.3.0-py3-none-any.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 60.3 MB/s eta 0:00:00
Requirement already satisfied: httplib2>=0.9.1 in /usr/local/lib/python3.10/dist-packages (from oauth2client>=1.5.2->gcsfs->fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3->pueblo[cli,fileio,nlp]>=0.0.7->-r https://raw.githubusercontent.com/crate/cratedb-examples/main/topic/machine-learning/llm-langchain/requirements.txt (line 9)) (0.22.0)
  Downloading dask-2023.8.1-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 7.8 MB/s eta 0:00:00
INFO: pip is looking at multiple versions of distributed to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of distributed to determine which version is compatible with other requirements. This could take a while.
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
ERROR: Operation cancelled by user
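
The log shows pip stepping backwards through successive releases of dask and distributed, which is the typical signature of resolver backtracking. As a minimal sketch (not part of the original report), the following Python snippet counts how many distinct versions of each package appear in such a log. The function name `versions_tried` and the regular expression are assumptions made for illustration, and only wheel downloads are matched.

```python
import re
from collections import defaultdict

# Matches pip's "Downloading <name>-<version>-<tags>.whl" lines (wheels only).
# The pattern is an illustrative assumption, not an official pip log format spec.
DOWNLOAD_RE = re.compile(
    r"^\s*Downloading (?P<name>[A-Za-z0-9_.]+)-(?P<version>\d[^-\s]*)-"
)


def versions_tried(log_text):
    """Map each package name to the versions pip downloaded, in log order."""
    tried = defaultdict(list)
    for line in log_text.splitlines():
        match = DOWNLOAD_RE.match(line)
        if match and match["version"] not in tried[match["name"]]:
            tried[match["name"]].append(match["version"])
    return dict(tried)


# Short excerpt of the log above, with progress bars and "Collecting" lines omitted.
LOG_EXCERPT = """\
  Downloading dask-2022.5.2-py3-none-any.whl (1.1 MB)
  Downloading distributed-2022.5.1-py3-none-any.whl (871 kB)
  Downloading dask-2022.5.1-py3-none-any.whl (1.1 MB)
  Downloading distributed-2022.5.0-py3-none-any.whl (856 kB)
"""

for name, versions in versions_tried(LOG_EXCERPT).items():
    # More than one version per package suggests the resolver is backtracking.
    print(f"{name}: {len(versions)} versions tried: {', '.join(versions)}")
```

Running this over the full log would list every dask and distributed release the resolver attempted before the run was cancelled.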

Thoughts

This looks related to the Python 3.11.9 vs. Dask hiccup from last week.

References

Possibly related. I will execute this first; it may yield some insights.

@hammerhead also provided a fix already.

amotl commented 5 months ago

Observations

@hammerhead reported that he experienced the problems in the area of pueblo[fileio]. As of its v0.0.9, this is the list of dependencies of this extra:

fileio = [
  "fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3",
  "pathlibfs<0.6",
  "python-magic<0.5",
  "yarl<1.10",
]

-- https://github.com/pyveci/pueblo/blob/v0.0.9/pyproject.toml#L95-L100

By contrast, this is the list of dependencies at the current development head. It differs only in the version bound on fsspec, which, in turn, also pulls in dask, but without specifying any version.

fileio = [
  "fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.4",
  "pathlibfs<0.6",
  "python-magic<0.5",
  "yarl<1.10",
]

-- https://github.com/pyveci/pueblo/blob/71ff638/pyproject.toml
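
To make the comparison of the two lists concrete, here is a small sketch in Python that diffs them as plain strings. The requirement strings are copied from the two excerpts above; the helper name `diff_requirements` is hypothetical, and the comparison is purely textual, so it does not interpret version specifiers.

```python
# Requirement strings copied from the two pyproject.toml excerpts above.
FILEIO_V0_0_9 = [
    "fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.3",
    "pathlibfs<0.6",
    "python-magic<0.5",
    "yarl<1.10",
]
FILEIO_HEAD = [
    "fsspec[abfs,dask,gcs,git,github,http,s3,smb]<2024.4",
    "pathlibfs<0.6",
    "python-magic<0.5",
    "yarl<1.10",
]


def diff_requirements(old, new):
    """Return (removed, added) requirement strings, preserving list order."""
    removed = [req for req in old if req not in new]
    added = [req for req in new if req not in old]
    return removed, added


removed, added = diff_requirements(FILEIO_V0_0_9, FILEIO_HEAD)
for req in removed:
    print(f"- {req}")
for req in added:
    print(f"+ {req}")
```

Only the fsspec entry shows up in the diff, and neither variant places a bound on dask or distributed themselves.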

Thoughts

I don't see any immediate problems with the combination of dependencies, but the devil may well be in the details.

amotl commented 5 months ago

Trial-and-error version pinning

Both of those unpinned dependencies had recent releases:

Given that the local topic/machine-learning/llm-langchain/requirements.txt was updated on Mon Apr 15 2024, only the update to google-cloud-aiplatform 1.48.0 might be relevant.

@hammerhead: Can I ask you to try those commands on Google Colab, to find out which dependency is causing the package resolver to detour into infinity?

This command should cause the symptom you are observing, right?

pip install --upgrade --requirement https://github.com/crate/cratedb-examples/raw/main/topic/machine-learning/llm-langchain/requirements.txt

What about that?

pip install --upgrade --requirement https://github.com/crate/cratedb-examples/raw/main/topic/machine-learning/llm-langchain/requirements.txt "google-cloud-aiplatform<1.48"
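
For working through several candidate pins in a row, the commands can also be generated rather than typed by hand. The following is a sketch under assumptions: the helper `pip_install_command` is hypothetical, the URL is passed as a requirements file via `--requirement` (matching the `-r ...` lines in the log above), and the only candidate pin listed is the one suggested in this comment. The snippet only prints the commands and does not run pip.

```python
import shlex
import sys

REQUIREMENTS_URL = (
    "https://github.com/crate/cratedb-examples/raw/main/"
    "topic/machine-learning/llm-langchain/requirements.txt"
)


def pip_install_command(requirements_url, extra_pins=()):
    """Build a pip argument list: the requirements file plus optional extra pins."""
    return [
        sys.executable, "-m", "pip", "install", "--upgrade",
        "--requirement", requirements_url,
        *extra_pins,
    ]


# Candidate sets of extra pins to try, one resolver run per entry.
CANDIDATE_PINS = [
    (),                                 # baseline, expected to show the symptom
    ("google-cloud-aiplatform<1.48",),  # the pin suggested above
]

for pins in CANDIDATE_PINS:
    # shlex.join quotes arguments such as "google-cloud-aiplatform<1.48" for the shell.
    print(shlex.join(pip_install_command(REQUIREMENTS_URL, pins)))
```

Each returned list can be passed to `subprocess.run` unchanged, which sidesteps shell quoting of the `<` character in the pin.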

amotl commented 5 months ago

@hammerhead reported that he experienced the problems in the area of pueblo[fileio].

This patch gets rid of the dependency completely.