AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0

Can't install on WSL 2 Windows 10 or Crashes (using faiss-gpu) #144

Closed. grahama1970 closed this issue 6 months ago.

grahama1970 commented 7 months ago

RAGatouille works great in a Colab notebook, but I can't seem to get it to install or run in WSL2. I've tried a Jupyter notebook and the Python code supplied in the docs, to no avail. I'd love to try again when it's ready for WSL2 :) I uninstalled faiss-cpu and installed faiss-gpu. I've got a single A5000 GPU with 24 GB.

bclavie commented 7 months ago

Hey, thank you for reporting this. I'm curious about the installation issue: do you mean that it doesn't install when running pip install (or your package manager of choice)?

grahama1970 commented 7 months ago

I've tried pip and conda; I can't seem to get either to work for RAGatouille yet. The Google Colab Jupyter notebook does work.

bclavie commented 6 months ago

Could you send over the output you get when you try to run the installation commands?

phaistos commented 6 months ago

I have RAGatouille working under WSL2 with faiss-gpu. AFAIK conda is the only way to install faiss-gpu (and have it actually work). I did run into some issues along the way, though.

If conda installs its own libstdc++.so, you may run into an issue where building those C++ extensions fails, because your host's compiler expects to link against a different/newer version of the standard library. You will see a missing GLIBCXX_... symbol error. The solution is to install gcc into your conda environment.
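
You can confirm the mismatch by comparing the newest GLIBCXX symbol each libstdc++ exports (paths here are from my setup; adjust for your conda prefix and distro):

strings ~/.conda/envs/my_env/lib/libstdc++.so.6 | grep GLIBCXX | sort -V | tail -n 1
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX | sort -V | tail -n 1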

I use pipenv, so I had to convince it to use the Python interpreter and site-packages from the conda environment to get access to faiss-gpu:

conda create python=3.10.11 -n my_env
conda install -n my_env pytorch/label/nightly::faiss-gpu conda-forge::gxx
pipenv --python=$(conda run -n my_env which python) --site-packages install
pipenv run pip uninstall -y faiss-cpu

And then I had to make sure the conda compiler was used the first time I ran my example:

> CXX=~/.conda/envs/my_env/bin/c++ ./ragatouille_example.py
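
To double-check that the GPU build is the one actually being imported, a quick sanity check like this works (get_num_gpus only exists in GPU builds of faiss, hence the getattr guard; 0 means you are still on faiss-cpu):

python -c "import faiss; print(faiss.__file__); print(getattr(faiss, 'get_num_gpus', lambda: 0)())"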

Hope this helps. The whole situation is a bit of a mess, really.

TheMcSebi commented 6 months ago

In case you recently updated or reinstalled CUDA on your Windows host, you might need to reinstall the cuda-toolkit-12-3 package inside WSL. A simple apt update apparently won't do, as I recently had to find out.
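
For reference, a full reinstall looks roughly like this (the exact package name depends on which CUDA version you installed):

sudo apt-get install --reinstall cuda-toolkit-12-3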

grahama1970 commented 6 months ago

I tried the 01_basic_indexing_and_searching notebook again on WSL2, using a fresh conda install of Python 3.10.11. Would it be better to check back with RAGatouille in a couple of months? I do look forward to seeing how the RAG approach performs in a RunPod or WSL2 environment. Output below:


WARNING! You have a GPU available, but only faiss-cpu is currently installed. This means that indexing will be slow. To make use of your GPU. Please install faiss-gpu by running: pip uninstall --y faiss-cpu & pip install faiss-gpu


Will continue with CPU indexing in 5 seconds...

[Feb 29, 08:02:29] #> Note: Output directory .ragatouille/colbert/indexes/Miyazaki already exists

[Feb 29, 08:02:29] #> Will delete 1 files already at .ragatouille/colbert/indexes/Miyazaki in 20 seconds...
[Feb 29, 08:02:51] [0] #> Encoding 81 passages..
[Feb 29, 08:02:54] [0] avg_doclen_est = 129.82716369628906  len(local_sample) = 81
[Feb 29, 08:02:54] [0] Creating 1,024 partitions.
[Feb 29, 08:02:54] [0] Estimated 10,516 embeddings.
[Feb 29, 08:02:54] [0] #> Saving the indexing plan to .ragatouille/colbert/indexes/Miyazaki/plan.json ..

AttributeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 RAG.index(
      2     collection=[full_document],
      3     document_ids=['miyazaki'],
      4     document_metadatas=[{"entity": "person", "source": "wikipedia"}],
      5     index_name="Miyazaki",
      6     max_document_length=180,
      7     split_documents=True
      8 )

File ~/anaconda3/envs/raga_5/lib/python3.10/site-packages/ragatouille/RAGPretrainedModel.py:210, in RAGPretrainedModel.index(self, collection, document_ids, document_metadatas, index_name, overwrite_index, max_document_length, split_documents, document_splitter_fn, preprocessing_fn, bsize)
    201 document_splitter_fn = None
    202 collection, pid_docid_map, docid_metadata_map = self._process_corpus(
    203     collection,
    204     document_ids,
        (...)
    208     max_document_length,
    209 )
--> 210 return self.model.index(
    211     collection,
    212     pid_docid_map=pid_docid_map,
    213     docid_metadata_map=docid_metadata_map,
    214     index_name=index_name,
...
--> 502 kmeans = faiss.Kmeans(dim, num_partitions, niter=kmeans_niters, gpu=use_gpu, verbose=True, seed=123)
    504 sample = shared_lists[0][0]
    505 sample = sample.float().numpy()

AttributeError: module 'faiss' has no attribute 'Kmeans'

phaistos commented 6 months ago

Maybe some out-of-date dependencies?
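
If faiss imports but lacks attributes like Kmeans, that often points at a broken or shadowed install (e.g. faiss-cpu and faiss-gpu both present, or a stray faiss directory on the path). Worth checking which module is actually being imported and what is installed (run inside the environment in question):

python -c "import faiss; print(faiss.__file__)"
pip list | grep -i faiss
conda list faiss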

It's definitely working here with my personal project.

-> % uname -a
Linux Euminides 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 GNU/Linux
(llm-rag-examples) host@Euminides [08:43:50 AM] [~/llm-rag-examples] [master *]
-> % python -V
Python 3.10.11
(llm-rag-examples) host@Euminides [08:43:53 AM] [~/llm-rag-examples] [master *]
-> % ./rag_index_only.py
Choose a topic to pull RAG info for: tennis

[Feb 29, 08:44:14] #> Creating directory .ragatouille/colbert/indexes/rag_index_only_1709214254305

[Feb 29, 08:44:15] [0]           #> Encoding 112 passages..
[Feb 29, 08:44:16] [0]           avg_doclen_est = 132.89285278320312     len(local_sample) = 112
[Feb 29, 08:44:16] [0]           Creating 1,024 partitions.
[Feb 29, 08:44:16] [0]           *Estimated* 14,883 embeddings.
[Feb 29, 08:44:16] [0]           #> Saving the indexing plan to .ragatouille/colbert/indexes/rag_index_only_1709214254305/plan.json ..
WARNING clustering 14140 points to 1024 centroids: please provide at least 39936 training points
Clustering 14140 points in 128D to 1024 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (0.06 s, search 0.04 s): objective=3408.15 imbalance=1.404 nsplit=0
[Feb 29, 08:44:16] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Feb 29, 08:44:25] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[0.038, 0.039, 0.038, 0.034, 0.041, 0.035, 0.035, 0.038, 0.035, 0.035, 0.034, 0.036, 0.036, 0.039, 0.037, 0.037, 0.031, 0.039, 0.036, 0.039, 0.036, 0.04, 0.035, 0.038, 0.036, 0.037, 0.036, 0.035, 0.036, 0.039, 0.036, 0.041, 0.039, 0.035, 0.039, 0.032, 0.042, 0.038, 0.036, 0.043, 0.037, 0.035, 0.035, 0.036, 0.039, 0.034, 0.033, 0.039, 0.037, 0.039, 0.033, 0.034, 0.037, 0.039, 0.037, 0.034, 0.044, 0.037, 0.041, 0.036, 0.036, 0.035, 0.036, 0.037, 0.039, 0.038, 0.039, 0.037, 0.033, 0.036, 0.041, 0.036, 0.034, 0.037, 0.036, 0.038, 0.038, 0.04, 0.036, 0.039, 0.037, 0.036, 0.039, 0.044, 0.037, 0.041, 0.036, 0.039, 0.034, 0.039, 0.04, 0.037, 0.036, 0.04, 0.035, 0.035, 0.04, 0.036, 0.042, 0.039, 0.039, 0.044, 0.038, 0.037, 0.039, 0.036, 0.034, 0.036, 0.041, 0.037, 0.042, 0.04, 0.037, 0.034, 0.036, 0.036, 0.037, 0.039, 0.038, 0.037, 0.037, 0.037, 0.033, 0.039, 0.037, 0.036, 0.038, 0.034]
0it [00:00, ?it/s][Feb 29, 08:44:34] [0]                 #> Encoding 112 passages..
1it [00:00,  4.81it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3326.17it/s]
[Feb 29, 08:44:35] #> Optimizing IVF to store map from centroids to list of pids..
[Feb 29, 08:44:35] #> Building the emb2pid mapping..
[Feb 29, 08:44:35] len(emb2pid) = 14884
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [00:00<00:00, 91374.51it/s]
[Feb 29, 08:44:35] #> Saved optimized IVF to .ragatouille/colbert/indexes/rag_index_only_1709214254305/ivf.pid.pt
Done indexing!
Enter a query: champion
Loading searcher for index rag_index_only_1709214254305 for the first time... This may take a few seconds
[Feb 29, 08:44:39] #> Loading codec...
[Feb 29, 08:44:39] #> Loading IVF...
[Feb 29, 08:44:39] #> Loading doclens...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17848.10it/s]
[Feb 29, 08:44:39] #> Loading codes and residuals...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1349.95it/s]
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . champion,            True,           None
#> Output IDs: torch.Size([32]), tensor([ 101,    1, 3410,  102,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0')

[Document(page_content='Together, these four events are called the Majors or Slams (a term borrowed from bridge rather than baseball).\nIn 1913, the International Lawn Tennis Federation (ILTF), now the International Tennis Federation (ITF), was founded and established three official tournaments as the major championships of the day. The World Grass Court Championships were awarded to Great Britain. The World Hard Court Championships were awarded to France; the term "hard court" was used for clay courts at the time. Some tournaments were held in Belgium instead. And the World Covered Court Championships for indoor courts were awarded annually; Sweden, France, Great Britain, Denmark, Switzerland and Spain each hosted the tournament.'),
 Document(page_content='Some tournaments were held in Belgium instead. And the World Covered Court Championships for indoor courts were awarded annually; Sweden, France, Great Britain, Denmark, Switzerland and Spain each hosted the tournament. At a meeting held on 16 March 1923 in Paris, the title "World Championship" was dropped and a new category of "Official Championship" was created for events in Great Britain, France, the US and Australia  – today\'s Grand Slam events.'),
 Document(page_content='Some observers, however, also felt that Kramer deserved consideration for the title. Kramer was among the few who dominated amateur and professional tennis during the late 1940s and early 1950s. Tony Trabert has said that of the players he saw before the start of the Open Era, Kramer was the best male champion.By the 1960s, Budge and others had added Pancho Gonzales and Lew Hoad to the list of contenders. Budge reportedly believed that Gonzales was the greatest player ever. Gonzales said about Hoad, "When Lew\'s game was at its peak nobody could touch him. ... I think his game was the best game ever. Better than mine. He was capable of making more shots than anybody. His two volleys were great. His overhead was enormous.')]
grahama1970 commented 6 months ago

Perhaps this is a clue: using the langchain version of FAISS seems to work.

from langchain_community.vectorstores import FAISS  # might be the fix here

I guess that means I can test this thing out now. Looking forward to it.
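A minimal way to confirm that import path works (a sketch; assumes sentence-transformers is installed for HuggingFaceEmbeddings). Notably, the langchain wrapper only needs faiss's core index classes, not Python-side extras like faiss.Kmeans, which may be why it works here:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# build a tiny index and query it to confirm the faiss import path works
vs = FAISS.from_texts(["grass court", "clay court"], HuggingFaceEmbeddings())
print(vs.similarity_search("clay", k=1))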

Notebook gist (running on WSL2) is below:
https://gist.github.com/grahama1970/9683c295bec75dc7c97d084873b9d1c5

bclavie commented 6 months ago

As this is a faiss issue, it should be fixed in 0.0.8, as long as you are indexing fewer than ~100k documents (the new tentative faiss replacement just uses PyTorch to perform k-means).
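
For the curious, the replacement boils down to running plain Lloyd-style k-means over the sampled embeddings in PyTorch. A minimal sketch of the idea (not the exact code shipping in 0.0.8):

import torch

def pytorch_kmeans(x: torch.Tensor, k: int, niters: int = 20) -> torch.Tensor:
    # x: (n, d) sampled embeddings, with n >= k; returns (k, d) centroids
    centroids = x[torch.randperm(x.shape[0])[:k]].clone()
    for _ in range(niters):
        # assignment step: nearest centroid for every point
        assignments = torch.cdist(x, centroids).argmin(dim=1)
        # update step: move each centroid to the mean of its members
        for j in range(k):
            members = x[assignments == j]
            if members.numel() > 0:
                centroids[j] = members.mean(dim=0)
    return centroids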