AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0

Indexing fails with 'DLL load failed while importing decompress_residuals_cpp: The specified module could not be found.' #183

Closed teleoflexuous closed 7 months ago

teleoflexuous commented 7 months ago

I'm trying to index text via

from ragatouille import RAGPretrainedModel
from ragatouille.utils import get_wikipedia_page

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
my_documents = [get_wikipedia_page("Hayao_Miyazaki"), get_wikipedia_page("Studio_Ghibli")]
index_path = RAG.index(index_name="my_index", collection=my_documents)

It fails while loading `decompress_residuals_cpp` inside ColBERT. I'm reporting the issue here because similar issues have been considered valid and resolved (?) here before, e.g. https://github.com/bclavie/RAGatouille/issues/60.

No index_name received! Using default index_name (colbert-ir/colbertv2.0_new_index)
---- WARNING! You are using PLAID with an experimental replacement for FAISS for greater compatibility ----
This is a behaviour change from RAGatouille 0.8.0 onwards.
This works fine for most users and smallish datasets, but can be considerably slower than FAISS and could cause worse results in some situations.
If you're confident with FAISS working on your machine, pass use_faiss=True to revert to the FAISS-using behaviour.
--------------------

[Mar 31, 12:47:13] #> Note: Output directory .ragatouille/colbert\indexes/colbert-ir/colbertv2.0new_index already exists

[Mar 31, 12:47:13] #> Will delete 1 files already at .ragatouille/colbert\indexes/colbert-ir/colbertv2.0new_index in 20 seconds...
[Mar 31, 12:47:35] [0]           #> Encoding 651 passages..
[Mar 31, 12:47:42] [0]           avg_doclen_est = 178.21812438964844     len(local_sample) = 651
[Mar 31, 12:47:42] [0]           Creating 4,096 partitions.
[Mar 31, 12:47:42] [0]           *Estimated* 116,019 embeddings.
[Mar 31, 12:47:42] [0]           #> Saving the indexing plan to .ragatouille/colbert\indexes/colbert-ir/colbertv2.0new_index\plan.json ..
used 20 iterations (3.2569s) to cluster 110219 items into 4096 clusters
[Mar 31, 12:47:46] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\torch\utils\cpp_extension.py:381: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
INFO: Could not find files for the given pattern(s).
PyTorch-based indexing did not succeed with error: Command '['where', 'cl']' returned non-zero exit status 1. ! Reverting to using FAISS and attempting again...
________________________________________________________________________________       
WARNING! You have a GPU available, but only `faiss-cpu` is currently installed.        
 This means that indexing will be slow. To make use of your GPU.
Please install `faiss-gpu` by running:
pip uninstall --y faiss-cpu & pip install faiss-gpu
 ________________________________________________________________________________      
Will continue with CPU indexing in 5 seconds...

[Mar 31, 12:47:51] #> Note: Output directory .ragatouille/colbert\indexes/colbert-ir/colbertv2.0new_index already exists

[Mar 31, 12:47:51] #> Will delete 1 files already at .ragatouille/colbert\indexes/colbert-ir/colbertv2.0new_index in 20 seconds...
[Mar 31, 12:48:13] [0]           #> Encoding 651 passages..
[Mar 31, 12:48:16] [0]           avg_doclen_est = 178.21812438964844     len(local_sample) = 651
[Mar 31, 12:48:16] [0]           Creating 4,096 partitions.
[Mar 31, 12:48:16] [0]           *Estimated* 116,019 embeddings.
[Mar 31, 12:48:16] [0]           #> Saving the indexing plan to .ragatouille/colbert\indexes/colbert-ir/colbertv2.0new_index\plan.json ..
WARNING clustering 110219 points to 4096 centroids: please provide at least 159744 training points
Clustering 110219 points in 128D to 4096 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.04 s
  Iteration 0 (0.62 s, search 0.61 s): objective=43093.8 imbalance=1.576 nsplit=0      
  Iteration 1 (1.39 s, search 1.38 s): objective=27179.7 imbalance=1.472 nsplit=2      
  Iteration 2 (2.02 s, search 2.00 s): objective=24918.9 imbalance=1.445 nsplit=0      
  Iteration 3 (2.75 s, search 2.72 s): objective=23978.2 imbalance=1.434 nsplit=0      
  Iteration 4 (3.54 s, search 3.50 s): objective=23512.9 imbalance=1.429 nsplit=0      
  Iteration 5 (4.23 s, search 4.19 s): objective=23266.7 imbalance=1.426 nsplit=0      
  Iteration 6 (4.94 s, search 4.89 s): objective=23119.8 imbalance=1.426 nsplit=0      
  Iteration 7 (5.81 s, search 5.75 s): objective=23025.6 imbalance=1.425 nsplit=0      
  Iteration 8 (6.52 s, search 6.45 s): objective=22965.2 imbalance=1.424 nsplit=0      
  Iteration 9 (7.24 s, search 7.16 s): objective=22923.1 imbalance=1.423 nsplit=0      
  Iteration 10 (7.93 s, search 7.85 s): objective=22893.9 imbalance=1.423 nsplit=0     
  Iteration 11 (8.76 s, search 8.67 s): objective=22870.4 imbalance=1.422 nsplit=0     
  Iteration 13 (10.32 s, search 10.21 s): objective=22833.1 imbalance=1.422 nsplit=0   
  Iteration 14 (11.15 s, search 11.02 s): objective=22812.9 imbalance=1.422 nsplit=0   
  Iteration 15 (12.08 s, search 11.94 s): objective=22802.8 imbalance=1.422 nsplit=0   
  Iteration 16 (12.84 s, search 12.69 s): objective=22796.2 imbalance=1.422 nsplit=0   
  Iteration 17 (13.48 s, search 13.33 s): objective=22790.9 imbalance=1.422 nsplit=0   
  Iteration 18 (14.20 s, search 14.04 s): objective=22787.3 imbalance=1.422 nsplit=0   
  Iteration 19 (14.96 s, search 14.79 s): objective=22785.1 imbalance=1.422 nsplit=0   

[Mar 31, 12:48:31] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
Traceback (most recent call last):
  File "c:\(...)\Programs\Python\Python312\Lib\runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\Programs\Python\Python312\Lib\runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "c:\(...)\.vscode\extensions\ms-python.debugpy-2024.2.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy\__main__.py", line 39, in <module>
    cli.main()
  File "c:\(...)\.vscode\extensions\ms-python.debugpy-2024.2.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 430, in main
    run()
  File "c:\(...)\.vscode\extensions\ms-python.debugpy-2024.2.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "c:\(...)\.vscode\extensions\ms-python.debugpy-2024.2.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 321, in run_path   
    return _run_module_code(code, init_globals, run_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\.vscode\extensions\ms-python.debugpy-2024.2.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "c:\(...)\.vscode\extensions\ms-python.debugpy-2024.2.0-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_runpy.py", line 124, in _run_code  
    exec(code, run_globals)
  File "F:\(...)\questions_generator.py", line 70, in <module>
    questions_generator = QuestionsGenerator(
                          ^^^^^^^^^^^^^^^^^^^
  File "F:\(...)\questions_generator.py", line 49, in __init__
    self.index_documents()
  File "F:\(...)\questions_generator.py", line 67, in index_documents
    self.model.index(self.document_texts, self.document_ids)
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\ragatouille\RAGPretrainedModel.py", line 211, in index     
    return self.model.index(
           ^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\ragatouille\models\colbert.py", line 341, in index
    self.model_index = ModelIndexFactory.construct(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\ragatouille\models\index.py", line 485, in construct       
    return ModelIndexFactory._MODEL_INDEX_BY_NAME[
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\ragatouille\models\index.py", line 150, in construct       
    return PLAIDModelIndex(config).build(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\ragatouille\models\index.py", line 254, in build
    indexer.index(name=index_name, collection=collection, overwrite=overwrite)
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\indexer.py", line 80, in index
    self.__launch(collection)
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\indexer.py", line 89, in __launch
    launcher.launch_without_fork(self.config, collection, shared_lists, shared_queues, self.verbose)
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\infra\launcher.py", line 93, in launch_without_fork
    return_val = run_process_without_mp(self.callee, new_config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\infra\launcher.py", line 109, in run_process_without_mp
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\indexing\collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\indexing\collection_indexer.py", line 68, in run   
    self.train(shared_lists) # Trains centroids from selected passages
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\indexing\collection_indexer.py", line 237, in train
    bucket_cutoffs, bucket_weights, avg_residual = self._compute_avg_residual(centroids, heldout)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\indexing\collection_indexer.py", line 315, in _compute_avg_residual
    compressor = ResidualCodec(config=self.config, centroids=centroids, avg_residual=None)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\indexing\codecs\residual.py", line 24, in __init__ 
    ResidualCodec.try_load_torch_extensions(self.use_gpu)
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\colbert\indexing\codecs\residual.py", line 103, in try_load_torch_extensions
    decompress_residuals_cpp = load(
                               ^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\torch\utils\cpp_extension.py", line 1306, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\torch\utils\cpp_extension.py", line 1736, in _jit_compile  
    return _import_module_from_library(name, build_directory, is_python_module)        
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^        
  File "c:\(...)\pypoetry\Cache\virtualenvs\my_project-9wt2rQWL-py3.12\Lib\site-packages\torch\utils\cpp_extension.py", line 2132, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 813, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1288, in create_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
ImportError: DLL load failed while importing decompress_residuals_cpp: The specified module could not be found.

I'm running this on Windows via Poetry. CUDA and torch work in other parts of this project, within the same environment.
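For what it's worth, the `Command '['where', 'cl']' returned non-zero exit status 1` line in the log suggests the MSVC compiler (`cl.exe`) isn't on PATH for the interpreter that runs the indexing. A minimal diagnostic sketch (the `msvc_available` helper is my own name, not part of any library):

```python
import shutil


def msvc_available() -> bool:
    """Return True if the MSVC compiler (cl.exe) is discoverable on PATH.

    torch.utils.cpp_extension shells out to `where cl` on Windows, so if
    this returns False, JIT-compiling extensions such as
    decompress_residuals_cpp is likely to fail as in the log above.
    """
    return shutil.which("cl") is not None


print(msvc_available())
```

Running this from the same Poetry environment (e.g. `poetry run python check_cl.py`) would confirm whether the compiler is visible to the process.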

My pyproject.toml

[tool.poetry]
(...)

[tool.poetry.dependencies]
python = "^3.12"
pellets = "^0.1.0"
nvidia-cuda-runtime-cu12 = "^12.4.99"
faiss-gpu = {version = "^1.7.2", platform = "windows"}

[tool.poetry.group.download.dependencies]
requests = "^2.31.0"
wget = "^3.2"
pylatexenc = {version = "^3.0a21", allow-prereleases = true}
pypdf = "^4.1.0"
pymupdf = "^1.23.26"
pillow = "^10.2.0"
beautifulsoup4 = "^4.12.3"

[tool.poetry.group.text_processing.dependencies]
pdf2image = "^1.17.0"
pytesseract = "^0.3.10"
pypdf2 = "^3.0.1"
colbert-ai = "^0.2.19"
ragatouille = "^0.0.8.post2"

[tool.poetry.group.transcriptions.dependencies]
torch = {version = "^2.2.0+cu121", source = "pytorch"}
torchvision = {version = "^0.17.0+cu121", source = "pytorch"}
pendulum = {git = "https://github.com/sdispater/pendulum.git"}
mutagen = "^1.47.0"
openai-whisper = {git = "https://github.com/openai/whisper.git"}
whisperx = {git = "https://github.com/m-bain/whisperx.git"}
ctranslate2 = "^4.1.0"

[[tool.poetry.source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu118"
priority = "supplemental"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

This is my first interaction with the library, so I hope I've provided all the relevant information.

bclavie commented 7 months ago

This isn't directly a problem with RAGatouille, rather it seems to be due to how Windows is attempting to build the custom C++ code. We don't "officially" aim to support Windows, but I think many have been managing to run it, at least using WSL2. I'm not super familiar with Windows, but the problem does seem to be with a library needed to build this extension not having its DLL present -- maybe it could be fixed by installing/updating VS Code C++ tools?
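To get more detail on what's failing during the extension build, the log itself points at an environment variable. A small sketch of how to enable it, assuming it is set before anything imports `colbert`:

```python
import os

# Must be set before colbert/ragatouille are imported: the
# decompress_residuals_cpp extension is JIT-compiled the first time
# indexing runs, and this flag makes that step print verbose output.
os.environ["COLBERT_LOAD_TORCH_EXTENSION_VERBOSE"] = "True"

# ...then import ragatouille and run RAG.index(...) as in the report.
```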

teleoflexuous commented 7 months ago

Thanks for the update! As far as I know, C++ compilation on Windows is handled by components that ship with Visual Studio, and I couldn't find any VS Code C++ tools that would affect it. I might have missed something, but in my limited interactions with C++ that was the case.

I did try updating Visual Studio, and the script fails in the same way. I needed to run it outside WSL2 this time, but that's on me, I guess. I understand you're not supporting Windows currently.

Thanks again!

bclavie commented 7 months ago

Oops, I did mean Visual Studio C++ utils, not VS Code (force of habit 😄)

Sorry I can't be more helpful here -- hopefully WSL2 solves your problems!