OSU-NLP-Group / HippoRAG

HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents.
https://arxiv.org/abs/2405.14831
MIT License
790 stars · 68 forks

Faiss assertion 'err #15

Open Lbaiall opened 3 weeks ago

Lbaiall commented 3 weeks ago

Everything is set up, but when I run the indexing function it gets stuck at the end of the command line. My Linux CUDA version is 12.1 and my PyTorch CUDA version is also 12.1. Does anyone have the same error? It seems to be a faiss-gpu error.

ner_gpt-3.5-turbo-1106_3
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17050.02it/s]
0it [00:00, ?it/s]
| 0/1 [00:00<?, ?it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14122.24it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 16256.99it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
OpenIE saved to output/openie_sample_results_ner_gpt-3.5-turbo-1106_3.json
Passage NER already saved to output/sample_queries.named_entity_output.tsv
100%|██████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 10856.70it/s]
Correct Wiki Format: 0 out of 3
100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13797.05it/s]

[Jun 14, 13:46:21] #> Note: Output directory colbert/indexes/nbits_2 already exists

[Jun 14, 13:46:21] #> Will delete 1 files already at colbert/indexes/nbits_2 in 20 seconds...

> Starting...

nranks = 1 num_gpus = 1 device=0
{ "query_token_id": "[unused0]", "doc_token_id": "[unused1]", "query_token": "[Q]", "doc_token": "[D]", "ncells": null, "centroid_score_threshold": null, "ndocs": null, "load_index_with_mmap": false, "index_path": null, "index_bsize": 64, "nbits": 2, "kmeans_niters": 20, "resume": false, "similarity": "cosine", "bsize": 64, "accumsteps": 1, "lr": 1e-5, "maxsteps": 400000, "save_every": null, "warmup": 20000, "warmup_bert": null, "relu": false, "nway": 64, "use_ib_negatives": true, "reranker": false, "distillation_alpha": 1.0, "ignore_scores": false, "model_name": null, "query_maxlen": 32, "attend_to_mask_tokens": false, "interaction": "colbert", "dim": 128, "doc_maxlen": 180, "mask_punctuation": true, "checkpoint": "exp\/colbertv2.0", "triples": "\/future\/u\/okhattab\/root\/unit\/experiments\/2021.10\/downstream.distillation.round2.2_score\/round2.nway6.cosine.ib\/examples.64.json", "collection": "data\/lm_vectors\/colbert\/corpus.tsv", "queries": "\/future\/u\/okhattab\/data\/MSMARCO\/queries.train.tsv", "index_name": "nbits_2", "overwrite": false, "root": "", "experiment": "colbert", "index_root": null, "name": "2024-06\/14\/13.46.17", "rank": 0, "nranks": 1, "amp": true, "gpus": 1, "avoid_fork_if_possible": false }
[Jun 14, 13:46:47] #> Loading collection...
0M
[Jun 14, 13:46:50] [0] # of sampled PIDs = 29 sampled_pids[:3] = [13, 23, 0]
[Jun 14, 13:46:50] [0] #> Encoding 29 passages..
[Jun 14, 13:46:51] [0] avg_doclen_est = 7.103448390960693 len(local_sample) = 29
[Jun 14, 13:46:51] [0] Creating 128 partitions.
[Jun 14, 13:46:51] [0] Estimated 206 embeddings.
[Jun 14, 13:46:51] [0] #> Saving the indexing plan to colbert/indexes/nbits_2/plan.json ..
WARNING clustering 196 points to 128 centroids: please provide at least 4992 training points
Clustering 196 points in 128D to 128 clusters, redo 1 times, 20 iterations
Preprocessing in 0.00 s
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext; cudaStream_t = CUstream_st] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (196, 128) x (128, 128)' = (196, 128) gemm params m 128 n 196 k 128 trA T trB N lda 128 ldb 128 ldc 128
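For reference, the assertion is raised inside faiss-gpu's k-means while clustering the 196 sampled embeddings (a cuBLAS GEMM call fails with status 13). A minimal standalone check, independent of HippoRAG and ColBERT, can help tell whether the faiss-gpu/CUDA build itself is at fault; this is only a diagnostic sketch, not project code:

import numpy as np
import faiss  # requires the faiss-gpu build

# Mirror the failing call from the log: cluster 196 vectors of dim 128 into 128 centroids on the GPU.
d, n, k = 128, 196, 128
xs = np.random.rand(n, d).astype('float32')
kmeans = faiss.Kmeans(d, k, niter=20, gpu=True)  # gpu=True exercises the same GPU clustering path
kmeans.train(xs)
print('faiss-gpu k-means OK, centroid shape:', kmeans.centroids.shape)

If this snippet hits the same CUBLAS_STATUS_SUCCESS assertion, the problem lies in the faiss-gpu/CUDA installation rather than in HippoRAG.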

yhshu commented 2 weeks ago

Hello, could you execute this in your conda environment and try again:

pip install setuptools==69.5.1
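After installing, a quick way to confirm the pin took effect inside the same environment (plain Python, nothing HippoRAG-specific):

import setuptools
print(setuptools.__version__)  # expect 69.5.1 after the pip install above
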
Lbaiall commented 2 weeks ago

@yhshu No, it still gets stuck right there, and I guess it could be caused by the venv? I'm not using conda, just a local Python .venv. Or is it something in my CUDA version or my GPU hardware?

WARNING clustering 196 points to 128 centroids: please provide at least 4992 training points
Clustering 196 points in 128D to 128 clusters, redo 1 times, 20 iterations
Preprocessing in 0.00 s
Faiss assertion 'err == CUBLAS_STATUS_SUCCESS' failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<float, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<IndexType, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with AT = float; BT = float; cublasHandle_t = cublasContext; cudaStream_t = CUstream_st] at /project/faiss/faiss/gpu/utils/MatrixMult-inl.cuh:265; details: cublas failed (13): (196, 128) x (128, 128)' = (196, 128) gemm params m 128 n 196 k 128 trA T trB N lda 128 ldb 128 ldc 128

yhshu commented 2 weeks ago

Did you install all packages with the versions specified in requirements.txt? Any version difference could cause such an error.
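If it helps, here is a small self-contained check (a sketch, not something shipped with HippoRAG) that prints every installed package whose version differs from the pin in requirements.txt:

# compare installed package versions against the pins in requirements.txt
from importlib.metadata import version, PackageNotFoundError

with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip comments and unpinned entries
        name, pinned = line.split("==", 1)
        name = name.split("[")[0].strip()  # drop extras such as pkg[gpu]
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned.strip():
            print(f"{name}: pinned {pinned.strip()}, installed {installed}")

Any line it prints is a candidate cause for errors like the faiss-gpu assertion above.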

Lbaiall commented 2 weeks ago

It works, thanks! But now there is a new error:

  File "/home/ai/HippoRAG/src/colbertv2_indexing.py", line 41, in <module>
    kb_phrase_dict = pickle.load(open(args.phrase, 'rb'))
FileNotFoundError: [Errno 2] No such file or directory: 'output/sample_facts_and_sim_graph_phrase_dict_ents_only_lower_preprocess_ner.v3.subset.p'

yhshu commented 2 weeks ago

Please always post all related commands you executed, so we can help you.

Lbaiall commented 2 weeks ago

root@DESKTOP-9O20ND7:/home/ai/HippoRAG# bash 11.sh
src/setup_hipporag_colbert.sh: line 9: python: command not found
src/setup_hipporag_colbert.sh: line 10: python: command not found
src/setup_hipporag_colbert.sh: line 13: python: command not found
src/setup_hipporag_colbert.sh: line 16: python: command not found
src/setup_hipporag_colbert.sh: line 17: python: command not found
src/setup_hipporag_colbert.sh: line 19: python: command not found

[Jun 15, 20:49:51] #> Note: Output directory data/lm_vectors/colbert/sample/corpus/indexes/nbits_2 already exists

[Jun 15, 20:49:51] #> Will delete 10 files already at data/lm_vectors/colbert/sample/corpus/indexes/nbits_2 in 20 seconds...

> Starting...

nranks = 1 num_gpus = 1 device=0
{ "query_token_id": "[unused0]", "doc_token_id": "[unused1]", "query_token": "[Q]", "doc_token": "[D]", "ncells": null, "centroid_score_threshold": null, "ndocs": null, "load_index_with_mmap": false, "index_path": null, "index_bsize": 64, "nbits": 2, "kmeans_niters": 20, "resume": false, "similarity": "cosine", "bsize": 64, "accumsteps": 1, "lr": 1e-5, "maxsteps": 400000, "save_every": null, "warmup": 20000, "warmup_bert": null, "relu": false, "nway": 64, "use_ib_negatives": true, "reranker": false, "distillation_alpha": 1.0, "ignore_scores": false, "model_name": null, "query_maxlen": 32, "attend_to_mask_tokens": false, "interaction": "colbert", "dim": 128, "doc_maxlen": 180, "mask_punctuation": true, "checkpoint": "exp\/colbertv2.0", "triples": "\/future\/u\/okhattab\/root\/unit\/experiments\/2021.10\/downstream.distillation.round2.2_score\/round2.nway6.cosine.ib\/examples.64.json", "collection": "data\/lm_vectors\/colbert\/sample_corpus_3.tsv", "queries": "\/future\/u\/okhattab\/data\/MSMARCO\/queries.train.tsv", "index_name": "nbits_2", "overwrite": false, "root": "data\/lm_vectors\/colbert\/sample", "experiment": "corpus", "index_root": null, "name": "2024-06\/15\/20.49.49", "rank": 0, "nranks": 1, "amp": true, "gpus": 1, "avoid_fork_if_possible": false }
[Jun 15, 20:50:14] #> Loading collection...
0M
[Jun 15, 20:50:17] [0] # of sampled PIDs = 3 sampled_pids[:3] = [1, 0, 2]
[Jun 15, 20:50:17] [0] #> Encoding 3 passages..
[Jun 15, 20:50:19] [0] avg_doclen_est = 90.33333587646484 len(local_sample) = 3
[Jun 15, 20:50:19] [0] Creating 256 partitions.
[Jun 15, 20:50:19] [0] Estimated 271 embeddings.
[Jun 15, 20:50:19] [0] #> Saving the indexing plan to data/lm_vectors/colbert/sample/corpus/indexes/nbits_2/plan.json ..
WARNING clustering 258 points to 256 centroids: please provide at least 9984 training points
Clustering 258 points in 128D to 256 clusters, redo 1 times, 20 iterations
Preprocessing in 0.00 s
Iteration 19 (0.06 s, search 0.03 s): objective=0.0608023 imbalance=1.008 nsplit=0
[Jun 15, 20:50:20] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 15, 20:50:20] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[0.024, 0.04, 0.04, 0.033, 0.039, 0.047, 0.022, 0.026, 0.037, 0.056, 0.035, 0.04, 0.041, 0.038, 0.017, 0.035, 0.036, 0.025, 0.044, 0.03, 0.038, 0.039, 0.025, 0.04, 0.03, 0.051, 0.022, 0.043, 0.057, 0.052, 0.036, 0.038, 0.039, 0.042, 0.036, 0.054, 0.017, 0.041, 0.036, 0.02, 0.018, 0.048, 0.046, 0.048, 0.026, 0.042, 0.043, 0.044, 0.031, 0.041, 0.038, 0.039, 0.034, 0.019, 0.028, 0.049, 0.044, 0.024, 0.046, 0.027, 0.019, 0.039, 0.026, 0.033, 0.032, 0.03, 0.05, 0.024, 0.021, 0.023, 0.044, 0.039, 0.037, 0.036, 0.041, 0.026, 0.048, 0.033, 0.034, 0.038, 0.034, 0.033, 0.039, 0.034, 0.044, 0.054, 0.038, 0.028, 0.051, 0.035, 0.037, 0.019, 0.029, 0.034, 0.033, 0.038, 0.024, 0.045, 0.033, 0.049, 0.059, 0.045, 0.023, 0.047, 0.047, 0.03, 0.042, 0.036, 0.023, 0.02, 0.015, 0.025, 0.042, 0.034, 0.029, 0.024, 0.033, 0.027, 0.041, 0.022, 0.02, 0.046, 0.044, 0.047, 0.025, 0.036, 0.025, 0.038]
[Jun 15, 20:50:20] #> Got bucket_cutoffs_quantiles = tensor([0.2500, 0.5000, 0.7500], device='cuda:0') and bucket_weights_quantiles = tensor([0.1250, 0.3750, 0.6250, 0.8750], device='cuda:0')
[Jun 15, 20:50:20] #> Got bucket_cutoffs = tensor([-2.2236e-02, -7.6294e-06, 2.2995e-02], device='cuda:0') and bucket_weights = tensor([-0.0479, -0.0089, 0.0083, 0.0513], device='cuda:0')
[Jun 15, 20:50:20] avg_residual = 0.0355224609375
0it [00:00, ?it/s]
[Jun 15, 20:50:20] [0] #> Encoding 3 passages..
[Jun 15, 20:50:20] [0] #> Saving chunk 0: 3 passages and 271 embeddings. From #0 onward.
1it [00:00, 27.65it/s]
[Jun 15, 20:50:20] [0] #> Checking all files were saved...
[Jun 15, 20:50:20] [0] Found all files!
[Jun 15, 20:50:20] [0] #> Building IVF...
[Jun 15, 20:50:20] [0] #> Loading codes...
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1287.39it/s]
[Jun 15, 20:50:20] [0] Sorting codes...
[Jun 15, 20:50:20] [0] Getting unique codes...
[Jun 15, 20:50:20] #> Optimizing IVF to store map from centroids to list of pids..
[Jun 15, 20:50:20] #> Building the emb2pid mapping..
[Jun 15, 20:50:20] len(emb2pid) = 271
100%|█████████████████████████████████████████████████████████████████████████████| 256/256 [00:00<00:00, 338079.92it/s]
[Jun 15, 20:50:20] #> Saved optimized IVF to data/lm_vectors/colbert/sample/corpus/indexes/nbits_2/ivf.pid.pt
[Jun 15, 20:50:20] [0] #> Saving the indexing metadata to data/lm_vectors/colbert/sample/corpus/indexes/nbits_2/metadata.json ..

> Joined...

Traceback (most recent call last):
  File "/home/ai/HippoRAG/src/colbertv2_indexing.py", line 41, in <module>
    kb_phrase_dict = pickle.load(open(args.phrase, 'rb'))
FileNotFoundError: [Errno 2] No such file or directory: 'output/sample_facts_and_sim_graph_phrase_dict_ents_only_lower_preprocess_ner.v3.subset.p'
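The traceback shows that src/colbertv2_indexing.py expects a phrase-dictionary pickle, presumably written by the earlier OpenIE/graph-construction step, before ColBERT indexing can finish. A quick pre-check along these lines (a sketch, not repository code) surfaces the missing prerequisite before a long indexing run starts:

import os
import pickle

# Path taken from the traceback above; it should exist once the preprocessing step has run.
phrase_path = 'output/sample_facts_and_sim_graph_phrase_dict_ents_only_lower_preprocess_ner.v3.subset.p'

if not os.path.exists(phrase_path):
    raise SystemExit(f'Missing {phrase_path}; rerun the step that produces it before indexing')

with open(phrase_path, 'rb') as f:
    kb_phrase_dict = pickle.load(f)
print(f'Loaded phrase dict with {len(kb_phrase_dict)} entries')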

yhshu commented 2 weeks ago

You definitely need to take care of the basic configuration first, as I can see "python: command not found" in your output.

yhshu commented 2 weeks ago

You may close this issue if there is no other problem in this thread, thanks.

Lbaiall commented 2 weeks ago

@yhshu Right after "#> Saving the indexing plan to colbert/indexes/nbits_2/plan.json ..", the process reports that the number of training points is less than the number of clusters. I added more data to my sample data file, but even with 12 passages it still comes up short:

[Jun 20, 13:19:54] #> Loading collection...
0M
[Jun 20, 13:19:57] [0] # of sampled PIDs = 12 sampled_pids[:3] = [6, 0, 4]
[Jun 20, 13:19:57] [0] #> Encoding 12 passages..
[Jun 20, 13:19:58] [0] avg_doclen_est = 4.5 len(local_sample) = 12
[Jun 20, 13:19:58] [0] Creating 64 partitions.
[Jun 20, 13:19:58] [0] Estimated 54 embeddings.
[Jun 20, 13:19:58] [0] #> Saving the indexing plan to colbert/indexes/nbits_2/plan.json ..
Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process
    return_val = callee(config, *args)
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 33, in encode
    encoder.run(shared_lists)
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 68, in run
    self.train(shared_lists)  # Trains centroids from selected passages
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 232, in train
    centroids = self._train_kmeans(sample, shared_lists)
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 304, in _train_kmeans
    centroids = compute_faiss_kmeans(*args)
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 507, in compute_faiss_kmeans
    kmeans.train(sample)
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/faiss/__init__.py", line 1560, in train
    clus.train(x, self.index, weights)
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/faiss/__init__.py", line 68, in replacement_train
    self.train_c(n, swig_ptr(x), index)
  File "/home/ai/HippoRAG/.venv/lib/python3.10/site-packages/faiss/swigfaiss.py", line 2328, in train
    return _swigfaiss.Clustering_train(self, n, x, index, x_weights)
RuntimeError: Error in void faiss::Clustering::train_encoded(faiss::Clustering::idx_t, const uint8_t*, const faiss::Index*, faiss::Index&, const float*) at /project/faiss/faiss/Clustering.cpp:283: Error: 'nx >= k' failed: Number of training points (52) should be at least as large as number of clusters (64)
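For context, the partition count in these logs appears to track the estimated number of embeddings: 206 embeddings gave 128 partitions, 271 gave 256, and here 54 give 64, which is consistent with roughly 2**floor(log2(16*sqrt(num_embeddings))). A back-of-the-envelope check (a sketch inferred from the numbers in the logs above, not HippoRAG or ColBERT code) shows why a handful of very short passages still ends up with fewer training points than clusters:

import math

def estimated_partitions(num_embeddings: int) -> int:
    # Rule of thumb consistent with the partition counts printed in the logs above.
    return int(2 ** math.floor(math.log2(16 * math.sqrt(num_embeddings))))

for n_emb in (54, 128, 206, 271):
    k = estimated_partitions(n_emb)
    status = "enough" if n_emb >= k else "too few"
    print(f"{n_emb:4d} embeddings -> {k:3d} partitions ({status} training points)")

With avg_doclen_est = 4.5, the 12 passages only yield about 54 embeddings while 64 clusters are requested, which is exactly what the 'nx >= k' error reports.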

yhshu commented 2 weeks ago

I think this is an environment issue rather than a data-size issue. Could you first check whether your environment matches requirements.txt?