OSU-NLP-Group / HippoRAG

[NeurIPS'24] HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents. RAG + Knowledge Graphs + Personalized PageRank.
https://arxiv.org/abs/2405.14831
MIT License
1.42k stars 118 forks

Getting error in downloading sequence2sequence model from hugging face #6

Closed PrateekSharma007 closed 3 months ago

PrateekSharma007 commented 5 months ago

When running the test_hipporag.py file, I am getting an error:

python test_hipporag.py gpt-3.5-turbo-1106 colbertv2 hotpotqa ner
Traceback (most recent call last):
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    response.raise_for_status()
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/ner/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/transformers/utils/hub.py", line 385, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1368, in hf_hub_download
    raise head_call_error
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1238, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1631, in get_hf_file_metadata
    r = _request_wrapper(
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 409, in _request_wrapper
    hf_raise_for_status(response)
  File "/opt/conda/envs/myenv/lib/python3.9/site-packages/huggingface_hub/utils/_errors.py", line 323, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-665ef0e1-047320641ea765c23e417fb3;c716d4af-222e-4a04-8b9c-d50ae2d2ef54)

Repository Not Found for url: https://huggingface.co/ner/resolve/main/config.json.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.

Can you help me resolve this issue? I am a bit confused.

bernaljg commented 5 months ago

Thanks for your interest!

We would love to help but need more information to reproduce this error. Did you run the indexing process as explained in the README at least with HotpotQA and ColBERTv2?

PrateekSharma007 commented 5 months ago

Hey! Thanks for responding. After doing the indexing, I am getting this error (screenshot: indexing error). Can you tell me what's wrong here? It shows that the model was not found. Thanks!

yhshu commented 5 months ago

Hello, have you placed the colbertv2.0 checkpoint under the exp directory? The README.md explains how to do that.

PrateekSharma007 commented 5 months ago

Yeah, I already did that, but I am still getting the same error, something like there is no URL or an invalid username and password. I did what was given in the README file. Screenshot 2024-06-05 173645

This is basically what comes up:

Repository Not Found for url: https://huggingface.co/exp/colbertv2.0/resolve/main/config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.

bernaljg commented 5 months ago

This error is not happening on our side. Could you please include the commands you are running in the screenshots so we can better assist you?

kartikkMindz commented 5 months ago

I have cloned your code and followed the steps in the README. First I set up the environment and installed all the required libraries:

conda create -n hipporag python=3.9
conda activate hipporag
pip install -r requirements.txt

GPU_DEVICES=0,1,2,3 #Replace with your own free GPU Devices
export OPENAI_API_KEY='Add your own OpenAI API key here.'
export TOGETHER_API_KEY='Add your own TogetherAI API key here.' # If you need to use TogetherAI models such as Llama-3 API

Then I downloaded ColBERTv2:

cd exp
wget https://downloads.cs.stanford.edu/nlp/data/colbert/colbertv2/colbertv2.0.tar.gz
tar –xvzf colbertv2.0.tar.gz

When I index with ColBERTv2, I get this error:

Repository Not Found for url: https://huggingface.co/exp/colbertv2.0/resolve/main/config.json.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.
exp/colbertv2.0 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>

I ran huggingface-cli login and set the token, but I still get the same error.

After that I tried indexing with the HuggingFace retrieval encoder for synonymy edges (i.e. Contriever); that worked properly with no errors.

Then I tried to run the test_hipporag.py script, which looks like this:

import argparse
from hipporag import HippoRAG

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', type=str)
    parser.add_argument('--extraction_model', type=str, default='gpt-3.5-turbo-1106')
    parser.add_argument('--retrieval_model', type=str, choices=['facebook/contriever', 'colbertv2'])
    parser.add_argument('--doc_ensemble', action='store_true')
    args = parser.parse_args()

    hipporag = HippoRAG(args.dataset, args.extraction_model, args.retrieval_model, doc_ensemble=args.doc_ensemble)

    queries = ["Which Stanford University professor works on Alzheimer's"]
    for query in queries:
        ranks, scores, logs = hipporag.rank_docs(query, top_k=10)

        print(ranks)
        print(scores)

But I am getting this error:

(base) drops-ai-model@deeplearning-vm-f2-vm:~/HippoRAG/src$ python3 test_hipporag.py

Traceback (most recent call last):
  File "/home/drops-ai-model/HippoRAG/src/test_hipporag.py", line 2, in <module>
    from hipporag import HippoRAG
  File "/home/drops-ai-model/HippoRAG/src/hipporag.py", line 11, in <module>
    from named_entity_extraction_parallel import *
  File "/home/drops-ai-model/HippoRAG/src/named_entity_extraction_parallel.py", line 19, in <module>
    from src.langchain_util import init_langchain_model
ModuleNotFoundError: No module named 'src'

yhshu commented 5 months ago

This looks like a working directory or environment variable setup issue, where the environment doesn't recognize the HippoRAG root. For example, in your log, after you cd exp, you should return to the root.

bernaljg commented 5 months ago

@kartikkMindz can you tell us what command you ran for indexing using ColBERTv2? Did you run bash src/setup_hipporag_colbert.sh $DATA $LLM $GPUS $SYNONYM_THRESH $LLM_API?

kartikkMindz commented 5 months ago

@bernaljg Yes, I ran bash src/setup_hipporag_colbert.sh $DATA $LLM $GPUS $SYNONYM_THRESH $LLM_API for ColBERTv2.

bernaljg commented 5 months ago

Could you send us the whole output which appears after you run bash src/setup_hipporag_colbert.sh $DATA $LLM $GPUS $SYNONYM_THRESH $LLM_API?

If you can also print out the bash variables using echo $DATA $LLM $GPUS $SYNONYM_THRESH $LLM_API and send us the output that would be great.

kartikkMindz commented 5 months ago

This is the whole output when I run bash src/setup_hipporag_colbert.sh $DATA $LLM $GPUS $SYNONYM_THRESH $LLM_API

Output:

(base) drops-ai-model@deeplearning-vm-f2-vm:~/HippoRAG$ bash src/setup_hipporag_colbert.sh $DATA $LLM $GPUS $SYNONYM_THRESH $LLM_API
ner_gpt-3.5-turbo-1106_3
100%|██████████| 1/1 [00:00<00:00, 12052.60it/s]
0it [00:00, ?it/s] | 0/1 [00:00<?, ?it/s]
0it [00:00, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 15420.24it/s]
0it [00:00, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 10459.61it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
/opt/conda/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/opt/conda/lib/python3.10/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
OpenIE saved to output/openie_sample_results_ner_gpt-3.5-turbo-1106_3.json
Passage NER already saved to output/sample_queries.named_entity_output.tsv
100%|██████████| 3/3 [00:00<00:00, 7719.58it/s]
Correct Wiki Format: 0 out of 3
100%|██████████| 1/1 [00:00<00:00, 9799.78it/s]

[Jun 07, 04:19:18] #> Note: Output directory colbert/indexes/nbits_2 already exists

> Starting...

nranks = 1 num_gpus = 4 device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "load_index_with_mmap": false,
    "index_path": null,
    "index_bsize": 64,
    "nbits": 2,
    "kmeans_niters": 4,
    "resume": false,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 3e-6,
    "maxsteps": 500000,
    "save_every": null,
    "warmup": null,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    "use_ib_negatives": false,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": null,
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 220,
    "mask_punctuation": true,
    "checkpoint": "exp\/colbertv2.0",
    "triples": null,
    "collection": "data\/lm_vectors\/colbert\/corpus.tsv",
    "queries": null,
    "index_name": "nbits_2",
    "overwrite": false,
    "root": "",
    "experiment": "colbert",
    "index_root": null,
    "name": "2024-06\/07\/04.19.16",
    "rank": 0,
    "nranks": 1,
    "amp": true,
    "gpus": 4,
    "avoid_fork_if_possible": false
}
[Jun 07, 04:19:24] #> Loading collection...
0M
Process Process-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 286, in hf_raise_for_status
    response.raise_for_status()
  File "/opt/conda/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/exp/colbertv2.0/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 385, in cached_file
    resolved_file = hf_hub_download(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1368, in hf_hub_download
    raise head_call_error
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1238, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1631, in get_hf_file_metadata
    r = _request_wrapper(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 409, in _request_wrapper
    hf_raise_for_status(response)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 323, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-66628a4c-3882575e4704b0d579ddcbc6;5c7c786e-637c-405c-ae02-401d484d99fb)

Repository Not Found for url: https://huggingface.co/exp/colbertv2.0/resolve/main/config.json.
Please make sure you specified the correct repo_id and repo_type.
If you are trying to access a private or gated repo, make sure you are authenticated.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process
    return_val = callee(config, *args)
  File "/opt/conda/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 32, in encode
    encoder = CollectionIndexer(config=config, collection=collection, verbose=verbose)
  File "/opt/conda/lib/python3.10/site-packages/colbert/indexing/collection_indexer.py", line 52, in __init__
    self.checkpoint = Checkpoint(self.config.checkpoint, colbert_config=self.config)
  File "/opt/conda/lib/python3.10/site-packages/colbert/modeling/checkpoint.py", line 19, in __init__
    super().__init__(name, colbert_config)
  File "/opt/conda/lib/python3.10/site-packages/colbert/modeling/colbert.py", line 21, in __init__
    super().__init__(name, colbert_config)
  File "/opt/conda/lib/python3.10/site-packages/colbert/modeling/base_colbert.py", line 36, in __init__
    self.model = HF_ColBERT.from_pretrained(name_or_path, colbert_config=self.colbert_config)
  File "/opt/conda/lib/python3.10/site-packages/colbert/modeling/hf_colbert.py", line 133, in from_pretrained
    obj = super().from_pretrained(name_or_path, colbert_config=colbert_config)
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2926, in from_pretrained
    resolved_config_file = cached_file(
  File "/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py", line 406, in cached_file
    raise EnvironmentError(
OSError: exp/colbertv2.0 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>

yhshu commented 5 months ago

@kartikkMindz exp/colbertv2.0 is not a HuggingFace model; it is a local checkpoint directory that you need to set up yourself. Could you check that the working directory is set correctly so that the transformers package can find it?

bernaljg commented 5 months ago

yeah, basically make sure that tar -xvzf colbertv2.0.tar.gz ran smoothly and created the directory exp/colbertv2.0 with all the necessary model components.
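
The two checks suggested above could be scripted roughly as follows. This is only a minimal sketch and not part of HippoRAG or ColBERT: the helper name check_local_checkpoint is made up for illustration, and config.json is the only file it requires by default, because that is the file the traceback shows being requested. A relative path such as exp/colbertv2.0 is resolved against the current working directory, so the result depends on where the command is launched from.

```python
from pathlib import Path


def check_local_checkpoint(checkpoint="exp/colbertv2.0", required_files=("config.json",)):
    """Return a list of problems found with a local checkpoint directory.

    Hypothetical helper for illustration; it is not part of HippoRAG or ColBERT.
    An empty list means the directory exists and holds every required file.
    """
    # A relative path is resolved against the current working directory,
    # so the outcome changes depending on where the script is started.
    path = Path(checkpoint).resolve()
    if not path.is_dir():
        return [f"{path} is not a directory (cwd is {Path.cwd()})"]
    return [
        f"missing file: {path / name}"
        for name in required_files
        if not (path / name).is_file()
    ]


if __name__ == "__main__":
    problems = check_local_checkpoint()
    print("checkpoint looks usable" if not problems else "\n".join(problems))
```

Running this from the HippoRAG root before indexing should show whether the directory is missing (a failed extraction) or simply not visible from the current working directory.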

PrateekSharma007 commented 5 months ago

Hi, here is a video of what I did. I cropped out the part where I set the OpenAI API key. https://drive.google.com/file/d/11d3xlniz7SuR6ku1O7UaWyMXUJy3dbqd/view?usp=sharing

yhshu commented 5 months ago

Thank you for your recording. I think the ColBERT model was not extracted successfully from the tar.gz file.

image

This is probably because the command shown in the README uses the wrong character: an en dash (–) instead of a hyphen (-). Please go into exp and extract the model again:

tar -xvzf colbertv2.0.tar.gz
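
If copying shell flags keeps causing trouble, the archive could also be unpacked from Python, which avoids typing tar options at all. The following is a rough sketch using only the standard-library tarfile module; the helper name extract_checkpoint and its default paths are assumptions based on the paths mentioned in this thread, not something the repo provides.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_checkpoint(archive="exp/colbertv2.0.tar.gz", dest="exp"):
    """Extract a .tar.gz archive into dest and return the top-level paths it created.

    Hypothetical helper; only use it on archives from a source you trust.
    """
    dest_path = Path(dest)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the first path component of every member, e.g. "colbertv2.0".
        top_level = sorted(
            {
                PurePosixPath(member.name).parts[0]
                for member in tar.getmembers()
                if PurePosixPath(member.name).parts
            }
        )
        tar.extractall(dest_path)
    return [dest_path / name for name in top_level]
```

Called from the HippoRAG root as extract_checkpoint(), this would be expected to leave the model under exp/, provided the archive has been downloaded there first.
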
PrateekSharma007 commented 5 months ago

Oh, thank you so much!

PrateekSharma007 commented 5 months ago

Hey, the model issue got resolved, but I am now facing what I guess is the last error: Screenshot 2024-06-12 004541

yhshu commented 5 months ago

It's good to hear that. Please change your working directory to HippoRAG root rather than src and try again. Thanks!

PrateekSharma007 commented 5 months ago

Yeah, I did that, but then it shows that the file is not found. test_hipporag.py is in the src folder.

yhshu commented 5 months ago

Could you try adding your HippoRAG root path to the Python module search path?

One way to do this is to add these lines at the top of test_hipporag.py:

import sys
sys.path.append('.')

Or use any other way you like to add the path, such as the PYTHONPATH environment variable.
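
Note that '.' refers to the current working directory, so the two-line snippet only helps when the script is started from the HippoRAG root. A variant that derives the root from the script's own location might look like the sketch below. It assumes test_hipporag.py sits in src/, one level below the root, as described in this thread; the helper name add_repo_root_to_path is hypothetical.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file, levels_up=1):
    """Put the directory `levels_up` levels above the script's folder on sys.path.

    Hypothetical helper. With levels_up=1 and a script at <root>/src/test_hipporag.py,
    <root> is added, so imports such as `from src.hipporag import HippoRAG` can resolve.
    """
    root = Path(script_file).resolve().parents[levels_up]
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# Possible use at the top of src/test_hipporag.py, before any `from src...` import:
# add_repo_root_to_path(__file__)
```

This only affects imports. Relative data paths such as exp/colbertv2.0 are still resolved against the working directory, so running from the HippoRAG root remains the safer choice.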

PrateekSharma007 commented 5 months ago

I have attached screenshots. What is happening is that it shows an error finding src.hipporag. Even if I change the src.hipporag import, there are many linked files that show the same error. I made the change you suggested earlier. Screenshot 2024-06-14 222637 Screenshot 2024-06-14 222716

yhshu commented 5 months ago

Make sure you execute python test_hipporag.py when your working directory is HippoRAG root, i.e., ~/HippoRAG in your case.

PrateekSharma007 commented 5 months ago

Yes, I did that, but it says it's unable to find the file. I will add the screenshot: Screenshot 2024-06-14 223515

yhshu commented 5 months ago

Oh, you definitely need to change that to python src/test_hipporag.py when your working directory is the HippoRAG root.

PrateekSharma007 commented 5 months ago

Sure, I will check and update you then.

PrateekSharma007 commented 5 months ago

Hey, do I need to change anything in the code, or is just cloning the repo and running the steps enough? I want to try it out, so I used the data that was already provided.

yhshu commented 5 months ago

For now, I think it's just a matter of the environment in which you're executing the code. Please go ahead with testing and post any questions you have.

PrateekSharma007 commented 5 months ago

I cloned the repo again, and the errors that came up earlier are now fixed. This is the last issue, I guess: Screenshot 2024-06-16 182118 Screenshot 2024-06-16 182256. I am using ColBERT, so I changed the retrieval model to that and updated the path of the dataset.

yhshu commented 5 months ago

This is not a problem with this repo. If an argument is declared with required=True, you must pass it to the program on the command line; default=some value is just a default value for your reference.
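
To illustrate the point about required arguments, here is a small standalone argparse sketch. The screenshots are not visible in this thread, so the exact argument involved is unknown; the flag names are borrowed from the script quoted earlier, while required=True and the default values are assumptions chosen purely for illustration.

```python
import argparse


def build_parser():
    """Build a parser with one required and one optional argument (illustrative only)."""
    parser = argparse.ArgumentParser()
    # required=True: argparse exits with an error if the flag is missing,
    # so the default given here is never actually used.
    parser.add_argument('--dataset', type=str, required=True, default='sample')
    # Not required: the default is used whenever the flag is omitted.
    parser.add_argument('--retrieval_model', type=str, default='colbertv2')
    return parser


if __name__ == '__main__':
    # An explicit argument list keeps the demo independent of the real command line.
    args = build_parser().parse_args(['--dataset', 'hotpotqa'])
    print(args.dataset, args.retrieval_model)
```

Calling parse_args([]) on this parser stops with a "the following arguments are required" error, which is standard argparse behavior for a missing required flag.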

PrateekSharma007 commented 5 months ago

Yes, the problem is on my side.

kartikkMindz commented 5 months ago

When I run the test_hipporag.py file, it gives me ranks, scores, and logs, but I want the answer. How do I print the answer?

import argparse
from src.hipporag import HippoRAG

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', type=str)
    parser.add_argument('--extraction_model', type=str, default='gpt-3.5-turbo-1106')
    parser.add_argument('--retrieval_model', type=str, choices=['facebook/contriever', 'colbertv2'])
    parser.add_argument('--doc_ensemble', action='store_true')
    args = parser.parse_args()

    hipporag = HippoRAG(args.dataset, args.extraction_model, args.retrieval_model, doc_ensemble=args.doc_ensemble)

    queries = ["Which Stanford University professor works on Alzheimer's"]
    for query in queries:
        ranks, scores, logs = hipporag.rank_docs(query, top_k=10)

        print(ranks)
        print(scores)

yhshu commented 5 months ago

@kartikkMindz This is a new issue, so you could start a new post to discuss it. I've submitted a PR to update how to use QA. It'll be merged soon. Stay tuned, and thanks.

PrateekSharma007 commented 5 months ago

How are you going to handle unstructured PDFs and other unstructured data? Right now it's quite specific.

yhshu commented 5 months ago

Text is unstructured data. PDF is an important RAG application and we welcome any contributions to that.