OSU-NLP-Group / HippoRAG

HippoRAG is a novel RAG framework inspired by human long-term memory that enables LLMs to continuously integrate knowledge across external documents.
https://arxiv.org/abs/2405.14831
MIT License

Getting a lot of File Not Found Errors trying to run the setup on the sample dataset #10

Closed: alew3 closed this issue 3 weeks ago

alew3 commented 3 weeks ago

System: Ubuntu 23.04 / NVIDIA Titan RTX GPU / NVIDIA Driver Version: 550.78 / CUDA Version: 12.4

Trying to run the example with ColBERT:

DATA=sample
LLM=gpt-3.5-turbo
SYNONYM_THRESH=0.8
GPUS=0
LLM_API=openai

bash src/setup_hipporag_colbert.sh $DATA $LLM $GPUS $SYNONYM_THRESH $LLM_API

I get all of these file-not-found errors:

/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
OpenIE saved to output/openie_sample_results_ner_gpt-3.5-turbo_3.json
Passage NER already saved to output/sample_queries.named_entity_output.tsv
Traceback (most recent call last):
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/corpus/util.py", line 84, in __load
    root = nltk.data.find(f"{self.subdir}/{zip_name}")
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords.zip/stopwords/

  Searched in:
    - '/home/ale/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/share/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ale/projects/HippoRAG/src/create_graph.py", line 11, in <module>
    stop_words = set(stopwords.words('english'))
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/corpus/util.py", line 121, in __getattr__
    self.__load()
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/corpus/util.py", line 86, in __load
    raise e
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/corpus/util.py", line 81, in __load
    root = nltk.data.find(f"{self.subdir}/{self.__name}")
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - '/home/ale/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/share/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

Traceback (most recent call last):
  File "/home/ale/projects/HippoRAG/src/colbertv2_knn.py", line 66, in <module>
    string_df = pd.read_csv(string_filename, sep='\t')
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 618, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1618, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1878, in _make_engine
    self.handles = get_handle(
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'output/kb_to_kb.tsv'
Traceback (most recent call last):
  File "/home/ale/projects/HippoRAG/src/colbertv2_knn.py", line 66, in <module>
    string_df = pd.read_csv(string_filename, sep='\t')
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 618, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1618, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1878, in _make_engine
    self.handles = get_handle(
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/pandas/io/common.py", line 873, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'output/query_to_kb.tsv'
Traceback (most recent call last):
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/corpus/util.py", line 84, in __load
    root = nltk.data.find(f"{self.subdir}/{zip_name}")
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords.zip/stopwords/

  Searched in:
    - '/home/ale/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/share/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ale/projects/HippoRAG/src/create_graph.py", line 11, in <module>
    stop_words = set(stopwords.words('english'))
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/corpus/util.py", line 121, in __getattr__
    self.__load()
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/corpus/util.py", line 86, in __load
    raise e
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/corpus/util.py", line 81, in __load
    root = nltk.data.find(f"{self.subdir}/{self.__name}")
  File "/home/ale/anaconda3/envs/hipporag/lib/python3.9/site-packages/nltk/data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - '/home/ale/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/share/nltk_data'
    - '/home/ale/anaconda3/envs/hipporag/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

[Jun 10, 18:05:59] #> Note: Output directory data/lm_vectors/colbert/sample/corpus/indexes/nbits_2 already exists

[Jun 10, 18:05:59] #> Will delete 10 files already at data/lm_vectors/colbert/sample/corpus/indexes/nbits_2 in 20 seconds...
#> Starting...
nranks = 1       num_gpus = 1    device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "load_index_with_mmap": false,
    "index_path": null,
    "index_bsize": 64,
    "nbits": 2,
    "kmeans_niters": 20,
    "resume": false,
    "similarity": "cosine",
    "bsize": 64,
    "accumsteps": 1,
    "lr": 1e-5,
    "maxsteps": 400000,
    "save_every": null,
    "warmup": 20000,
    "warmup_bert": null,
    "relu": false,
    "nway": 64,
    "use_ib_negatives": true,
    "reranker": false,
    "distillation_alpha": 1.0,
    "ignore_scores": false,
    "model_name": null,
    "query_maxlen": 32,
    "attend_to_mask_tokens": false,
    "interaction": "colbert",
    "dim": 128,
    "doc_maxlen": 180,
    "mask_punctuation": true,
    "checkpoint": "exp\/colbertv2.0",
    "triples": "\/future\/u\/okhattab\/root\/unit\/experiments\/2021.10\/downstream.distillation.round2.2_score\/round2.nway6.cosine.ib\/examples.64.json",
    "collection": "data\/lm_vectors\/colbert\/sample_corpus_3.tsv",
    "queries": "\/future\/u\/okhattab\/data\/MSMARCO\/queries.train.tsv",
    "index_name": "nbits_2",
    "overwrite": false,
    "root": "data\/lm_vectors\/colbert\/sample",
    "experiment": "corpus",
    "index_root": null,
    "name": "2024-06\/10\/18.05.57",
    "rank": 0,
    "nranks": 1,
    "amp": true,
    "gpus": 1,
    "avoid_fork_if_possible": false
}
[Jun 10, 18:06:22] #> Loading collection...
0M 
[Jun 10, 18:06:24] [0]           # of sampled PIDs = 3   sampled_pids[:3] = [1, 0, 2]
[Jun 10, 18:06:24] [0]           #> Encoding 3 passages..
[Jun 10, 18:06:24] [0]           avg_doclen_est = 90.33333587646484      len(local_sample) = 3
[Jun 10, 18:06:24] [0]           Creating 256 partitions.
[Jun 10, 18:06:24] [0]           *Estimated* 271 embeddings.
[Jun 10, 18:06:24] [0]           #> Saving the indexing plan to data/lm_vectors/colbert/sample/corpus/indexes/nbits_2/plan.json ..
WARNING clustering 258 points to 256 centroids: please provide at least 9984 training points
Clustering 258 points in 128D to 256 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.00 s
  Iteration 19 (0.00 s, search 0.00 s): objective=0.0608484 imbalance=1.008 nsplit=0       
[Jun 10, 18:06:25] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Jun 10, 18:06:25] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[0.024, 0.04, 0.04, 0.033, 0.038, 0.047, 0.022, 0.026, 0.037, 0.056, 0.035, 0.04, 0.042, 0.038, 0.017, 0.035, 0.036, 0.025, 0.044, 0.03, 0.038, 0.039, 0.025, 0.04, 0.03, 0.051, 0.022, 0.043, 0.057, 0.052, 0.036, 0.038, 0.039, 0.042, 0.036, 0.054, 0.017, 0.041, 0.036, 0.02, 0.018, 0.047, 0.046, 0.048, 0.026, 0.042, 0.043, 0.044, 0.031, 0.041, 0.038, 0.039, 0.034, 0.019, 0.028, 0.049, 0.044, 0.024, 0.046, 0.027, 0.019, 0.039, 0.026, 0.033, 0.032, 0.03, 0.05, 0.024, 0.021, 0.023, 0.044, 0.039, 0.037, 0.036, 0.041, 0.026, 0.048, 0.033, 0.034, 0.038, 0.034, 0.033, 0.039, 0.034, 0.044, 0.054, 0.038, 0.028, 0.051, 0.035, 0.037, 0.019, 0.029, 0.034, 0.033, 0.038, 0.024, 0.045, 0.033, 0.049, 0.059, 0.045, 0.023, 0.047, 0.047, 0.03, 0.042, 0.036, 0.023, 0.02, 0.015, 0.025, 0.042, 0.034, 0.029, 0.025, 0.033, 0.027, 0.041, 0.022, 0.02, 0.046, 0.044, 0.047, 0.025, 0.036, 0.025, 0.038]
[Jun 10, 18:06:25] #> Got bucket_cutoffs_quantiles = tensor([0.2500, 0.5000, 0.7500], device='cuda:0') and bucket_weights_quantiles = tensor([0.1250, 0.3750, 0.6250, 0.8750], device='cuda:0')
[Jun 10, 18:06:25] #> Got bucket_cutoffs = tensor([-0.0222,  0.0000,  0.0230], device='cuda:0') and bucket_weights = tensor([-0.0479, -0.0089,  0.0083,  0.0513], device='cuda:0')
[Jun 10, 18:06:25] avg_residual = 0.0355224609375
0it [00:00, ?it/s][Jun 10, 18:06:25] [0]                 #> Encoding 3 passages..
[Jun 10, 18:06:25] [0]           #> Saving chunk 0:      3 passages and 271 embeddings. From #0 onward.
1it [00:00, 79.91it/s]
[Jun 10, 18:06:25] [0]           #> Checking all files were saved...
[Jun 10, 18:06:25] [0]           Found all files!
[Jun 10, 18:06:25] [0]           #> Building IVF...
[Jun 10, 18:06:25] [0]           #> Loading codes...
100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4466.78it/s]
[Jun 10, 18:06:25] [0]           Sorting codes...
[Jun 10, 18:06:25] [0]           Getting unique codes...
[Jun 10, 18:06:25] #> Optimizing IVF to store map from centroids to list of pids..
[Jun 10, 18:06:25] #> Building the emb2pid mapping..
[Jun 10, 18:06:25] len(emb2pid) = 271
100%|█████████████████████████████████████████████████████| 256/256 [00:00<00:00, 215870.89it/s]
[Jun 10, 18:06:25] #> Saved optimized IVF to data/lm_vectors/colbert/sample/corpus/indexes/nbits_2/ivf.pid.pt
[Jun 10, 18:06:25] [0]           #> Saving the indexing metadata to data/lm_vectors/colbert/sample/corpus/indexes/nbits_2/metadata.json ..
#> Joined...
Traceback (most recent call last):
  File "/home/ale/projects/HippoRAG/src/colbertv2_indexing.py", line 41, in <module>
    kb_phrase_dict = pickle.load(open(args.phrase, 'rb'))
FileNotFoundError: [Errno 2] No such file or directory: 'output/sample_facts_and_sim_graph_phrase_dict_ents_only_lower_preprocess_ner.v3.subset.p'

When calling the retriever:

 File "/home/ale/projects/HippoRAG/src/hipporag.py", line 339, in load_important_files
    self.kb_phrase_dict = pickle.load(open(
FileNotFoundError: [Errno 2] No such file or directory: 'output/sample_facts_and_sim_graph_phrase_dict_ents_only_lower_preprocess_ner.v3.subset.p'
alew3 commented 3 weeks ago

OK, I fixed most of the errors by running the snippet below; the missing NLTK stopwords were making create_graph.py fail, which presumably left the downstream .tsv and pickle files unwritten. Maybe it should be included in the setup script?

import nltk
nltk.download('stopwords')
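
For reference, here is a minimal, idempotent version a setup script could run unconditionally (a sketch of my suggestion, not existing HippoRAG code):

import nltk

# Fetch the stopwords corpus only if it is not already installed,
# so repeated setup runs stay fast and work offline.
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords')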

But I still got this error:

Traceback (most recent call last):
  File "/home/ale/projects/HippoRAG/src/colbertv2_indexing.py", line 41, in <module>
    kb_phrase_dict = pickle.load(open(args.phrase, 'rb'))
FileNotFoundError: [Errno 2] No such file or directory: 'output/sample_facts_and_sim_graph_phrase_dict_ents_only_lower_preprocess_ner.v3.subset.p'
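
A sanity check before the pickle load in colbertv2_indexing.py would at least fail fast with a clearer message (a sketch, reusing the path from the traceback; the real script reads it from args.phrase):

import os
import pickle

phrase_path = 'output/sample_facts_and_sim_graph_phrase_dict_ents_only_lower_preprocess_ner.v3.subset.p'
if not os.path.exists(phrase_path):
    # This file is produced by an earlier pipeline step; if it is missing,
    # that step most likely failed (e.g. on the NLTK stopwords error above).
    raise SystemExit(f'Missing {phrase_path}: rerun the graph-creation step.')
with open(phrase_path, 'rb') as f:
    kb_phrase_dict = pickle.load(f)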
alew3 commented 3 weeks ago

> Could you follow the NLTK instructions:
>
>   Resource stopwords not found.
>   Please use the NLTK Downloader to obtain the resource:
>
>   >>> import nltk
>   >>> nltk.download('stopwords')
>
> And then see if there is still another problem? Thanks.

I replied above: I fixed most of the errors, but the run still didn't finish.

alew3 commented 3 weeks ago

Got it working by deleting everything and starting again, this time using the specific GPT turbo model name exactly as in the example. The LLM model name is embedded in the intermediate filenames, so a mismatched name makes later steps look for files that were never written.
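
For anyone hitting the same thing, a hypothetical sketch of that failure mode (the path pattern is inferred from the logs above, not taken from the HippoRAG source):

def openie_output_path(dataset: str, llm: str, version: int = 3) -> str:
    # Hypothetical helper mirroring the filename seen in the logs:
    # output/openie_sample_results_ner_gpt-3.5-turbo_3.json
    return f'output/openie_{dataset}_results_ner_{llm}_{version}.json'

# A run that extracted with one model string...
produced = openie_output_path('sample', 'gpt-3.5-turbo')
# ...is invisible to a later step configured with a different string.
wanted = openie_output_path('sample', 'gpt-3.5-turbo-0125')
assert produced != wanted  # hence the FileNotFoundError downstream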