Georgetown-IR-Lab / OpenNIR

An end-to-end neural ad-hoc ranking pipeline.
https://opennir.net
MIT License

Error running pipeline #6

Closed by krasserm 4 years ago

krasserm commented 4 years ago

Thanks for the OpenNIR initiative and for sharing it with the community. While trying to run

bash scripts/pipeline.sh config/robust config/vanilla_bert

on Ubuntu 18.04 with Python 3.7, I get the following error after the downloaded Robust04 Anserini index has been extracted. Here's the entire output:

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
unable to import 'smart_open.gcs', disabling that module
[2020-04-13 07:00:49,789][trainer:pairwise][DEBUG] using GPU (deterministic)
[2020-04-13 07:00:49,800][onir.injector][DEBUG] Configuration:
 vocab       bert                                                            
-----------------------------------------------------------------------------
 bert_base   bert-base-uncased  |  bert_weights  [empty]  |  layer     -1    
 last_layer  False              |  train         True     |  encoding  joint 

 train_ds  robust                                   
----------------------------------------------------
 rankfn    bm25    |  subset  all  |  ranktopk  100 

 ranker   vanilla_transformer                                           
------------------------------------------------------------------------
 qlen     20                   |  dlen     2000  |  add_runscore  False 
 combine  linear               |  outputs  1                            

 trainer     pairwise                                                           
--------------------------------------------------------------------------------
 batch_size  16         |  batches_per_epoch  32     |  grad_acc_batch  2       
 optimizer   adam       |  lr                 0.001  |  gpu             True    
 gpu_determ  True       |  encoder_lr         2e-05  |  lossfn          softmax 
 pos_source  intersect  |  neg_source         run    |  sampling        query   
 pos_minrel  1          |  unjudged_rel       0      |  num_neg         1       
 margin      0.0                                                                

 valid_ds  robust                                   
----------------------------------------------------
 rankfn    bm25    |  subset  all  |  ranktopk  100 

 valid_pred  reranker                                                                   
----------------------------------------------------------------------------------------
 batch_size  1         |  gpu            True  |  gpu_determ  True                      
 preload     False     |  run_threshold  0     |  measures    map,ndcg,p@20,ndcg@20,mrr 
 source      run                                                                        

 test_ds  robust                                   
---------------------------------------------------
 rankfn   bm25    |  subset  all  |  ranktopk  100 

 test_pred   reranker                                                                   
----------------------------------------------------------------------------------------
 batch_size  1         |  gpu            True  |  gpu_determ  True                      
 preload     False     |  run_threshold  0     |  measures    map,ndcg,p@20,ndcg@20,mrr 
 source      run                                                                        

 pipeline      default                                                 
-----------------------------------------------------------------------
 max_epoch     1000     |  early_stop     20     |  warmup       -1    
 val_metric    p@20     |  purge_weights  True   |  test         False 
 initial_eval  False    |  skip_ds_init   False  |  only_cached  False 

[2020-04-13 07:00:49,800][valid_pred:reranker][DEBUG] using GPU (deterministic)

Will begin downloading Robust04 dataset.
Please confirm you agree to the authors' data usage stipulations found at
https://trec.nist.gov/data/cd45/index.html

answer [yes/no] yes
[2020-04-13 07:00:53,063][onir.indices.anserini][DEBUG] [starting] building /home/martin/data/onir/datasets/robust/anserini
[2020-04-13 07:00:53,064][onir.indices.anserini][DEBUG] [starting] building /home/martin/data/onir/datasets/robust/anserini.porter
[2020-04-13 07:00:53,067][onir.interfaces.java][DEBUG] [starting] initializing jnius
[2020-04-13 07:00:53,067][onir.interfaces.java][DEBUG] [starting] initializing jnius
[2020-04-13 07:00:53,108][onir.indices.sqlite][DEBUG] [starting] building /home/martin/data/onir/datasets/robust/docs.sqllite
[2020-04-13 07:00:53,631][onir.interfaces.java][DEBUG] [finished] initializing jnius [566ms]
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
[2020-04-13 07:00:53,648][onir.interfaces.java][DEBUG] [finished] initializing jnius [581ms]
[2020-04-13 07:03:50,863][onir.util.download][DEBUG] downloaded https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz [02:49] [1.82G] [12.9MB/s] [md5 hash verified]
extracting: 3.64GB [34.17s, 107MB/s]                                                                                                       
[2020-04-13 07:04:25,043][train_ds:robust][DEBUG] [starting] documents
[2020-04-13 07:07:47,486][train_ds:robust][DEBUG] [finished] documents: [03:22] [528030it] [2608.30it/s]
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 53, in _blocking_tee_iter
    raise ctrl.ex
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 53, in _blocking_tee_iter
    raise ctrl.ex
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 77, in run
    self.value = next(self._it)
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/martin/miniconda3/envs/sandbox-opennir/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/martin/miniconda3/envs/sandbox-opennir/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/indices/sqlite.py", line 37, in build
    for doc in documents:
RuntimeError: generator raised StopIteration

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 53, in _blocking_tee_iter
    raise ctrl.ex
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 53, in _blocking_tee_iter
    raise ctrl.ex
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 53, in _blocking_tee_iter
    raise ctrl.ex
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 77, in run
    self.value = next(self._it)
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/martin/miniconda3/envs/sandbox-opennir/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/martin/miniconda3/envs/sandbox-opennir/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/indices/anserini.py", line 303, in build
    for i, doc in enumerate(doc_iter):
RuntimeError: generator raised StopIteration

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 53, in _blocking_tee_iter
    raise ctrl.ex
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 53, in _blocking_tee_iter
    raise ctrl.ex
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 53, in _blocking_tee_iter
    raise ctrl.ex
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/util/concurrency.py", line 77, in run
    self.value = next(self._it)
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/martin/miniconda3/envs/sandbox-opennir/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/martin/miniconda3/envs/sandbox-opennir/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/martin/Development/sandbox/search/georgetown/OpenNIR/onir/indices/anserini.py", line 303, in build
    for i, doc in enumerate(doc_iter):
RuntimeError: generator raised StopIteration

Will begin downloading Robust04 dataset.
Please confirm you agree to the authors' data usage stipulations found at
https://trec.nist.gov/data/cd45/index.html

answer [yes/no]

After this error, I'm prompted again to agree to the Robust04 usage terms, and the download starts over, ending in the same error.

krasserm commented 4 years ago

I just tried re-running it with Python 3.6 instead of 3.7, and it works, so it's fine with me if you want to close this ticket. In that case, a hint in the documentation that Python 3.7 doesn't work would be helpful (happy to submit a PR).
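
A likely explanation for the 3.6 vs. 3.7 difference, judging only from the traceback: `RuntimeError: generator raised StopIteration` is the behavior introduced by PEP 479, which became the default in Python 3.7. A `StopIteration` that escapes from inside a generator body used to end the iteration silently (with a deprecation warning in 3.6); from 3.7 on it is converted to `RuntimeError`. The sketch below reproduces that pattern in isolation. `leaky_tee`, `safe_tee` and `outcome` are hypothetical names for illustration and are not OpenNIR code; I have not checked how `onir/util/concurrency.py` should actually be fixed.

```python
def leaky_tee(it):
    """Hypothetical generator that re-raises a StopIteration captured elsewhere."""
    captured = None
    try:
        next(it)
    except StopIteration as ex:
        captured = ex
    # Re-raising the captured StopIteration inside a generator body:
    # Python 3.6 and earlier treat this as a normal end of iteration,
    # Python 3.7+ convert it to RuntimeError (PEP 479).
    raise captured
    yield  # unreachable; its presence makes this function a generator


def safe_tee(it):
    """Same idea, but ends the generator with `return` instead of re-raising."""
    try:
        next(it)
    except StopIteration:
        return
    yield  # unreachable; its presence makes this function a generator


def outcome(gen_fn):
    """Drain the generator over an empty iterator and report what happened."""
    try:
        for _ in gen_fn(iter([])):
            pass
    except RuntimeError as ex:
        return "RuntimeError: {}".format(ex)
    return "finished cleanly"


if __name__ == "__main__":
    print("leaky_tee:", outcome(leaky_tee))
    print("safe_tee: ", outcome(safe_tee))
```

On Python 3.7 or newer, `leaky_tee` ends with the same `RuntimeError` message as in the log above, while `safe_tee` finishes cleanly on any version.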

seanmacavaney commented 4 years ago

Thanks! Yeah, the Python version definitely needed to be documented.
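
Beyond a note in the documentation, a start-up check could surface the problem before the download begins. This is only a sketch under the assumption reported in this thread (3.6 works, 3.7 does not); `FIRST_FAILING` and `version_problem` are hypothetical names, not part of OpenNIR.

```python
import sys

# Assumption taken from this thread only: 3.6 works, 3.7 fails.
FIRST_FAILING = (3, 7)


def version_problem(version_info=None):
    """Return a warning string if the interpreter is 3.7 or newer, else None."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    if (major, minor) >= FIRST_FAILING:
        return (
            "Python {}.{} detected; this thread reports failures on 3.7, "
            "while 3.6 is known to work.".format(major, minor)
        )
    return None


if __name__ == "__main__":
    message = version_problem()
    if message is not None:
        print(message, file=sys.stderr)
```

Returning a message instead of exiting leaves it to the caller to decide whether to warn or abort.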