Georgetown-IR-Lab / cedr

Code for CEDR: Contextualized Embeddings for Document Ranking, accepted at SIGIR 2019.
MIT License

Question about running extract_docs_from_index.py #37

Open yiyaxiaozhi opened 3 years ago

yiyaxiaozhi commented 3 years ago

I am trying to run extract_docs_from_index.py with the command below, using the pre-built index provided by Pyserini:

awk '{print $3}' data/robust/*.run | python extract_docs_from_index.py lucene index-robust04-20191213/ > data/robust/documents.tsv

but I get an error (the screenshot is not preserved here), and I have not changed any code in the file.

My Java version is:

openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~20.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)

Do I have the correct Java version?

Could you give some advice on this error? Thanks a lot!
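
For readers following the command above: the awk step prints the third whitespace-separated column of each run file, which in the standard TREC run format (query ID, Q0, document ID, rank, score, tag) is the document ID. A minimal Python sketch of the same step follows. The function name doc_ids_from_runs is hypothetical and not part of cedr, and unlike the awk one-liner it drops repeated IDs.

```python
def doc_ids_from_runs(paths):
    """Yield each distinct document ID (third column) from TREC run files."""
    seen = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                if len(cols) < 3:
                    continue  # skip blank or malformed lines
                doc_id = cols[2]
                if doc_id not in seen:
                    seen.add(doc_id)
                    yield doc_id
```

The yielded IDs could be written one per line and piped into extract_docs_from_index.py in place of the awk output, for example with paths taken from glob.glob("data/robust/*.run").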


Update: I indexed the Robust04 document files myself and ran extract_docs_from_index.py successfully. I then checked the resulting documents.tsv file with pandas and found 73855 records. I don't know how many records there should be, and I would appreciate it if you could tell me the correct number.

WHU-ZQH commented 3 years ago

I also have problems running extract_docs_from_index.py. Could you please release the documents.tsv file?

seanmacavaney commented 3 years ago

We cannot release the dataset directly due to the data usage agreement. However, I could provide a script that builds the file from the ir-datasets package, if that would help? Note that for this to work, you would need the original dataset source files.

Let me know if this is something you'd want.
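
For illustration only, here is a rough sketch of what such a script might look like. It is not the script offered above. It assumes the ir_datasets Python package, the dataset ID "trec-robust04", documents exposing doc_id and text fields, and an output layout of doc<TAB>id<TAB>text; all of these should be checked against the ir_datasets documentation and the cedr README. Both function names are hypothetical, and the original dataset source files are still required.

```python
import re


def format_doc_line(doc_id, text):
    """Build one 'doc<TAB>id<TAB>text' line with whitespace collapsed."""
    flat = re.sub(r"\s+", " ", text).strip()
    return f"doc\t{doc_id}\t{flat}"


def write_documents(doc_ids, out, dataset_id="trec-robust04"):
    """Look up each ID via ir_datasets and write one line per document."""
    # Third-party package, imported lazily so format_doc_line works without it.
    import ir_datasets

    # dataset_id is an assumption; the source files must already be in place.
    store = ir_datasets.load(dataset_id).docs_store()
    for doc_id in doc_ids:
        doc = store.get(doc_id)
        out.write(format_doc_line(doc.doc_id, doc.text) + "\n")
```

Collapsing whitespace keeps every document on a single line, so tabs or newlines inside the text cannot break the tab-separated layout.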

WHU-ZQH commented 3 years ago

I have the same problem as yiyaxiaozhi. Could you please tell me how to solve it?

seanmacavaney commented 3 years ago

The error says that the index was created with a newer Lucene version than the current software supports. I think you should be able to add a codecs JAR to your CLASSPATH to overcome this. Here's one that might work: https://github.com/Georgetown-IR-Lab/OpenNIR/blob/master/onir/resources/lucene-backward-codecs-8.0.0.jar

You'll probably need to add it to the classpath here: https://github.com/Georgetown-IR-Lab/cedr/blob/master/cedr/extract_docs_from_index.py#L25

Let me know if this helps!
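
A minimal sketch of the classpath change, assuming the script configures the JVM through pyjnius's jnius_config module (check line 25 of your copy to confirm). The helper names and any JAR paths passed in are placeholders, not the values used in the repository.

```python
import os


def build_classpath(*jars):
    """Return the given JAR paths, raising early if any file is missing."""
    missing = [j for j in jars if not os.path.isfile(j)]
    if missing:
        raise FileNotFoundError(f"JAR not found: {', '.join(missing)}")
    return list(jars)


def configure_jvm(*jars):
    """Set the pyjnius classpath; this must run before `import jnius`."""
    # jnius_config ships with pyjnius (third-party); imported lazily here.
    import jnius_config

    jnius_config.set_classpath(*build_classpath(*jars))
```

For example, configure_jvm("path/to/anserini.jar", "path/to/lucene-backward-codecs-8.0.0.jar") would put both JARs on the classpath, with both paths standing in for wherever the files live on your machine. Checking that the files exist first avoids a harder-to-read Java error later.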

WHU-ZQH commented 3 years ago

Thanks very much! That solved the problem, but I still have a question: how did you obtain the "train_pairs" in your study?

Akakaala commented 2 years ago

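Hello, could I see a sample of the documents.tsv file you produced? I was not able to obtain the complete file, so I don't know what form the data should be processed into.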

yysirs commented 2 years ago

Hi, could you please share the files under your index-robust04-20191213 folder?