castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Setup required before starting indexing. #1107

Closed rossbrown9879 closed 4 years ago

rossbrown9879 commented 4 years ago

I'm new to this project and I want to build an index on the MS-MARCO documents dataset for the document ranking task.

The README mentions using the following command to start indexing.

nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection \
 -generator LuceneDocumentGenerator -threads 1 -input msmarco-doc/collection \
 -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -storePositions -storeDocvectors -storeRawDocs \
 >& log.msmarco-doc.pos+docvectors+rawdocs &

I don't know much about this script. I ran the same command and got the following in a newly created file named log.msmarco-doc.pos+docvectors+rawdocs:

nohup: ignoring input
sh: 0: Can't open target/appassembler/bin/IndexCollection

From this, I understood that something like appassembler was missing, so I ran the following command:

$ mvn clean package appassembler:assemble

This gave me the following:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/usr/share/maven/lib/guice.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  0.078 s
[INFO] Finished at: 2020-04-20T15:33:58+05:30
[INFO] ------------------------------------------------------------------------
[ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/home/ruchit/Desktop). Please verify you invoked Maven from the correct directory. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MissingProjectException

I don't understand this error, and I don't know how to fix it. Can anyone provide the steps I need to follow to build an inverted index?

I'd also like to know whether there are any prebuilt indexes for the MS-MARCO document ranking dataset. I saw one index for the passage ranking dataset. If an index has already been built for the MS-MARCO document ranking dataset, please point me to its source. Thanks in advance.

lintool commented 4 years ago

What java version are you using?
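A quick way to gather this information is sketched below: it prints the Java, compiler, and Maven versions along with the current JAVA_HOME. The helper name `report_tool` is hypothetical; `java -version`, `javac -version`, and `mvn -version` are the standard invocations.

```shell
#!/bin/sh
# Sketch: report the toolchain details that usually matter for a Maven build.
# report_tool is a hypothetical helper name, not part of Anserini.

# report_tool NAME: print NAME's version banner, or a note if it is missing.
report_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    # Some tools print their version banner to stderr, so merge it into stdout.
    "$1" -version 2>&1
  else
    echo "$1: not found on PATH"
  fi
}

report_tool java    # runtime used to launch the tools
report_tool javac   # compiler; its absence suggests a JRE-only install
report_tool mvn     # Maven also reports the Java home it is using
echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
```

If `javac` is reported as missing while `java` is present, the machine may only have a runtime installed rather than a full JDK.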

rossbrown9879 commented 4 years ago

@lintool thanks for the quick reply. I'm using the following Java version.

openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+10-post-Ubuntu-1ubuntu119.10.1)
OpenJDK 64-Bit Server VM (build 11.0.6+10-post-Ubuntu-1ubuntu119.10.1, mixed mode, sharing)

Looks like I've made some progress on this. I'm sorry for the confusion. What I've done so far:

1. Cloned this repo to Desktop
2. cd Desktop/anserini
3. Created a target folder using mvn clean appassembler:assemble
4. Created a directory msmarco-doc/collection and downloaded the zipped corpus into this folder
5. Ran the script for indexing

But I'm still getting the following in the output file.

nohup: ignoring input
Error: Could not find or load main class io.anserini.index.IndexCollection
Caused by: java.lang.ClassNotFoundException: io.anserini.index.IndexCollection

Earlier, when I reported this issue, I had not cloned the repository; I was just running mvn clean appassembler:assemble, so the target folder could not be created. I have now done that part, but the ClassNotFoundException still arises. Thank you again.
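The two failures so far (no POM, then a missing main class) can be checked for before launching the indexer. The following is a minimal sketch, assuming it is run from the root of the cloned repository. `check_build_tree` is a hypothetical helper, and the `anserini-*.jar` pattern is an assumption based on the jar name that appears in the Maven log in this thread. A missing project jar is one plausible cause of the ClassNotFoundException, though the thread does not confirm it.

```shell
#!/bin/sh
# check_build_tree DIR: report whether DIR looks like a built Anserini checkout.
# Hypothetical helper; the checks mirror the files named in this thread.
check_build_tree() {
  dir="${1:-.}"
  status=0
  # Maven needs pom.xml in the directory it is invoked from.
  if [ ! -f "$dir/pom.xml" ]; then
    echo "missing: $dir/pom.xml (run mvn from the repository root)"
    status=1
  fi
  # The launcher script is generated by the appassembler:assemble goal.
  if [ ! -f "$dir/target/appassembler/bin/IndexCollection" ]; then
    echo "missing: $dir/target/appassembler/bin/IndexCollection"
    status=1
  fi
  # The project jar is produced by the package phase.
  if ! ls "$dir"/target/anserini-*.jar >/dev/null 2>&1; then
    echo "missing: project jar under $dir/target"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "ok: $dir looks built"
  fi
  return "$status"
}

# Check the current directory; do not abort the shell if something is missing.
check_build_tree . || true
```

Each "missing" line points at a different step: the first at the working directory, the second at the appassembler goal, and the third at the package phase.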

rossbrown9879 commented 4 years ago

When I deleted the previously created target folder and ran `mvn clean package appassembler:assemble`, my build failed and I got the following:


Results :

Tests run: 251, Failures: 0, Errors: 0, Skipped: 0

[INFO] 
[INFO] --- jacoco-maven-plugin:0.8.2:report (report) @ anserini ---
[INFO] Loading execution data file /home/ruchit/Desktop/anserini/target/jacoco.exec
[INFO] Analyzed bundle 'Anserini' with 265 classes
[INFO] 
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ anserini ---
[INFO] Building jar: /home/ruchit/Desktop/anserini/target/anserini-0.9.1-SNAPSHOT.jar
[INFO] 
[INFO] --- maven-javadoc-plugin:3.1.0:jar (attach-javadocs) @ anserini ---
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:54 min
[INFO] Finished at: 2020-04-20T17:58:28+05:30
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.1.0:jar (attach-javadocs) on project anserini: MavenReportException: Error while generating Javadoc: Unable to find javadoc command: The environment variable JAVA_HOME is not correctly set. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[1]+  Exit 1                  nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator LuceneDocumentGenerator -threads 1 -input msmarco-doc/collection -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -storePositions -storeDocvectors -storeRawDocs &> log.msmarco-doc.pos+docvectors+rawdocs

lintool commented 4 years ago

Your build is failing:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:54 min
[INFO] Finished at: 2020-04-20T17:58:28+05:30
[INFO] ------------------------------------------------------------------------

Scroll further up in your Maven output to see why.
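When the output is long, one option is to save it to a file and search for the lines Maven tags with [ERROR] rather than scrolling. A minimal sketch, in which `build.log` and the helper `first_maven_error` are hypothetical names:

```shell
#!/bin/sh
# first_maven_error FILE: print the first line Maven tagged as [ERROR],
# prefixed with its line number, so the surrounding context can be located.
first_maven_error() {
  grep -n '^\[ERROR\]' "$1" | head -n 1
}

# Typical use (left commented out here, since it triggers a full build):
#   mvn clean package appassembler:assemble 2>&1 | tee build.log
#   first_maven_error build.log
```

The first [ERROR] line usually names the failing plugin goal and the reason, which is the part worth quoting in a report.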

rossbrown9879 commented 4 years ago

@lintool The part above Results : in the log I provided in the comment above contains a list of tests like this.

Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 18.181 sec
Running io.anserini.analysis.EnglishStemmingAnalyzerTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running io.anserini.analysis.TweetTokenizationTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.016 sec
Running io.anserini.doc.JDIQ2018EffectivenessDocsTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.034 sec
Running io.anserini.doc.GenerateRegressionDocsTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.542 sec
Running io.anserini.kg.FreebaseTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running io.anserini.kg.FreebaseNodeTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running io.anserini.util.ExtractAverageDocumentLengthTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.254 sec
Running io.anserini.util.ExtractDocumentLengthsTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.255 sec
Running io.anserini.util.FeatureVectorTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.259 sec
Running io.anserini.util.ExtractTopDfTermsTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.261 sec
Running io.anserini.util.ExtractNormsTest

I see that none of the tests have any failures. The part before those tests contains the following warnings:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 (file:/usr/share/maven/lib/guice.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[INFO] Scanning for projects...
[INFO] 
[INFO] ------------------------< io.anserini:anserini >------------------------
[INFO] Building Anserini 0.9.1-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ anserini ---
[INFO] Deleting /home/ruchit/Desktop/anserini/target
[INFO] 
[INFO] --- jacoco-maven-plugin:0.8.2:prepare-agent (default) @ anserini ---
[INFO] argLine set to -javaagent:/home/ruchit/.m2/repository/org/jacoco/org.jacoco.agent/0.8.2/org.jacoco.agent-0.8.2-runtime.jar=destfile=/home/ruchit/Desktop/anserini/target/jacoco.exec
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ anserini ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 200 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.8.1:compile (default-compile) @ anserini ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 147 source files to /home/ruchit/Desktop/anserini/target/classes
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ anserini ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 40 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.8.1:testCompile (default-testCompile) @ anserini ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 91 source files to /home/ruchit/Desktop/anserini/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ anserini ---
[INFO] Surefire report directory: /home/ruchit/Desktop/anserini/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------

I do not see any log line mentioning the cause of the build failure. I cannot understand the following, which is logged along with BUILD FAILURE at the end.

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  05:57 min
[INFO] Finished at: 2020-04-20T18:12:06+05:30
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.1.0:jar (attach-javadocs) on project anserini: MavenReportException: Error while generating Javadoc: Unable to find javadoc command: The environment variable JAVA_HOME is not correctly set. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[1]+  Exit 1                  nohup sh target/appassembler/bin/IndexCollection -collection TrecCollection -generator LuceneDocumentGenerator -threads 1 -input msmarco-doc/collection -index lucene-index.msmarco-doc.pos+docvectors+rawdocs -storePositions -storeDocvectors -storeRawDocs &> log.msmarco-doc.pos+docvectors+rawdocs

rossbrown9879 commented 4 years ago

I think the build failed because of Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.1.0:ja but I'm not sure about this. Is it related to my Java version?

I think this is because the environment variable JAVA_HOME is not set. Can you tell me what value I should set it to?

lintool commented 4 years ago

What's the error associated with org.apache.maven.plugins:maven-javadoc-plugin:3.1.0?

rossbrown9879 commented 4 years ago

@lintool The following is the error: [ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.1.0:jar (attach-javadocs) on project anserini: MavenReportException: Error while generating Javadoc: Unable to find javadoc command: The environment variable JAVA_HOME is not correctly set. -> [Help 1] [ERROR]

lintool commented 4 years ago

Yes, you should set your JAVA_HOME appropriately then. Since I don't know the specifics of your setup, it would be easier for you to search online to find out how to do so...
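One common approach on Linux, offered as a sketch and not specific to this setup: resolve the `javac` binary through its symlinks and take the directory two levels up as JAVA_HOME. This assumes a full JDK is installed (the `javadoc` command ships with the JDK, not with a JRE alone) and that `readlink -f` is available, as it is with GNU coreutils. `java_home_from_bin` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: derive a JAVA_HOME candidate from the javac binary found on PATH.
# Assumes a full JDK is installed; java_home_from_bin is a hypothetical helper.

# java_home_from_bin PATH_TO_JAVAC
# Strips the trailing /bin/javac from a resolved path to get the JDK root.
java_home_from_bin() {
  dirname "$(dirname "$1")"
}

if javac_path="$(command -v javac)"; then
  # Follow symlinks (e.g. /usr/bin/javac may point into the real JDK directory).
  JAVA_HOME="$(java_home_from_bin "$(readlink -f "$javac_path")")"
  export JAVA_HOME
  echo "JAVA_HOME=$JAVA_HOME"
else
  echo "javac not found on PATH; a JDK may not be installed" >&2
fi
```

To make the setting persist across sessions, the resulting `export JAVA_HOME=...` line can be added to a shell startup file such as `~/.bashrc`, after which the Maven build can be retried.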

lintool commented 4 years ago

Having heard no follow-up, closing the issue. Reopen if necessary.