dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

Upgrading Palmetto to Lucene 6.2.1 #8

Closed cgravier closed 7 years ago

cgravier commented 7 years ago

Hello,

I am trying to upgrade Palmetto to Lucene 6.2.1. The current Lucene version clashes with another Lucene version on our classpath, and this seems like a good opportunity to suggest a PR, if one is welcome.

I have already made some changes in a new branch of my fork (in progress): https://github.com/cgravier/Palmetto/tree/lucene621

When I try to run mvn clean install, I get the trace below.

It seems to me that the test resource file src/test/resources/test_bd/segments_w was created with a Lucene 3.x version.

I didn't find any routine to recreate those resource files (with Lucene 6.x, for instance).

Is it possible to reconstruct those files?

Thank you in advance,

cgravier

[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building Palmetto 0.1.2
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ palmetto ---
[INFO] Deleting /Users/cgravier/Downloads/Palmetto/palmetto/target
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ palmetto ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ palmetto ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 99 source files to /Users/cgravier/Downloads/Palmetto/palmetto/target/classes
[WARNING] /Users/cgravier/Downloads/Palmetto/palmetto/src/main/java/org/aksw/palmetto/corpus/lucene/creation/AbstractLuceneIndexCreator.java: Some input files use or override a deprecated API.
[WARNING] /Users/cgravier/Downloads/Palmetto/palmetto/src/main/java/org/aksw/palmetto/corpus/lucene/creation/AbstractLuceneIndexCreator.java: Recompile with -Xlint:deprecation for details.
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ palmetto ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 6 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ palmetto ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 78 source files to /Users/cgravier/Downloads/Palmetto/palmetto/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ palmetto ---
[INFO] Surefire report directory: /Users/cgravier/Downloads/Palmetto/palmetto/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.aksw.palmetto.calculations.direct.CondProbCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.093 sec
Running org.aksw.palmetto.calculations.direct.DifferenceBasedCoherenceCalculationTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.aksw.palmetto.calculations.direct.FitelsonCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.aksw.palmetto.calculations.direct.JaccardCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.aksw.palmetto.calculations.direct.LikelihoodCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
Running org.aksw.palmetto.calculations.direct.LogCondProbCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec
Running org.aksw.palmetto.calculations.direct.LogJaccardCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 sec
Running org.aksw.palmetto.calculations.direct.LogLikelihoodCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.aksw.palmetto.calculations.direct.LogRatioCoherenceCalculationTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec
Running org.aksw.palmetto.calculations.direct.NormalizedLogRatioCoherenceCalculationTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.aksw.palmetto.calculations.direct.OlssonsCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running org.aksw.palmetto.calculations.direct.RatioCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running org.aksw.palmetto.calculations.direct.ShogenjisCoherenceCalculationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
Running org.aksw.palmetto.calculations.indirect.CentroidBasedCoherenceTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.aksw.palmetto.calculations.indirect.CosinusBasedCalculationTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running org.aksw.palmetto.calculations.indirect.CosinusBasedCoherenceTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.032 sec
Running org.aksw.palmetto.calculations.indirect.DiceBasedCalculationTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec
Running org.aksw.palmetto.calculations.indirect.DiceBasedCoherenceTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running org.aksw.palmetto.calculations.indirect.JaccardBasedCalculationTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec
Running org.aksw.palmetto.calculations.indirect.JaccardBasedCoherenceTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.aksw.palmetto.calculations.indirect.VectorCreationTest
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.022 sec
Running org.aksw.palmetto.corpus.lucene.creation.PositionStoringLuceneIndexCreatorTest
2016-11-04 15:49:18,631 INFO [org.aksw.palmetto.corpus.lucene.creation.PositionStoringLuceneIndexCreator] - <Starting index creation...>
2016-11-04 15:49:18,984 INFO [org.aksw.palmetto.corpus.lucene.creation.PositionStoringLuceneIndexCreator] - <Finished index creation.>
2016-11-04 15:49:19,088 INFO [org.aksw.palmetto.corpus.lucene.creation.LuceneIndexHistogramCreator] - <Saw 3 documents.>
2016-11-04 15:49:19,089 INFO [org.aksw.palmetto.corpus.lucene.creation.LuceneIndexHistogramCreator] - <Counted 14 tokens.>
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.578 sec
Running org.aksw.palmetto.corpus.lucene.creation.SimpleLuceneIndexCreatorTest
2016-11-04 15:49:19,109 INFO [org.aksw.palmetto.corpus.lucene.creation.SimpleLuceneIndexCreator] - <Starting index creation...>
2016-11-04 15:49:19,113 INFO [org.aksw.palmetto.corpus.lucene.creation.SimpleLuceneIndexCreator] - <Finished index creation.>
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.024 sec
Running org.aksw.palmetto.corpus.lucene.LuceneCorpusAdapterForSlidingWindowsTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.006 sec <<< FAILURE!
test(org.aksw.palmetto.corpus.lucene.LuceneCorpusAdapterForSlidingWindowsTest)  Time elapsed: 0.006 sec  <<< ERROR!
org.apache.lucene.index.IndexFormatTooOldException: Format version is not supported (resource BufferedChecksumIndexInput(NIOFSIndexInput(path="/Users/cgravier/Downloads/Palmetto/palmetto/src/test/resources/test_bd/segments_w"))): 0 (needs to be between 4 and 6). This version of Lucene only supports indexes created with release 5.0 and later.
    at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:213)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:297)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:284)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:57)
    at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:54)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:685)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:77)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
    at org.aksw.palmetto.corpus.lucene.WindowSupportingLuceneCorpusAdapter.create(WindowSupportingLuceneCorpusAdapter.java:48)
    at org.aksw.palmetto.corpus.lucene.LuceneCorpusAdapterForSlidingWindowsTest.test(LuceneCorpusAdapterForSlidingWindowsTest.java:35)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
    at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
    at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

Running org.aksw.palmetto.corpus.lucene.SimpleAnalyzerTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.aksw.palmetto.evaluate.correlation.KendallsTauTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.aksw.palmetto.evaluate.correlation.PearsonsSampleCorrelationCoefficientTest
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.023 sec
Running org.aksw.palmetto.evaluate.correlation.SpearmanTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running org.aksw.palmetto.prob.bd.BitSetBasedBooleanDocumentFrequencyDeterminerTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.aksw.palmetto.prob.bd.BooleanDocumentFrequencyDeterminerPerformanceTest
BooleanDocument performance test BitSetBased: 150 ms    ListBased: 189 ms
BooleanDocument performance test BitSetBased: 93 ms ListBased: 71 ms
BooleanDocument performance test BitSetBased: 90 ms ListBased: 78 ms
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.731 sec
Running org.aksw.palmetto.prob.bd.BooleanDocumentProbabilitySupplierTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.aksw.palmetto.prob.BooleanSlidingWindowFrequencyDeterminerCountingTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running org.aksw.palmetto.prob.BooleanSlidingWindowFrequencyDeterminerSumCreationTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.aksw.palmetto.prob.BooleanSlidingWindowProbabilitySupplierTest
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running org.aksw.palmetto.prob.ContextWindowFrequencyDeterminerCountingTest
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.aksw.palmetto.prob.decorator.FrequencyCachingDeterminerDecoratorTest
2016-11-04 15:49:20,048 INFO [org.aksw.palmetto.prob.decorator.FrequencyCachingDeterminerDecoratorTest] - <Testing cache...>
2016-11-04 15:49:20,736 INFO [org.aksw.palmetto.prob.decorator.FrequencyCachingDeterminerDecoratorTest] - <Testing cache...>
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.594 sec
Running org.aksw.palmetto.prob.ListBasedBooleanDocumentFrequencyDeterminerTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.aksw.palmetto.subsets.AllAllTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.subsets.AllOneTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.subsets.AnyAnyTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.aksw.palmetto.subsets.OneAllTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.subsets.OneAnyTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.subsets.OneOneAndSelfTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.subsets.OneOneTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec
Running org.aksw.palmetto.subsets.OnePrecedingTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.subsets.OneSetTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.subsets.OneSubsequentTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.sum.ArithmeticMeanTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.aksw.palmetto.sum.GeometricMeanTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.sum.HarmonicMeanTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.aksw.palmetto.sum.MaxTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.aksw.palmetto.sum.MedianTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
Running org.aksw.palmetto.sum.MinTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running org.aksw.palmetto.sum.QuadraticMeanTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 sec
Running org.aksw.palmetto.vector.CondProbCalculationBasedCreatorTest
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.008 sec
Running org.aksw.palmetto.vector.DifferenceCalculationBasedCreatorTest
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.02 sec
Running org.aksw.palmetto.vector.FitelsonCalculationBasedCreatorTest
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running org.aksw.palmetto.vector.JaccardCalculationBasedCreatorTest
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec
Running org.aksw.palmetto.vector.LikelihoodCalculationBasedCreatorTest
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec
Running org.aksw.palmetto.vector.LogCondProbCalculationBasedCreatorTest
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.aksw.palmetto.vector.LogJaccardCoherenceCalculationTest
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.023 sec
Running org.aksw.palmetto.vector.LogLikelihoodCalculationBasedCreatorTest
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.013 sec
Running org.aksw.palmetto.vector.LogRatioCalculationBasedCreatorTest
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.013 sec
Running org.aksw.palmetto.vector.NormalizedLogRatioCalculationBasedCreatorTest
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running org.aksw.palmetto.vector.ProbabilityBasedVectorCreatorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.aksw.palmetto.vector.RatioCoherenceCalculationTest
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running org.aksw.palmetto.weight.CompleteProbabilityBasedWeighterTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.aksw.palmetto.weight.ConditionalProbabilityBasedWeighterTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.aksw.palmetto.weight.MarginalProbabilityBasedWeighterTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.aksw.palmetto.weight.WordSetSizeBasedWeighterTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec

Results :

Tests in error: 
  test(org.aksw.palmetto.corpus.lucene.LuceneCorpusAdapterForSlidingWindowsTest): Format version is not supported (resource BufferedChecksumIndexInput(NIOFSIndexInput(path="/Users/cgravier/Downloads/Palmetto/palmetto/src/test/resources/test_bd/segments_w"))): 0 (needs to be between 4 and 6). This version of Lucene only supports indexes created with release 5.0 and later.

Tests run: 427, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.645 s
[INFO] Finished at: 2016-11-04T15:49:21+01:00
[INFO] Final Memory: 22M/272M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project palmetto: There are test failures.
[ERROR] 
[ERROR] Please refer to /Users/cgravier/Downloads/Palmetto/palmetto/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
MichaelRoeder commented 7 years ago

Hi cgravier,

thanks for your effort to integrate Lucene 6.2.1 into Palmetto. A PR would be very helpful.

Unfortunately, the data that I used for creating the test index is no longer available. However, the test case was not well designed, and I think that I can create a better one with some other data.

The Wikipedia index that I am using was created with Lucene 4. Do you know whether Lucene 4.x.x indexes are too old to be opened with Lucene 6.2.1?

Cheers, Michael Röder

cgravier commented 7 years ago

Hi Michael,

I don't think Lucene 4.x would work here. The error message says: This version of Lucene only supports indexes created with release 5.0 and later.

Since this would prevent any further Lucene updates beyond Lucene 4.x, I think writing a better unit test is the best way to go (though it means more work for you :s).

Christophe

MichaelRoeder commented 7 years ago

I see. That means that I can merge the PR into master but cannot release it, because it wouldn't work with my current Wikipedia index.

However, I should update the index anyway ;)

So feel free to add @Ignore to the test case that needs further work and create the PR.

cgravier commented 7 years ago

Yes ;-)

Good point on @Ignore

That said, do you have an idea of how well the unit tests cover the Lucene-related features? (If the fork passes all tests, what's the rough probability of not having broken something? ;-))

MichaelRoeder commented 7 years ago

A very good question. There is one test case that makes sure that the StandardAnalyzer of Lucene is working as expected. More important are the two test classes in this package: https://github.com/AKSW/Palmetto/tree/master/palmetto/src/test/java/org/aksw/palmetto/corpus/lucene/creation

They create a temporary index, query it, and compare the results with the expected responses. From that point of view, I would argue that the failing LuceneCorpusAdapterForSlidingWindowsTest is no longer needed, because PositionStoringLuceneIndexCreatorTest already tests the same piece of code in a better way.

cgravier commented 7 years ago

OK. So far, so good: Tests run: 427, Failures: 0, Errors: 0, Skipped: 1

All tests but LuceneCorpusAdapterForSlidingWindowsTest are successful 👍

I made several changes in different files, but I wasn't familiar with the Lucene API until now, especially 6.2.1, so I hope that other, untested things aren't screwed up. Anyway, I can certainly prepare a PR.

So far, I bumped Palmetto version to <version>0.1.2</version> - I wanted to keep track of the version I was working with.

Is this acceptable, or would you prefer a PR with version 0.1.1, so that you can make a new commit to tag the new version?

In the meantime, I also bumped Java to <java.version>1.8</java.version> in the pom.xml (I had some Java version issues). Would you like to take the opportunity to update this as well, or should I submit the PR with 1.7?

MichaelRoeder commented 7 years ago

Version 0.1.2 and Java 1.8 are fine. Thanks!

MichaelRoeder commented 7 years ago

Changes merged (#9).

Thanks again for your efforts.

cgravier commented 7 years ago

I am the one who should thank you. We are academic researchers who are happy to have come across the Palmetto implementation, and contributing when possible is only natural.

Thanks!

Velcin commented 7 years ago

Now that Palmetto works with Lucene 6.2.1 (thank you Christophe!), I need to update the index based on Wikipedia (in English, but later for other languages, such as French).

Is there anything I should know before processing the huge Wikipedia file? I mean regarding the way Palmetto uses the Lucene index.

Julien

MichaelRoeder commented 7 years ago

1) You will have to preprocess the Wikipedia dump, i.e., remove all markup and the like (but I think that you are aware of that). This can be a very annoying task since some markup can become very complicated. We decided to simply remove all markup (including tables). Additionally, you should remove pages that are not helpful for calculating the coherences, as described in the paper:

All corpora as well as the complete Wikipedia used as reference corpus are preprocessed using lemmatization and stop word removal. Additionally, we removed portal and category articles, redirection and disambiguation pages as well as articles about single years.

2) You have to decide whether you would like to use a lemmatizer in your text preprocessing or not. This is an important decision because it has a direct influence on the co-occurrence of your words. So it depends on the preprocessing you are using for your topic modeling algorithm. If you are using topic modeling on lemmatized words, the index should contain lemmatized forms as well. If you are not using a lemmatizer, you shouldn't use one for the index creation.

3) In most cases you will have to create an index that contains the positions of the words inside the documents. This is needed to use more complex coherences like UCI or NPMI. The creation of the index is described in this wiki article.

Feel free to ask questions if you encounter problems.

If you create new indexes, you might want to share them with the community, as van der Zwaan et al. did: they gave me a Dutch index to host.

Velcin commented 7 years ago

Thank you Michael! I'll give it a try very soon. To help me save time, do you have any advice on how to handle the huge Wikipedia dump (~56 GB unzipped), extract the needed information, and build the Lucene index?

MichaelRoeder commented 7 years ago

I split the dump into smaller parts of 10,000 documents per file. This has the advantage that you can a) process the data on several machines in parallel and b) keep better track of which parts of the dump have been processed in case the preprocessing crashes. Additionally, I used a pipeline of separate steps that were executed as individual programs and stored their intermediate results. This makes the complete processing slower because you have to start each step manually, but I think it was a good idea because I was able to react to problems: e.g., if the lemmatizer encountered a problem, I could adapt this step and restart it without affecting the other parts of the pipeline. That would not have been as easy if the complete preprocessing had been written as one single program.

I think the general preprocessing workflow is straightforward:

  1. separate the dump (e.g., in smaller XML files)
  2. parse the wikipedia XML
    • filter documents that are single years, categories, portals, redirects, ...
    • remove wiki markup
  3. preprocess the remaining documents
    • POS tagging, lemmatizing
    • remove stop words
  4. create the index from the preprocessed documents as described in the wiki
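
Step 1 of the workflow above could be sketched roughly as follows. This is only an illustration, not the code that was actually used: it assumes the documents are already available as an iterable of strings, and the names `batched` and `write_parts`, the `part-00000.txt` file naming, and the blank-line separator are hypothetical choices.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def batched(documents: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` documents."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_parts(documents: Iterable[str], out_dir: Path, size: int = 10_000) -> List[Path]:
    """Write each batch to its own numbered file and return the file paths.

    The file names and the blank-line separator are assumptions made
    for this sketch only.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, batch in enumerate(batched(documents, size)):
        path = out_dir / f"part-{index:05d}.txt"
        path.write_text("\n\n".join(batch), encoding="utf-8")
        paths.append(path)
    return paths
```

Because every part is a separate file, a crashed run can be resumed by skipping the parts whose output already exists, and different parts can be handed to different machines.
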
Velcin commented 7 years ago

Thank you again for your quick answer. I'll test this very soon and let you know. I'm just wondering how much of my lifetime those operations will take :).

Velcin commented 7 years ago

I've found the following script: https://github.com/attardi/wikiextractor

It works very well for splitting the big XML files into manageable files. However, I don't know how to filter out category, portal, redirect... pages from the extracted docs. Any idea?

MichaelRoeder commented 7 years ago

I tried to find the piece of code that did the filtering to give you a complete list of title prefixes, but I haven't found it so far. Unfortunately, I am busy with another project at the moment and cannot invest much time in that.

However, filtering is pretty easy. First, you should check the title, because there are prefixes that show that the page should be filtered, like Category:, Wikipedia:, Portal: and so on. The titles of year articles contain only numbers (and maybe a BC at the end of the title, e.g., 1 BC). Redirects can be found in two ways. First, an article can start with #REDIRECT and the title it is redirecting to. Second, the XML schema that is used for the Wikipedia dumps contains the information whether an article is a redirect.
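
A minimal sketch of the title and redirect checks described above, assuming each extracted document is available as a title plus its plain text. The prefix tuple only contains the three prefixes named in this thread and is certainly incomplete; the function names and the case-insensitive redirect check are assumptions of this example, not the original filtering code.

```python
import re

# Prefixes named in the thread. The real list is longer ("and so on"),
# so this tuple is an illustrative assumption that needs extending.
FILTERED_PREFIXES = ("Category:", "Wikipedia:", "Portal:")

# Titles that consist only of digits, optionally followed by " BC" (e.g. "1 BC").
YEAR_TITLE = re.compile(r"^\d+( BC)?$")


def is_redirect(text: str) -> bool:
    """Return True if the article text starts with a #REDIRECT directive."""
    # Comparing in upper case is an assumption, so "#redirect" is caught too.
    return text.lstrip().upper().startswith("#REDIRECT")


def should_filter(title: str, text: str = "") -> bool:
    """Return True if the page should be dropped before indexing."""
    title = title.strip()
    return (
        title.startswith(FILTERED_PREFIXES)
        or YEAR_TITLE.match(title) is not None
        or is_redirect(text)
    )
```

For example, `should_filter("1 BC")` and `should_filter("Category:Physics")` are both true, while an ordinary article title with ordinary text is kept. Disambiguation pages are not covered by this sketch.
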

MichaelRoeder commented 7 years ago

There is a list of portals https://en.wikipedia.org/wiki/Portal:Contents/Portals and a list of categories https://en.wikipedia.org/wiki/Portal:Contents/Categories

I think they might be very helpful. However, the prefix-based approach described above is easier to implement and should work as well.

Velcin commented 7 years ago

Thank you so much. So I guess I can simply post-process the output of wikiextractor based on the titles only, using regular expressions plus the lists of portals and categories. Unfortunately, this script gives no access to the XML tags. I plan to test this very soon and will let you know.

Velcin commented 7 years ago

Here I am: I succeeded in getting the right Wikipedia dumps, splitting them, cleaning them, and indexing them. Thank you for your help! I suggest that you update the web link to the following URL: http://mediamining.univ-lyon2.fr/velcin/public/misc/wiki/en_index.tar.gz (it was computed from a dump dated 1 May 2016)

By the way, I've three remarks:

  1. I had to move the histogram file (in my case, index.histogram) into the index folder and rename it to ".histogram". What is written in the "How to create a new index" webpage is the opposite.

  2. I tested the 6 measures on a small benchmark of mine. I briefly compared the UMass values obtained with my index with those from the demo webpage. The values can be very different, but that is probably due to changes in the wiki pages (for instance, a topic on "pokemon go" got a high value, even though it's a dump from last May).

  3. It is impossible to calculate the measures C_A and C_V. There seems to be a bug; do you have an idea?

13:13:02.631 [main] INFO  org.aksw.palmetto.Palmetto - Read 12 from file.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 256
    at org.aksw.palmetto.calculations.direct.NormalizedLogRatioConfirmationMeasure.calculateConfirmationValues(NormalizedLogRatioConfirmationMeasure.java:61)
    at org.aksw.palmetto.vector.DirectConfirmationBasedVectorCreator.createVectors(DirectConfirmationBasedVectorCreator.java:72)
    at org.aksw.palmetto.vector.AbstractVectorCreator.getVectors(AbstractVectorCreator.java:43)
    at org.aksw.palmetto.VectorBasedCoherence.calculateCoherences(VectorBasedCoherence.java:86)
    at exe.PalmettoTest.main(PalmettoTest.java:55)
MichaelRoeder commented 7 years ago

What are the details of the index creation? Which NLP library did you use to process the documents? Did you filter stop words? Which stop word list did you use? Did you lemmatize the words?

  1. That is surprising. I just tested it with the index you sent me. There, the directory index and index.histogram are in the same directory and it is working as expected.

  2. I don't have problems with calculating c_a or c_v on my local machine with the index you provided. From the error I assume that the topics you are trying to rate create this problem. Can you please make sure that all topics have exactly the same number of words? Maybe you can provide the topics so I can take a deeper look into the problem.
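
A quick pre-check along these lines could catch topics of differing length before they are handed to Palmetto. This is only a sketch under the assumption that the topics are stored one per line with whitespace-separated words; the function names are hypothetical and nothing here is part of Palmetto itself.

```python
from collections import Counter
from typing import Dict, Iterable, List


def parse_topics(lines: Iterable[str]) -> List[List[str]]:
    """Split each non-empty line into the words of one topic."""
    return [line.split() for line in lines if line.strip()]


def sizes_histogram(topics: List[List[str]]) -> Dict[int, int]:
    """Map each topic length to the number of topics with that length."""
    return dict(Counter(len(topic) for topic in topics))


def have_equal_size(topics: List[List[str]]) -> bool:
    """Return True if all topics contain the same number of words."""
    return len({len(topic) for topic in topics}) <= 1


if __name__ == "__main__":
    # Two topics with three words and one with two words: the check fails.
    example = parse_topics(["cat dog bird", "car bus train", "sun moon"])
    print(have_equal_size(example), sizes_histogram(example))
```

If `have_equal_size` returns false, the histogram shows which lengths occur, so the shorter or longer topics can be trimmed or padded before the coherence calculation.
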

Velcin commented 7 years ago

Here are the details without using any nlp library:

For French, I hesitated between keeping and replacing accented characters (e.g., é -> e). For now, I keep the accents. If you like, you can provide the following weblink:

http://mediamining.univ-lyon2.fr/velcin/public/misc/wiki/fr_index.tar.gz

For (1), it is surprising. I confirm that, in my case, I have to move the index.histogram to index/.histogram.

For (2), you're right: if I use the same number of words for each topic, there is no problem for the two measures anymore!

Thank you for the useful help :).