Hi cgravier,
thanks for your effort to integrate Lucene 6.2.1 into Palmetto. A PR would be very helpful.
Unfortunately, the data that I used for creating the test index is not available anymore. However, the test case was not well designed and I think that I can create a better one with some other data.
The Wikipedia index that I am using was created using Lucene 4. Do you know whether Lucene 4.x.x indexes are too old to be opened with Lucene 6.2.1?
Cheers, Michael Röder
Hi Michael,
I don't think Lucene 4.x would work here; the error message indicates:
This version of Lucene only supports indexes created with release 5.0 and later.
Since this would prevent any further Lucene updates after Lucene 4.x, if you can write a better unit test, I think this is the best way to go (but with more work for you :s)
Christophe
I see. That means that I can merge the PR into master but cannot release it because it wouldn't work with my current Wikipedia index.
However, I should update the index anyway ;)
So feel free to add `@Ignore` to the test case that needs further work and create the PR.
Yes ;-)
Good point on @Ignore
That said, do you have an idea of how well the unit tests cover Lucene-related features? (If the fork passes all tests, what's the rough probability of not having broken something? ;-))
A very good question. There is one test case that makes sure that the `StandardAnalyzer` of Lucene is working as expected. More important are the two test classes in this package: https://github.com/AKSW/Palmetto/tree/master/palmetto/src/test/java/org/aksw/palmetto/corpus/lucene/creation
They create a temporary index, query it and compare it with the expected responses. From that point of view, I would argue that the failing LuceneCorpusAdapterForSlidingWindowsTest is not needed anymore because the PositionStoringLuceneIndexCreatorTest already tests the same piece of code in a better way.
OK. So far, so good:
Tests run: 427, Failures: 0, Errors: 0, Skipped: 1
All tests but LuceneCorpusAdapterForSlidingWindowsTest are successful 👍
I made several changes in different files, but I wasn't familiar with the Lucene API so far, especially 6.2.1, so I hope other untested things aren't screwed up. Anyway, I can surely prepare a PR.
So far, I bumped the Palmetto version to `<version>0.1.2</version>` (I wanted to keep track of the version I was working with). Is this acceptable, or do you prefer a PR with version `0.1.1`, with you making a new commit to tag the new version?
In the meantime, I also bumped Java to `<java.version>1.8</java.version>` in the `pom.xml` (I had some Java version issues). Do you wish to take the opportunity to update this as well, or should I PR with 1.7?
Version 0.1.2 and Java 1.8 are fine. Thanks!
Changes merged (#9).
Thanks again for your efforts.
It's me who has to thank you. We are academic researchers who are happy to have come across the Palmetto implementation, and contributing when possible is natural.
Thanks!
Now that Palmetto works with Lucene 6.2.1 (thank you Christophe!), I need to update the index based on Wikipedia (in English, but later for other languages, such as French).
Is there anything I should know before processing the huge wikipedia file? I mean in the way Palmetto uses the lucene index?
Julien
1) You will have to preprocess the Wikipedia dump, i.e., remove all markup and the like (but I think that you are aware of that). This can be a very annoying task since some markup can become very complicated. We decided to simply remove all markup (including tables). Additionally, you should remove pages that are not helpful for calculating the coherences, as described in the paper:
All corpora as well as the complete Wikipedia used as reference corpus are preprocessed using lemmatization and stop word removal. Additionally, we removed portal and category articles, redirection and disambiguation pages as well as articles about single years.
2) You have to decide whether you would like to use a lemmatizer in your text preprocessing or not. This is an important decision because it has a direct influence on the cooccurrence of your words, so it depends on the preprocessing you are using for your topic modeling algorithm. If you are running topic modeling on lemmatized words, the index should contain lemmatized forms as well. If you are not using a lemmatizer, you shouldn't use one for the index creation.
3) In most cases you will have to create an index that contains the positions of the words inside the documents. This is needed to use more complex coherences like UCI or NPMI. The creation of the index is described in this wiki article (a rough sketch follows below).
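A minimal sketch, assuming Lucene 6.x: the field name "text" and the index path are placeholders, and Palmetto's own creation classes in the `corpus.lucene.creation` package are the authoritative reference.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class PositionIndexSketch {
    public static void main(String[] args) throws Exception {
        // A field type that records term positions inside the documents,
        // which the window-based coherences (e.g., UCI, NPMI) rely on.
        FieldType textWithPositions = new FieldType();
        textWithPositions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        textWithPositions.setTokenized(true);
        textWithPositions.freeze();

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("wikipedia_index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new Field("text", "preprocessed article text ...", textWithPositions));
            writer.addDocument(doc);
        }
    }
}
```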
Feel free to ask questions if you encounter problems.
If you are creating new indexes, you might want to share them with the community, as van der Zwaan et al. did, who gave me a Dutch index to host.
Thank you Michael! I'll give it a try very soon. In order to help me save time, do you have some advice on handling the very huge Wikipedia dump (~56 GB unzipped), extracting the needed information and building the Lucene index?
I separated the dump into smaller parts of 10,000 documents per file. This has the advantage that you can a) process the data on several machines in parallel and b) have better control over which parts of the dump have been processed in case the preprocessing crashes. Additionally, I used a pipeline comprising multiple single steps that were executed as separate programs and stored the intermediate results. This makes the complete processing slower because you have to start the single steps manually, but I think that it was a good idea because I was able to react to problems; e.g., if the lemmatizer encountered a problem, I could adapt this step and restart it without influencing the other parts of the pipeline. I think that might not have been as easy if the complete preprocessing had been written as one single program.
I think the general preprocessing workflow is straightforward: extract the articles from the dump, remove the markup, filter out the unhelpful pages, apply lemmatization and stop word removal if you decided to use them, and create the index. A naive sketch of the splitting step follows below.
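A sketch of the splitting step, under some assumptions: the file names are placeholders, it relies on `<page>` and `</page>` appearing on their own lines (as they do in the official dumps), and the part files are XML fragments rather than complete documents.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class DumpSplitter {
    public static void main(String[] args) throws IOException {
        final int pagesPerPart = 10000;
        int pageCount = 0, part = 0;
        Writer out = null;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("enwiki-pages-articles.xml"), StandardCharsets.UTF_8))) {
            String line;
            boolean inPage = false;
            while ((line = in.readLine()) != null) {
                if (line.trim().startsWith("<page>")) {
                    // Start a new part file every 10,000 pages.
                    if (pageCount % pagesPerPart == 0) {
                        if (out != null) {
                            out.close();
                        }
                        out = new BufferedWriter(new OutputStreamWriter(
                                new FileOutputStream("part-" + (part++) + ".xml"),
                                StandardCharsets.UTF_8));
                    }
                    inPage = true;
                    ++pageCount;
                }
                if (inPage) {
                    out.write(line);
                    out.write('\n');
                }
                if (line.trim().startsWith("</page>")) {
                    inPage = false;
                }
            }
        }
        if (out != null) {
            out.close();
        }
    }
}
```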
Thank you again for the quickness of your answer. I'll test this very soon and let you know. I'm just wondering how much of my lifetime those operations will take :).
I've found the following script: https://github.com/attardi/wikiextractor It works very well for splitting the big XML file into manageable files. However, I don't know how to filter out category, portal, redirect... pages from the extracted docs. Any idea?
I tried to find the piece of code that did the filtering to give you a complete list of title prefixes, but couldn't find it until now. Unfortunately, I am busy with another project at the moment and cannot invest much time in that.
However, filtering is pretty easy. First, you should check the title, because there are prefixes that show that the page should be filtered, like `Category:`, `Wikipedia:`, `Portal:` and so on.
The years contain only numbers (and maybe a `BC` at the end of the title, e.g., `1 BC`).
Redirects can be found in two ways. First, an article can start with `#REDIRECT` and the title it is redirecting to. Second, the XML schema that is used for the Wikipedia dumps contains the information whether an article is a redirect.
There is a list of portals https://en.wikipedia.org/wiki/Portal:Contents/Portals and a list of categories https://en.wikipedia.org/wiki/Portal:Contents/Categories
I think they might be very useful. However, the prefix-based approach described above is easier to implement and should work as well. A sketch of such a filter is shown below.
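A heuristic filter based on the rules above; note that the prefixes beyond `Category:`, `Wikipedia:` and `Portal:` are assumptions covered by the "and so on", and that disambiguation pages would need an additional rule.

```java
import java.util.regex.Pattern;

public class PageFilter {
    // Namespace prefixes that mark pages to be filtered out. The first three
    // are the ones named above; the others are assumptions ("and so on").
    private static final String[] SKIP_PREFIXES = {
            "Category:", "Wikipedia:", "Portal:", "Template:", "File:", "Help:"};
    // Titles of articles about single years, e.g., "490" or "1 BC".
    private static final Pattern YEAR_TITLE = Pattern.compile("\\d+( BC)?");

    public static boolean shouldSkip(String title, String text) {
        for (String prefix : SKIP_PREFIXES) {
            if (title.startsWith(prefix)) {
                return true;
            }
        }
        if (YEAR_TITLE.matcher(title).matches()) {
            return true;
        }
        // Redirect pages start with #REDIRECT and the target title.
        return text.trim().regionMatches(true, 0, "#REDIRECT", 0, 9);
    }
}
```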
Thank you so much. Therefore, I guess I can simply post-process the output of wikiextractor based on the titles only, with both regular expressions and the lists of portals and categories. Unfortunately, there is no possibility to access the XML tags with this script. I plan to test this very soon and will let you know.
Here I am: I succeeded in getting the right Wikipedia dumps, splitting them, cleaning them and indexing them. Thank you for your help! I suggest that you update the weblink to the following URL: http://mediamining.univ-lyon2.fr/velcin/public/misc/wiki/en_index.tar.gz (it's calculated from a dump that dates back to the 1st of May 2016)
By the way, I have three remarks:
1) I had to move the histogram file (in my case, `index.histogram`) into the index folder and rename it to `.histogram`. What is written on the "How to create a new index" webpage is the opposite.
2) I tested the 6 measures on a small benchmark of mine. I briefly compared the UMass values obtained with my index with the ones obtained on the demo webpage. The values can be very different, but it's probably due to the wiki pages having changed (for instance, a topic on "pokemon go" got a high value, even though it's a dump from last May).
3) It was impossible to calculate the measures C_A and C_V. It seems there is some bug; if you have an idea...
13:13:02.631 [main] INFO org.aksw.palmetto.Palmetto - Read 12 from file.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 256
at org.aksw.palmetto.calculations.direct.NormalizedLogRatioConfirmationMeasure.calculateConfirmationValues(NormalizedLogRatioConfirmationMeasure.java:61)
at org.aksw.palmetto.vector.DirectConfirmationBasedVectorCreator.createVectors(DirectConfirmationBasedVectorCreator.java:72)
at org.aksw.palmetto.vector.AbstractVectorCreator.getVectors(AbstractVectorCreator.java:43)
at org.aksw.palmetto.VectorBasedCoherence.calculateCoherences(VectorBasedCoherence.java:86)
at exe.PalmettoTest.main(PalmettoTest.java:55)
What are the details of the index creation? Which NLP library did you use to process the documents? Did you filter stop words? Which stop word list did you use? Did you lemmatize the words?
That is surprising. I just tested it with the index you sent me. There, the directory `index` and `index.histogram` are in the same directory and it is working as expected.
I don't have problems with calculating C_A or C_V on my local machine with the index you provided. From the error I assume that the topics you are trying to rate create this problem. Can you please make sure that all topics have exactly the same number of words? Maybe you can provide the topics so I can take a deeper look into the problem. A quick sanity check is sketched below.
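For example, judging from the error, something like this before handing the topics over (a hypothetical helper, not part of Palmetto's API):

```java
public class TopicSizeCheck {
    /** Fails fast if the topics do not all have the same number of words. */
    public static void checkTopicSizes(String[][] topics) {
        int expected = topics[0].length;
        for (int i = 1; i < topics.length; i++) {
            if (topics[i].length != expected) {
                throw new IllegalArgumentException("Topic " + i + " has "
                        + topics[i].length + " words, expected " + expected);
            }
        }
    }
}
```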
Here are the details: I did not use any NLP library. For French, I hesitated between keeping and replacing accented characters (e.g., é -> e). For now, I keep the accents. If you like, you can provide the following weblink:
http://mediamining.univ-lyon2.fr/velcin/public/misc/wiki/fr_index.tar.gz
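In case someone prefers the other option, accents can be stripped with the JDK alone. A small sketch (not what I used, since I kept the accents):

```java
import java.text.Normalizer;

public class AccentStripper {
    // Decompose to NFD and drop the combining diacritical marks,
    // so that "é" becomes "e".
    public static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("déjà où")); // prints "deja ou"
    }
}
```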
For (1), it is surprising. I confirm that, in my case, I have to move `index.histogram` to `index/.histogram`.
For (2), you're right: if I use the same number of words for each topic, there is no problem for the two measures anymore!
Thank you for the useful help :).
Hello,
I am trying to upgrade Palmetto to Lucene 6.2.1. The current Lucene version clashes with another Lucene version in our classpath, and, well, that's an opportunity to suggest a PR, if welcome.
I already made some changes in a new branch in my fork (in progress): https://github.com/cgravier/Palmetto/tree/lucene621
When I try to run `mvn clean install`, I get the trace below. It seems to me that the test resource file `src/test/resources/test_bd/segments_w` had been created with a Lucene 3.x version. I didn't find any subroutine to recreate those resource files (with Lucene 6.x, for instance). Is it possible to actually reconstruct those files?
Thank you in advance,
cgravier