Extremely low recall - Githubissues

mkuymkuy commented 6 years ago

Thank you for your code. I built a local copy of the code and implemented my own data loader. The code runs smoothly. Since Freebase is deprecated and outdated, I removed it from the code.

Input: about 440m tokenized news articles from 1-2 weeks of US English news. I did not use the Spinn3r data since it is too large.
Output: about 97,000 quotation-name pairs were extracted.

However, according to my own statistics on a small sample of the input (about 100 articles), there are on average 2 quotations per article. The recall of Quootstrap is far lower than I expected. Is this normal?

Some guesses:

Should I run the code on Spinn3r to get discoveredPatterns.txt, replace seedPatterns.txt with it, and then run the code with the new seedPatterns.txt against my own dataset? (This method assumes that the Spinn3r data contains all patterns that can occur in news data. I doubt this assumption.)
I lowered the pattern confidence threshold to 0 and changed M from 5 to 1 (a pattern that extracts fewer than M pairs is discarded). The output is larger, but the quality dropped drastically.
I also raised the number of iterations to 200. But during the run, usually no new patterns are discovered after 3-4 iterations, so more iterations do not help.

Feel free to point out if I missed any critical points.

Thanks!

dariopavllo commented 6 years ago

You should take a look at the logs (pre and post-clustered patterns at different iterations) to see if the algorithm is behaving as expected. As we point out in the paper, 4-5 iterations are enough since the algorithm converges past that point. More seed patterns would certainly help, but I don't think that's the main problem.

I would try/check the following:

Since you got rid of Freebase, I assume that you replaced it with something equivalent to get the list of names. Make sure that everything is implemented correctly and that you have enough names in your database. Names that aren't in the database won't be extracted, and that would also negatively affect the bootstrapping procedure.
The most critical step is the first iteration. Have a look at nextPatternsPreClustering0.txt to see if there are many candidate patterns (as expected). Then take a look at nextPatternsPostClustering0.txt to see how many of these are actually retained (and their average confidence value). If you don't mind, you can paste them somewhere (e.g. pastebin) and I'll be happy to take a look at them.
Case sensitivity might play a role, depending on how your dataset has been extracted. Try to set CASE_SENSITIVE=false in the config to see if this changes anything.
I would not set the pattern confidence below 0.7, otherwise the extracted patterns become nonsense. Also, the fact that you set M = 1 makes me think that there are few extracted patterns.
If everything is set up correctly, maybe the data is to blame. If the dataset is not redundant enough (or if it's not large enough), Quootstrap won't infer enough patterns for the next iteration. In this case increasing the number of seed patterns would help. Try this larger set: https://github.com/epfl-dlab/quootstrap/blob/87fb623932e4cee5955433c9342adacf02605306/seedPatterns.txt
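As a rough illustration of the two thresholds discussed in this thread (a minimum pattern confidence of 0.7, and a minimum pair count M of 5), the filtering step might look like the sketch below. The `Pattern` type, its field names, `filter_patterns`, and the example pattern strings are all hypothetical; they are not taken from the Quootstrap code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    # Hypothetical representation of a pattern; not Quootstrap's own type.
    text: str          # the pattern string
    confidence: float  # confidence assigned to the pattern, in [0, 1]
    num_pairs: int     # number of (quotation, speaker) pairs it extracted


def filter_patterns(patterns, min_confidence=0.7, min_pairs=5):
    """Keep only the patterns that meet both thresholds."""
    return [
        p for p in patterns
        if p.confidence >= min_confidence and p.num_pairs >= min_pairs
    ]


# Made-up candidates, for illustration only.
candidates = [
    Pattern("pattern A", 0.95, 120),  # confident and frequent
    Pattern("pattern B", 0.40, 300),  # frequent but unreliable
    Pattern("pattern C", 0.90, 2),    # confident but rare
]

strict = filter_patterns(candidates)
loose = filter_patterns(candidates, min_confidence=0.0, min_pairs=1)
print(len(strict), len(loose))
```

With the strict settings only the first candidate survives; with confidence 0 and M = 1 every candidate is kept, which is consistent with the larger but lower-quality output described in the original post.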

Let me know if this helps.

mkuymkuy commented 6 years ago

Hi Dario,

Thank you for your quick response.

You can find my code changes in https://github.com/mkuymkuy/quootstrap

For Freebase, I just got rid of it and retain the speaker as extracted from the article instead of the speaker name, to keep as many candidates as possible. The purpose is to retain all candidate even if some of them are "he", "she", etc. I can post process them.

Please note that these 2 files were generated after the larger seedPatterns.txt was applied.
Here is nextPatternsPreClustering0.txt: https://pastebin.com/T5ZSA95F (755 patterns)
Here is nextPatternsPostClustering0.txt: https://pastebin.com/qygznxK4 (170 patterns left)
I see that the average confidence is not very high.

I must correct the article count: there are 441,208 articles in the input data. The file size is about 1.2 GB. Considering the data size, is the number of patterns in the first iteration expected?
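For reference, the figures quoted so far can be turned into a back-of-the-envelope recall estimate. This is only a sketch: it assumes the 97,000 extracted pairs refer to the same 441,208-article corpus, that the 2-quotations-per-article average from the 100-article sample holds for the whole corpus, and that each pair corresponds to one quotation occurrence. The pattern counts come from a later run with the larger seed set.

```python
# Figures reported in this thread.
num_articles = 441_208      # corrected article count
quotes_per_article = 2      # average from a manual sample of about 100 articles
extracted_pairs = 97_000    # approximate number of extracted pairs

expected_quotes = num_articles * quotes_per_article
estimated_recall = extracted_pairs / expected_quotes

# Share of first-iteration candidate patterns that survive clustering.
pre_clustering = 755
post_clustering = 170
retention = post_clustering / pre_clustering

print(f"expected quotations: {expected_quotes}")
print(f"estimated recall: {estimated_recall:.1%}")
print(f"patterns retained after clustering: {retention:.1%}")
```

Under these assumptions the estimate comes out at roughly 11% recall, with about 22.5% of the candidate patterns retained after clustering.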

Thank you very much.

mkuymkuy commented 6 years ago

I am also wondering how to evaluate new data on top of this model. My first thought was to train the model on a very large dataset like Spinn3r, generate enough patterns, and extract pairs with these patterns only. But since the model assumes that the same quote appears in multiple articles, most quotes in a small evaluation set will likely appear only once, so the extracted pairs might not be good enough. On the other hand, merging every small evaluation set with the original huge dataset, re-running the model, and joining the results would be too costly.

To minimize the cost, my gut feeling is to:

train the model on a very large dataset with enough diversity,
set the discovered patterns as seedPatterns for future use,
every time I get a new evaluation set, merge it with the huge training data and run the model with the number of iterations set to 1.

Even so, the new data might still face the single-quote issue. Can you provide some pointers on the evaluation step?

Thanks!

dariopavllo commented 6 years ago

I had a look at your code & data. The implementation of your data loader seems correct -- just make sure that the article IDs are unique (otherwise that could cause side effects).
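A quick way to check the uniqueness of article IDs before running the pipeline is sketched below. The function name and the example IDs are hypothetical; how the IDs are obtained depends on your own data loader.

```python
from collections import Counter


def find_duplicate_ids(article_ids):
    """Return a dict mapping each ID that occurs more than once to its count."""
    counts = Counter(article_ids)
    return {article_id: n for article_id, n in counts.items() if n > 1}


# Made-up IDs for illustration; in practice these would come from the data loader.
ids = ["a1", "a2", "a3", "a2", "a4", "a1", "a2"]
duplicates = find_duplicate_ids(ids)
print(duplicates)
```

An empty result means every article ID is unique.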

Your dataset is much smaller than Spinn3r (441k vs 3.8M after deduplication), so this might play a role. You get fewer than 1k patterns (pre-clustering), whereas this figure should be on the order of thousands.

I think you have a problem with named-entity recognition for detecting people's names. I couldn't understand what you exactly did with Freebase, but ideally you should provide a database of names which will be used to detect people in articles. You said "The purpose is to retain all candidate even if some of them are "he", "she", etc. I can post process them.", so I guess you modified the code (or the people dataset) to detect coreferences. This would break the algorithm, and probably explains why you get a very low pattern confidence after clustering. For instance, if you extract a pair (Q="Hello!", S=she), the algorithm will try to match "Hello!" to "she" in other articles, which is obviously wrong because the quote may appear with the full name.
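The role of the name database can be illustrated with a minimal sketch: only pairs whose speaker appears in the database are kept, so pronouns such as "she" never enter the bootstrapping loop. The function, the exact-string matching, and the example names are assumptions made for illustration and do not reflect how Quootstrap matches names internally.

```python
def keep_named_speakers(pairs, known_names):
    """Keep only (quotation, speaker) pairs whose speaker is in the name database."""
    return [(quote, speaker) for quote, speaker in pairs if speaker in known_names]


# Hypothetical name database and candidate pairs.
known_names = {"Angela Merkel", "Barack Obama"}
pairs = [
    ("Hello!", "she"),
    ("Hello!", "Angela Merkel"),
    ("Yes we can.", "Barack Obama"),
    ("No comment.", "he"),
]

kept = keep_named_speakers(pairs, known_names)
print(kept)
```

Here the pair ("Hello!", "she") is dropped, while the same quotation attributed to a full name is kept and can be matched across articles.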

If you want to implement coreference resolution, that would be very nice, but you would have to implement it inside the bootstrapping loop (e.g. right after the pattern extraction step or before the pattern clustering step).

Regarding inference on new data, you got the two basic ideas right. A costly but thorough method would be to merge the old data with the new data and re-run the algorithm from scratch. If you want to obtain results faster, you can just put all the discovered patterns in seedPatterns, and run the algorithm with iterations = 1. Of course, the single quote issue is always possible. Quootstrap doesn't have perfect recall, but the rationale here is that we have higher recall for redundant quotes, which means that they are more likely to be interesting (e.g. quotes by politicians). A way to boost recall at the expense of precision would be to implement coreference resolution.
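The faster approach (reusing discovered patterns as seeds) could be prepared with a small helper like the one below. It is a sketch that assumes both files contain one pattern per line and that merging with de-duplication is acceptable; the helper name and the placeholder pattern strings are hypothetical. The setting that controls the number of iterations is not shown because its name is not given in this thread.

```python
import tempfile
from pathlib import Path


def merge_pattern_files(seed_path, discovered_path, output_path):
    """Write the union of two pattern files (one pattern per line), keeping order."""
    merged = []
    seen = set()
    for path in (seed_path, discovered_path):
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            pattern = line.strip()
            if pattern and pattern not in seen:
                seen.add(pattern)
                merged.append(pattern)
    Path(output_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)


# Demo with temporary files and placeholder pattern strings.
with tempfile.TemporaryDirectory() as tmp:
    seed = Path(tmp) / "seedPatterns.txt"
    discovered = Path(tmp) / "discoveredPatterns.txt"
    output = Path(tmp) / "mergedSeedPatterns.txt"
    seed.write_text("pattern A\npattern B\n", encoding="utf-8")
    discovered.write_text("pattern B\npattern C\n\n", encoding="utf-8")

    merged_count = merge_pattern_files(seed, discovered, output)
    merged_lines = output.read_text(encoding="utf-8").splitlines()

print(merged_count, merged_lines)
```

The merged file would then be used as the seed pattern file for a single-iteration run on the new data.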

mkuymkuy commented 5 years ago

Thank you for your comment. Following your input, here is what I have done:

To get all the named entities, I implemented a step that preprocesses the whole corpus and extracts all the names in the articles as the new name database. Then I ran the program on top of that.
After that, the coverage looks better. I got about 1.5 million unique quotes from 12 million passages. This is a good improvement, but I still found many easy cases in some articles that the algorithm does not extract, and the reason is not clear.
Due to limited time and resources, I decided to stop exploring the method. But the idea of your paper still helped me a lot.

Thanks.

epfl-dlab / quootstrap

Extremely low recall #1