epfl-dlab / quootstrap

Unsupervised method for extracting quotation-speaker pairs from large news corpora.

Extremely low recall #1

Open mkuymkuy opened 5 years ago

mkuymkuy commented 5 years ago

Thank you for your code. I built a local copy and implemented my own data loader, and the code runs smoothly. Since Freebase is old and deprecated, I removed it from the code.

Input: about 440m tokenized news articles from 1-2 weeks of US English news. I did not use the Spinn3r data since it is too large.

Output: about 97,000 quotation-name pairs were extracted.

However, according to my own statistics on a small sample of about 100 input articles, there are on average 2 quotations per article. The recall of quootstrap is much lower than I expected. Is this normal?

Some guesses:

  1. Should I run the code on Spinn3r to get discoveredPatterns.txt, replace seedPattern.txt with it, and then run the code with the new seedPattern.txt against my own dataset? (This assumes that the Spinn3r data contains all possible patterns found in news data. I doubt this assumption.)
  2. I lowered the pattern confidence threshold to 0 and changed M (a pattern that extracts fewer than M = 5 pairs is discarded) to 1. The output is larger, but the quality dropped drastically.
  3. I also raised the number of iterations to 200. However, usually no new patterns are discovered after 3-4 iterations, so more iterations do not help.
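
For readers following along, the two thresholds in point 2 amount to a filter of roughly the following shape. This is only a minimal Python sketch, not the project's code: the `Pattern` record, `filter_patterns`, and the sample values are hypothetical, and the confidence is taken as a given number rather than computed.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    text: str          # placeholder label, not real pattern syntax
    confidence: float  # assumed to lie in [0, 1]
    num_pairs: int     # number of (quotation, speaker) pairs it extracted


def filter_patterns(patterns, min_confidence, min_pairs):
    """Keep only patterns that meet both thresholds."""
    return [
        p for p in patterns
        if p.confidence >= min_confidence and p.num_pairs >= min_pairs
    ]


# Made-up candidates, for illustration only.
candidates = [
    Pattern("pattern A", 0.90, 40),
    Pattern("pattern B", 0.30, 2),
    Pattern("pattern C", 0.05, 1),
]

strict = filter_patterns(candidates, min_confidence=0.5, min_pairs=5)
loose = filter_patterns(candidates, min_confidence=0.0, min_pairs=1)
```

With the strict settings only the well-supported pattern survives. With a confidence threshold of 0 and M = 1 every candidate passes, which is consistent with the larger but noisier output described above.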

Feel free to point out any critical points I missed.

Thanks!

dariopavllo commented 5 years ago

You should take a look at the logs (pre and post-clustered patterns at different iterations) to see if the algorithm is behaving as expected. As we point out in the paper, 4-5 iterations are enough since the algorithm converges past that point. More seed patterns would certainly help, but I don't think that's the main problem.

I would try/check the following:

Let me know if this helps.

mkuymkuy commented 5 years ago

Hi Dario,

Thank you for your quick response.

You can find my code changes in https://github.com/mkuymkuy/quootstrap

For Freebase, I simply removed it and retain the speaker as extracted from the article, instead of the speaker name, to keep as many candidates as possible. The purpose is to retain all candidate even if some of them are "he", "she", etc. I can post process them.

Please note that these 2 files were generated after the larger seedPattern.txt was applied.

Here is nextPatternsPreClustering0.txt: https://pastebin.com/T5ZSA95F (755 patterns)

Here is nextPatternsPostClustering0.txt: https://pastebin.com/qygznxK4 (170 patterns left)

I see that the average confidence is not very high.

I must correct the article count: there are 441208 articles in the input data, and the file size is about 1.2G. Considering the data size, is the number of patterns in the first iteration expected?

Thank you very much.

mkuymkuy commented 5 years ago

I am also wondering how to evaluate new data on top of this model. My first thought was to train the model on a very large dataset like Spinn3r, generate enough patterns, and extract pairs with these patterns only. But the model assumes that the same quote appears in multiple articles, whereas most quotes in a small evaluation set will most likely appear only once, so the extracted pairs might not be good enough. On the other hand, merging every small evaluation set with the original huge dataset, re-running the model, and joining the results would be too costly.

To minimize the cost, my gut feeling is like,

Even so, the new data might still face the single quote issue. Can you provide some pointers on the evaluation step?

Thanks!

dariopavllo commented 5 years ago

I had a look at your code & data. The implementation of your data loader seems correct -- just make sure that the article IDs are unique (otherwise that could cause side effects).
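
A quick way to check the uniqueness of article IDs before running the pipeline could look like the sketch below. It is written in Python for brevity and is not part of quootstrap; `find_duplicate_ids` and the sample IDs are hypothetical, and it assumes you can dump the IDs your loader produces into a list.

```python
from collections import Counter


def find_duplicate_ids(article_ids):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(article_ids)
    return sorted(id_ for id_, n in counts.items() if n > 1)


# Made-up IDs, for illustration only.
ids = ["a1", "a2", "a3", "a2"]
duplicates = find_duplicate_ids(ids)
```

An empty result means every ID is unique; anything else lists the IDs to fix in the loader.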

Your dataset is much smaller than Spinn3r (441k vs 3.8M after deduplication), so this might play a role. You get fewer than 1k patterns (pre-clustering), whereas this figure should be in the order of thousands.

I think you have a problem with named-entity recognition for detecting people's names. I couldn't understand what you exactly did with Freebase, but ideally you should provide a database of names which will be used to detect people in articles. You said "The purpose is to retain all candidate even if some of them are "he", "she", etc. I can post process them.", so I guess you modified the code (or the people dataset) to detect coreferences. This would break the algorithm, and probably explains why you get a very low pattern confidence after clustering. For instance, if you extract a pair (Q="Hello!", S=she), the algorithm will try to match "Hello!" to "she" in other articles, which is obviously wrong because the quote may appear with the full name.
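
To make the point concrete, here is a minimal Python sketch of keeping pronoun speakers out of the extracted pairs. It is an illustration under assumptions, not the project's code: the `PRONOUNS` set, `is_named_speaker`, and the sample pairs are hypothetical, and a real setup would rely on the people database rather than a hard-coded word list.

```python
# Hypothetical, deliberately small pronoun list.
PRONOUNS = {"he", "she", "they", "it", "i", "we", "you"}


def is_named_speaker(speaker):
    """True if the speaker string contains at least one non-pronoun token."""
    tokens = speaker.lower().split()
    return bool(tokens) and not all(t in PRONOUNS for t in tokens)


# Made-up (quotation, speaker) pairs, for illustration only.
pairs = [
    ("Hello!", "she"),
    ("Hello!", "Jane Doe"),
    ("We will win.", "He"),
]

named_pairs = [(q, s) for q, s in pairs if is_named_speaker(s)]
```

Only the pair with a full name is kept, so the quotation is never matched against "she" or "He" in other articles.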

If you want to implement coreference resolution, that would be very nice, but you would have to implement it inside the bootstrapping loop (e.g. right after the pattern extraction step or before the pattern clustering step).

Regarding inference on new data, you got the two basic ideas right. A costly but thorough method would be to merge the old data with the new data and re-run the algorithm from scratch. If you want to obtain results faster, you can just put all the discovered patterns in seedPatterns, and run the algorithm with iterations = 1. Of course, the single quote issue is always possible. Quootstrap doesn't have perfect recall, but the rationale here is that we have higher recall for redundant quotes, which means that they are more likely to be interesting (e.g. quotes by politicians). A way to boost recall at the expense of precision would be to implement coreference resolution.
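
The faster option can be prepared with a small script like the one below. This is a hedged Python sketch rather than a supported tool: `merge_pattern_files` is a hypothetical helper, and it assumes that the seed file and the discovered-patterns file both hold one pattern per line in the same format, which you should verify against your own output files first.

```python
from pathlib import Path


def merge_pattern_files(seed_path, discovered_path, out_path):
    """Write the union of both pattern files to out_path.

    Assumes one pattern per line. Order is preserved (seed patterns
    first) and duplicates and blank lines are dropped.
    Returns the number of patterns written.
    """
    seen = set()
    merged = []
    for path in (seed_path, discovered_path):
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            pattern = line.strip()
            if pattern and pattern not in seen:
                seen.add(pattern)
                merged.append(pattern)
    Path(out_path).write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(merged)
```

After writing the merged file, point the seed patterns at it and set the number of iterations to 1 in whatever way your setup configures it.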

mkuymkuy commented 5 years ago

Thank you for your comment. After your input, here is what I have done,

Thanks.