NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

Strange behavior for my own data - is it overfitting? #774

Closed datistiquo closed 5 years ago

datistiquo commented 5 years ago

Hey!

I don't know if my issue is due to overfitting, because I have relatively little data, around 20,000 training samples. I do IR with single sentences (checking whether they match), so each query is tested against all documents I have. I hope I can get some advice before I test my data with the ARC models. So far I use a siamese CNN (representation-based) with a dropout layer at the end. This model is very good on the training data and also generalises reasonably well (some queries work, some don't), which I could live with.

But the strange thing (if it is strange at all?) is that if the query is just an empty string, it gets very high confidence (mostly 1!) with several documents!

Also, if the query is just a single word that is totally out of scope (not in the training data, and the word does not occur in any of the documents!), it also gets high confidence with some documents.

Has anyone had a similar experience, or any thoughts on what is going on here?

uduse commented 5 years ago

I think this is normal. When you have testing data that are completely out of the scope of your training data, the behavior of your network is pretty much random.

datistiquo commented 5 years ago

Yes, but why this behavior for an empty string? I don't actually understand why this even gives a result, since one side of the siamese network has no word vectors at all. So even for a perfectly trained model with no overfitting, would ARC give such strange results for an empty query?

How could I prevent unrelated words (which are not contained in my docs), e.g. words from another domain, from getting high confidence with any of my docs?

uduse commented 5 years ago

Think about it this way: your network starts with full confidence, but whenever it sees something dubious, it decreases its confidence. You did not provide "nothing is provided" as a dubious scenario (i.e. a training sample with an empty string), so the network sees nothing wrong with it.

How could I prevent unrelated words (which are not contained in my docs), e.g. words from another domain, from getting high confidence with any of my docs?

Simple answer: you can't. The unknown area (not covered by training samples) is so large that anything can happen there. If you could define specific behavior (e.g. low confidence) for the unknowns, that would mean you had separated the known from the unknown, which also means overfitting, since your model would now operate only in known areas. If you really want a band-aid solution, try adding a classifier that only lets through queries that are at least somewhat similar to your training data (positive: any training sample; negative: randomly generated strings, random strings from Wikipedia, etc.).
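For concreteness, a rough sketch of such a filter (this is not MatchZoo code; the model choice, names, and the tiny example lists are placeholders):

```python
# Rough sketch of the "band-aid" filter: a binary in-scope / out-of-scope
# classifier that runs before the matching model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# positives: queries from your own training data
in_scope = ["i need information about xy", "info about xy"]  # ... your ~20k queries
# negatives: random strings, sentences from other domains, random Wikipedia text, etc.
out_of_scope = ["qwerty asdf zxcv", "the weather in paris is mild in spring"]

X = in_scope + out_of_scope
y = [1] * len(in_scope) + [0] * len(out_of_scope)

scope_filter = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
scope_filter.fit(X, y)

def is_in_scope(query, threshold=0.5):
    """Only queries the filter accepts get passed on to the matching model."""
    return scope_filter.predict_proba([query])[0][1] >= threshold
```

With something like this in front, an empty string or a single out-of-domain word would simply never reach the matching model.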

datistiquo commented 5 years ago

Thanks but I did not know that:

Think about it this way: your network starts with full confidence

Why doesn't it start with zero confidence? Isn't the model actually quite naive if it starts with full confidence? Is that the case for all MatchZoo models?

I thought it learns patterns for when something matches, so an unknown word vector that has no similarity with any doc should give low confidence. That was my intuition...

Would this actually also be the case if I used word-vector information like cosine similarity or euclidean distance as features and just trained a simple neural network or XGBoost?
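Just to make clear what I mean, roughly something like this (only a sketch; the embedding lookup and the classifier are placeholders):

```python
# Sketch: hand-crafted word-vector features instead of a deep matching model.
# `embeddings` (a dict of word -> vector) and the classifier are placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost

def avg_vector(tokens, embeddings, dim=300):
    """Average the vectors of the tokens that are in the vocabulary."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def pair_features(query, doc, embeddings):
    q = avg_vector(query.lower().split(), embeddings)
    d = avg_vector(doc.lower().split(), embeddings)
    cosine = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-8)
    euclidean = np.linalg.norm(q - d)
    return [cosine, euclidean]

# X = [pair_features(q, d, embeddings) for q, d, _ in data]
# y = [label for _, _, label in data]  # 1 = match, 0 = no match
# clf = GradientBoostingClassifier().fit(X, y)
```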

If I trained a classifier with specific topics, wouldn't it give an out-of-scope word low confidence (since the word does not belong to any of the topics)? I thought that was my situation.

uduse commented 5 years ago

It's just a way of explaining why this happens. The point is that your network could behave randomly on out of scope data, and there's no way you can predict that. However, given its seemingly random behavior on a specific datum, there's maybe a way to explain it. That's why the amount of data is important. The more you get, the more predictable your model is.

If I trained a classifier with specific topics, wouldn't it give an out-of-scope word low confidence (since the word does not belong to any of the topics)? I thought that was my situation.

I don't understand this part.

datistiquo commented 5 years ago

So you are really saying that it is not possible, even for a perfectly trained model that generalizes well, to handle out-of-scope data with low confidence?

So I should of course add some "dubious data" as negative examples?

I don't understand this part.

E.g. with 3 topics, wouldn't a sentence not related to any of those 3 topics get low confidence?

uduse commented 5 years ago

"perfectly trained" is relative to your training data, and "generalization" is relative to your test data. They are both "known" to some extent. Once you cover some of the unknown with dubious data, then that part of the unknown is now known but the others left unknown unless you can cover all types of unknown data, which is usually impossible in NLP.

datistiquo commented 5 years ago

Sorry for bothering you with this "out of scope" subject.

Are your statements true for any deep learning IR model (high confidence for unrelated queries)?

I hoped that my model would learn patterns that give low confidence. For example, a training pair like the one below:

I need information about XY, info about XY, 1

is trained and works

But now I notice that for a query consisting only of the word

need

it gets very high confidence with several random docs in which the word itself does not even appear!

I train on several sentences focused on entities of my domain.

Let's not focus on out-of-scope data. Is this due to overfitting? If the word "need" only occurs in context, as in the first sentence above, and is never trained as a single word or sentence fragment in a positive sample, would any well-trained model show this issue? I would rather not have to train fragments of sentences without any entity, like the one above, as negative examples. Any ideas?

uduse commented 5 years ago

Then yes, it's overfitting. If your training samples containing the word "need" are all positive, or share some other statistical feature, then it is perfectly normal for your network to judge purely by the presence of the word "need".
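A quick way to check whether that is happening in your data (just a sketch, assuming your training pairs are available as (query, doc, label) tuples):

```python
# Quick sanity check for this kind of shortcut learning (sketch only).
from collections import Counter

def label_distribution(pairs, word):
    """Label counts over all training pairs whose query contains `word`."""
    return Counter(label for query, _, label in pairs
                   if word in query.lower().split())

# label_distribution(train_pairs, "need")
# A heavily skewed result (e.g. nearly all labels are 1) means the word alone
# is predictive in your data, and the network can latch onto it.
```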

aszhanghuali commented 5 years ago

@datistiquo Hello, I see that you are also doing text retrieval. Could I have a look at your data format? I don't know how to handle the data format. Thank you! Looking forward to your reply!