Closed: prhbrt closed this issue 7 years ago.
This model can't distinguish actual Dutch from something close to Dutch
True! The model serves as a weak filter that prevents the worst spam, e.g. in Dutch crowdsourcing tasks.
Mention this shortcoming in the Readme.md: the scores of Dutch-ish sentences are in the same range as those for Dutch ones.
I'll do this, thanks!
Train this model on Dutch sentences vs. altered Dutch sentences. Altering could mean replacing verb forms, swapping 'de' and 'het', removing the last few words, cutting words in half, or other changes that produce a sentence that seems Dutch but isn't.
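For concreteness, a minimal sketch of a couple of these alterations might look like the following (a hypothetical helper, not part of this repo; it assumes simple whitespace tokenisation and skips the verb-form replacement):

```python
import random

def corrupt_sentence(sentence, rng=random):
    """Turn a Dutch sentence into a near-Dutch negative example.

    Hypothetical helper: applies one of the alterations suggested above
    (swap 'de'/'het', drop the last few words, or cut a word in half).
    """
    words = sentence.split()
    if len(words) < 2:
        return sentence
    alteration = rng.choice(["swap_article", "truncate", "cut_word"])
    if alteration == "swap_article":
        # Swap the articles 'de' and 'het' wherever they occur.
        words = ["het" if w.lower() == "de" else "de" if w.lower() == "het" else w
                 for w in words]
    elif alteration == "truncate" and len(words) > 3:
        # Drop the last few words.
        words = words[:rng.randint(2, len(words) - 2)]
    elif alteration == "cut_word":
        # Cut a randomly chosen word in half.
        i = rng.randrange(len(words))
        words[i] = words[i][:max(1, len(words[i]) // 2)]
    return " ".join(words)

print(corrupt_sentence("De kat zit op de mat."))
```

Pairs of original and corrupted sentences could then serve as positive and negative training examples.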
I don't think this model is powerful enough to capture this contrast. The problem you're suggesting is more difficult than it looks. See e.g. work by Tal Linzen on the grammatical capacities of LSTMs.
Thanks for the fast and honest reply!
Probably, but you might get some credit for the effort of creating the data augmentation; plus it might work, and at least it will show how bad the model is (at detecting subtle linguistic mistakes). Although I'm not sure how much Tal Linzen has already done there.
I don't even think "Hij varkende de tekening." is bad Dutch. I know 'varken' is a noun, and here it's used as a verb, but it's correctly inflected for the position it's in. I can even imagine some contexts in which you could use it. E.g. suppose my friend has a rubber stamp with a pig on it, and they're stamping pig images on everything. I could say "Hij heeft al mijn spullen gevarkend! :(" and people in that situation would understand this (quite unusual) usage.
You can disagree with this example, but the point is that language is unbounded. You can make an infinite number of different grammatical sentences (limited only by time and creativity). Not having seen an example before doesn't make it bad. So the LSTM would have to go beyond the training data and generalize to a full grammar of Dutch. That's an extremely ambitious goal.
I'd encourage you to play with the model, though. Maybe just train it to compare actual Dutch with Dutch sentences shuffled at the word level (rather than shuffling characters). Most of the shuffled phrases should be nonsense so it's OK if there's an occasional grammatical sentence in the training data (a little bit of noise is fine.) As an additional check, you could just see whether the shuffled sentence is in the training data and, if so, shuffle it again.
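A rough sketch of that word-level shuffling, including the check against the training data, could look like this (hypothetical helper, not from this repo):

```python
import random

def shuffled_negative(sentence, training_sentences, max_tries=10, rng=random):
    """Build a nonsense negative example by shuffling a sentence at the word level.

    training_sentences is assumed to be a set of the original training data;
    if a shuffle happens to reproduce one of those sentences, try again.
    """
    words = sentence.split()
    candidate = sentence
    for _ in range(max_tries):
        rng.shuffle(words)
        candidate = " ".join(words)
        if candidate not in training_sentences and candidate != sentence:
            break
    return candidate  # a little bit of leftover noise is fine

train = {"de kat zit op de mat", "hij leest een boek in de tuin"}
print(shuffled_negative("de kat zit op de mat", train))
```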
I think we're at risk of starting a discussion on something we (mostly) agree on :) Unfortunately, I won't experiment with Dutch augmented to almost Dutch sentences (sorry).
I found this repo and thought it might help me, but concluded it was not usable for my particular problem, and wanted to document this for anyone stumbling upon this repo, which is valid for many other purposes.
Thanks for your comments!
I think what you're looking for is either something like a language model that outputs the perplexity for a particular sample of Dutch, or something like LanguageTool (https://www.languagetool.org/).
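As a toy illustration of the perplexity idea (a character-bigram model with add-one smoothing, nothing like the LSTM in this repo), something along these lines gives a rough signal:

```python
import math
from collections import Counter

def train_char_bigrams(corpus):
    """Count character bigrams and their contexts over a toy Dutch corpus."""
    bigrams, contexts = Counter(), Counter()
    for sentence in corpus:
        padded = "^" + sentence + "$"
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    vocab_size = len(set("".join(corpus))) + 2  # plus start/end markers
    return bigrams, contexts, vocab_size

def perplexity(sentence, model):
    """Per-character perplexity under the bigram model; lower = more Dutch-like."""
    bigrams, contexts, vocab_size = model
    padded = "^" + sentence + "$"
    log_prob = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab_size)  # add-one smoothing
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))

model = train_char_bigrams(["de kat zit op de mat", "hij leest een boek"])
print(perplexity("de hond eet", model))      # plausible Dutch characters
print(perplexity("xq zrr wfff qqq", model))  # gibberish, should score higher
```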
This model can't distinguish actual Dutch from something close to Dutch, for example:
Output:
It might be that the model learns common character patterns and does what I think many linguistic LSTMs do: it bluffs and doesn't really understand the sentence on a deep level.
It is a good idea to:
- Mention this shortcoming in the Readme.md: the scores of Dutch-ish sentences are in the same range as those for Dutch ones.
- Train this model on Dutch sentences vs. altered Dutch sentences (replacing verb forms, swapping 'de' and 'het', removing the last few words, cutting words in half, and similar).