evanmiltenburg / Dutch-checker

Language model that checks if a sequence of characters is Dutch.
Apache License 2.0

Dutchish example #1

Closed prhbrt closed 7 years ago

prhbrt commented 7 years ago

This model can't distinguish actual Dutch from something close to Dutch, for example:

dutchish = ['Een man loopt gisteren naar straat.', 'Ik zie een', 'Hij staan in het keuken.', 'Hij varkende de tekening.']
# ...
for sentence in dutch + dutchish + english + garbage:

Output:

0.9985290765762329    Een man loopt op straat.
0.999811589717865     Ik zie een peuter met een schep.
0.9999125003814697    Hij staat in de keuken.
0.9995104074478149    Een man loopt gisteren naar straat.
0.9942237734794617    Ik zie een
0.9998989105224609    Hij staat in het keuken.
0.9995352029800415    Hij varkende de tekening.
0.03872348740696907   A man crossing the street
0.012580601498484612  A toddler with a shovel.
0.611192524433136     He's in the kitchen.
0.3727759122848511    aasdg gdasdf asdf
0.08913702517747879   asd trh afd asg
0.3877716064453125    sdfasdf

It might be that the model learns common character patterns and does what I think more linguistic LSTMs do: it bluffs and doesn't really understand the sentence on a deep level.

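It is a good idea to:

- Mention this shortcoming in the Readme.md: the scores of Dutch-ish sentences are in the same range as those for Dutch ones.
- Train this model on Dutch sentences vs. altered Dutch sentences. Altering can be replacing verb forms, replacing 'de' with 'het', removing the last few words, cutting words in half, and other replacements that would create a sentence that seems Dutch, but isn't.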

evanmiltenburg commented 7 years ago

> This model can't distinguish actual Dutch from something close to Dutch

True! The model serves as a weak filter that prevents the worst spam, e.g. in Dutch crowdsourcing tasks.

> Mention this shortcoming in the Readme.md: the scores of Dutch-ish sentences are in the same range as those for Dutch ones.

I'll do this, thanks!

> Train this model on Dutch sentences vs. altered Dutch sentences. Altering can be replacing verb forms, replacing 'de' with 'het', removing the last few words, cutting words in half, and other replacements that would create a sentence that seems Dutch, but isn't.

I don't think this model is powerful enough to capture this contrast. The problem you're suggesting is more difficult than it looks. See e.g. work by Tal Linzen on the grammatical capacities of LSTMs.

prhbrt commented 7 years ago

Thanks for the fast and honest reply!

> I don't think this model is powerful enough to capture this contrast. The problem you're suggesting is more difficult than it looks. See e.g. work by Tal Linzen on the grammatical capacities of LSTMs.

Probably, but you might get some credit for the effort of creating the data augmentation. Plus, it might work, and at the very least it will show how bad the model is at detecting subtle linguistic mistakes. I'm not sure how much of this Tal Linzen has already done, though.
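For readers who want to try the augmentation idea from this thread, here is a minimal sketch of three of the alterations mentioned (swapping 'de' and 'het', removing the last few words, cutting a word in half). It is not part of this repo: the function names, the label convention (1 for Dutch, 0 for altered), and the whitespace tokenization are all assumptions made for illustration.

```python
import random

# Hypothetical helpers, not from the Dutch-checker repo.

def swap_articles(sentence):
    """Swap the lowercase tokens 'de' and 'het'."""
    swap = {"de": "het", "het": "de"}
    return " ".join(swap.get(w, w) for w in sentence.split())

def drop_last_words(sentence, n=2):
    """Remove the last n words, always keeping at least one word."""
    words = sentence.split()
    return " ".join(words[:max(1, len(words) - n)])

def cut_word_in_half(sentence, rng):
    """Truncate one randomly chosen word (length >= 4) to its first half."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return sentence
    i = rng.choice(candidates)
    words[i] = words[i][:len(words[i]) // 2]
    return " ".join(words)

def make_pairs(sentences, seed=0):
    """Return (text, label) pairs: label 1 for originals, 0 for altered copies."""
    rng = random.Random(seed)
    pairs = []
    for s in sentences:
        altered = [swap_articles(s), drop_last_words(s), cut_word_in_half(s, rng)]
        pairs.append((s, 1))
        # Skip alterations that left the sentence unchanged.
        pairs.extend((a, 0) for a in altered if a != s)
    return pairs
```

For example, `swap_articles("Hij staat in de keuken.")` yields "Hij staat in het keuken.", one of the Dutch-ish sentences from the output above. Replacing verb forms is left out because it would need a morphological resource.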

evanmiltenburg commented 7 years ago

I don't even think "Hij varkende de tekening." is bad Dutch. I know 'varken' is a noun, and here it's used as a verb, but it's correctly inflected for the position it's in. I can even imagine some contexts in which you could use it. E.g. suppose my friend has a rubber stamp with a pig on it, and they're stamping pig images on everything. I could say "Hij heeft al mijn spullen gevarkend! :(" and people in that situation would understand this (quite unusual) usage.

You can disagree with this example, but the point is that language is unbounded. You can make an infinite number of different grammatical sentences (only limited by time and creativity). Not having seen an example before doesn't make it bad. So the LSTM should go beyond the training data and generalize to a full grammar of Dutch. That's an extremely ambitious goal.

I'd encourage you to play with the model, though. Maybe just train it to compare actual Dutch with Dutch sentences shuffled at the word level (rather than shuffling characters). Most of the shuffled phrases should be nonsense, so it's OK if there's an occasional grammatical sentence in the training data (a little bit of noise is fine). As an additional check, you could just see whether the shuffled sentence is in the training data and, if so, shuffle it again.
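The word-level shuffle with the "reshuffle if it is already in the training data" check could look roughly like this. It is a sketch only: `shuffled_negative`, `known_sentences`, and `max_tries` are hypothetical names, and sentences are assumed to be single-space separated.

```python
import random

def shuffled_negative(sentence, known_sentences, rng, max_tries=10):
    """Shuffle the words of a sentence to build a negative example.

    known_sentences is a set of real training sentences (it should include
    the input sentence itself). If a shuffle reproduces a known sentence,
    shuffle again, up to max_tries times. Returns None if no new ordering
    is found, e.g. for one-word sentences.
    """
    words = sentence.split()
    for _ in range(max_tries):
        candidate = words[:]
        rng.shuffle(candidate)
        result = " ".join(candidate)
        if result not in known_sentences:
            return result
    return None
```

Usage would be along the lines of `shuffled_negative("Een man loopt op straat.", set(dutch), random.Random(0))`, where `dutch` is the list of real sentences.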

prhbrt commented 7 years ago

I think we're at risk of starting a discussion on something we (mostly) agree on :) Unfortunately, I won't experiment with Dutch sentences augmented into almost-Dutch ones (sorry).

I found this repo and thought it might help me, but concluded that it was not usable for my particular problem. I wanted to document this for anyone stumbling upon this repo, which is valid for many other purposes.

evanmiltenburg commented 7 years ago

Thanks for your comments!

I think what you're looking for is either something like a language model that outputs the perplexity for a particular sample of Dutch, or something like LanguageTool (https://www.languagetool.org/).
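To make the perplexity suggestion concrete, here is a toy sketch of a character-bigram language model with add-one smoothing. It is not the model in this repo and far too small to judge grammaticality; the class name `CharBigramLM` and the `<s>`/`</s>` boundary markers are assumptions chosen for illustration. Lower perplexity means the text looks more like the training data.

```python
import math
from collections import Counter

class CharBigramLM:
    """Toy character-bigram model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.bigrams = Counter()   # counts of (previous char, char)
        self.contexts = Counter()  # counts of the previous char
        self.vocab = set()
        for s in sentences:
            chars = ["<s>"] + list(s) + ["</s>"]
            self.vocab.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.bigrams[(a, b)] += 1
                self.contexts[a] += 1

    def perplexity(self, sentence):
        """Per-character perplexity of a sentence under the model."""
        chars = ["<s>"] + list(sentence) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        log_prob = 0.0
        for a, b in zip(chars, chars[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.contexts[a] + v)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(chars) - 1))
```

A real setup would train on a large Dutch corpus and use a stronger model; the point here is only how a perplexity score is computed from per-character probabilities.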