clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License
8.75k stars, 1.58k forks

new:irregular inflection of prefix verbs with known base #258

Closed JakobSteixner closed 5 years ago

JakobSteixner commented 5 years ago

This fix implements logic to correctly identify the (irregular) base of prefixed verbs. Old behavior:

>>> conjugate('gehen', (de.PAST, 2, de.SINGULAR)) 
'gingst' # correct
>>> conjugate('vorgehen', (de.PAST, 2, de.SINGULAR))
'gehtest vor' # incorrect

Explanation: since 'vorgehen' is not found in the lexicon, the default regular inflection strategy applies. The separable prefix is identified correctly, but the base form extracted this way is never checked against the lexicon, so the available information about its irregular inflection is lost.

New:

>>> conjugate('gehen', (de.PAST, 2, de.SINGULAR)) 
'gingst' # correct
>>> conjugate('vorgehen', (de.PAST, 2, de.SINGULAR))
'gingst vor' # correct

The fix adds a second pass through lemma after stripping the prefix, which identifies the known irregular inflection of the base form 'gehen'.
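The second-pass idea can be sketched roughly as follows. This is not the actual pattern.de code: the toy lexicon, the prefix list, and all helper names are hypothetical, and only the second person singular past is covered.

```python
# Toy data for illustration only; pattern.de's real lexicon is far larger.
IRREGULAR_PAST_2SG = {"gehen": "gingst", "kommen": "kamst"}
SEPARABLE_PREFIXES = ("vor", "an", "auf", "mit")


def regular_past_2sg(verb):
    """Default weak inflection: stem + 'test' (e.g. 'machen' -> 'machtest')."""
    stem = verb[:-2] if verb.endswith("en") else verb[:-1]
    return stem + "test"


def split_prefix(verb):
    """Return (prefix, base) if the verb starts with a listed separable prefix."""
    for prefix in SEPARABLE_PREFIXES:
        if verb.startswith(prefix) and len(verb) > len(prefix) + 2:
            return prefix, verb[len(prefix):]
    return "", verb


def past_2sg(verb):
    # First pass: the full verb may itself be in the lexicon.
    if verb in IRREGULAR_PAST_2SG:
        return IRREGULAR_PAST_2SG[verb]
    prefix, base = split_prefix(verb)
    # Second pass: look up the base that remains after stripping the prefix.
    form = IRREGULAR_PAST_2SG.get(base) or regular_past_2sg(base)
    return "{} {}".format(form, prefix) if prefix else form
```

Without the second lookup, `past_2sg('vorgehen')` would fall straight through to `regular_past_2sg` and produce the incorrect 'gehtest vor' shown above.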

In addition, blacklists have been added for verbs that only look like prefix verbs or like Latinate verbs with the suffix 'ier(en)', so that the parser does not apply its special treatment to them.
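As a rough illustration of how such blacklists could gate the special handling (the entries and function names below are assumptions for this sketch, not the lists shipped in the PR):

```python
# Hypothetical blacklist entries, chosen only to illustrate the idea.
# 'frieren' and 'verlieren' end in 'ieren' but are native strong verbs.
NOT_LATINATE = frozenset({"frieren", "verlieren"})
# 'angeln' starts with 'an' but is not 'an' + a base verb.
NOT_PREFIXED = frozenset({"angeln"})


def looks_latinate(verb):
    """True if the verb should get the special '-ieren' treatment."""
    return verb.endswith("ieren") and verb not in NOT_LATINATE


def looks_prefixed(verb, prefixes=("an", "vor", "auf")):
    """True if the verb should be split into separable prefix + base."""
    if verb in NOT_PREFIXED:
        return False
    return any(verb.startswith(p) and len(verb) > len(p) for p in prefixes)
```

A blacklist is a blunt tool, but it keeps the heuristic itself simple: the suffix and prefix checks stay purely string-based, and known false positives are listed explicitly.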

coveralls commented 5 years ago

Coverage increased (+0.2%) to 68.523% when pulling 924cf1e1727e5ae61e7cbea0b70d3204440fb990 on JakobSteixner:feature/irregular_inflection_decomposed_prefix_verbs into 53245196139c6ef26dc9c34873dda8a16f236d23 on clips:development.

JakobSteixner commented 5 years ago

@thalelinh test_web.TestSearchEngine.test_search_twitter is failing under Python 2.7

tales-aparecida commented 5 years ago

Which tests?

JakobSteixner commented 5 years ago

> Which tests?

test_web.py, TestSearchEngine.test_search_twitter -- L503 with 0 results (1 expected), maybe an API limit issue? Nothing either of us touched, I'm skipping it for now.

log: https://travis-ci.org/clips/pattern/builds/514015081?utm_source=github_status&utm_medium=notification

tales-aparecida commented 5 years ago

> > Which tests?
>
> test_web.py, TestSearchEngine.test_search_twitter -- L503 with 0 results (1 expected), maybe an API limit issue? Nothing either of us touched, I'm skipping it for now.
>
> log: https://travis-ci.org/clips/pattern/builds/514015081?utm_source=github_status&utm_medium=notification

Yeah, I added a retry loop in my PR (which focused on fixing the CI). It worked there, though it's not the best solution :sweat_smile:
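For readers unfamiliar with the approach, a retry loop around a flaky call might look something like this. It is a generic sketch, not the code from that PR; the `retry` helper and its parameters are hypothetical.

```python
import time


def retry(call, attempts=3, delay=0.0, accept=bool):
    """Call `call()` up to `attempts` times until `accept(result)` is true.

    Returns the last result either way, so the caller's assertion still
    fails if every attempt came back empty.
    """
    result = None
    for attempt in range(attempts):
        result = call()
        if accept(result):
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before trying again
    return result
```

A test could then wrap the live search in `retry(...)` and assert on the returned results. This hides transient empty responses but does not make the test deterministic, which is presumably why it is "not the best solution".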

tales-aparecida commented 5 years ago

My guess now is that a seed(0) call is missing after _test_classifier instantiates a Classifier during test_slp. If you look at the definition of shuffled, which the SLP constructor calls without the seed parameter, you will see that it calls seed(None). That reseeds the generator from the current time (or OS entropy), which we obviously don't want to happen in a test case. :smile:
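The effect can be reproduced with the standard library alone. The `shuffled` below is a simplified stand-in written for this illustration, assumed to behave like the helper described above, not a copy of pattern's code.

```python
import random


def shuffled(items, seed=None):
    """Return a shuffled copy of `items`.

    With seed=None the global generator is reseeded from the system time
    or OS entropy, so any earlier random.seed(0) call is discarded.
    """
    random.seed(seed)
    return sorted(items, key=lambda _: random.random())


def draw_after_unseeded_shuffle(reseed):
    """Simulate a test: fix the seed, shuffle without a seed, then draw."""
    random.seed(0)
    shuffled(range(5))      # silently replaces the fixed seed
    if reseed:
        random.seed(0)      # the seed(0) suspected to be missing
    return random.random()
```

With `reseed=True` the draw is identical on every run; with `reseed=False` it depends on whatever `seed(None)` picked up, which is the kind of nondeterminism suspected in test_slp.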

tales-aparecida commented 5 years ago

The Twitter test error was actually just chance. It's a query using a given string; sometimes there aren't any results.

JakobSteixner commented 5 years ago

> The Twitter test error was actually just chance. It's a query using a given string; sometimes there aren't any results.

I thought so. Now it's failing with a version conflict in the pytest setup, though...

JakobSteixner commented 5 years ago

Appears to pass now!