NaturalNode / natural

general natural language facilities for node

MIT License

tokenize phrasal-verbs #473

Open ayman-ibrahim opened 5 years ago

ayman-ibrahim commented 5 years ago

Is there a way to tokenize a sentence while taking phrasal verbs into account? For example:

"The flight take off at three o'clock"

The output should be: [the, flight, take off, at, three, o'clock]

That is, "take off" should be tokenized as a single token.

Hugo-ter-Doest commented 5 years ago

IMHO that is not what tokenization is meant for. Tokenization splits a text into words (and punctuation, if necessary), and "take off" consists of two words. Combining them into a phrasal verb requires partial parsing or chunking.
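To make the distinction concrete, here is a minimal sketch of the two stages: a plain tokenizer that splits on whitespace, followed by a separate merge step that joins adjacent tokens found in a phrasal-verb lexicon. Everything here is an assumption for illustration: `PHRASAL_VERBS`, `tokenize`, and `mergePhrasalVerbs` are hypothetical names, the lexicon is a three-entry toy, and the tokenizer is a stand-in rather than one of natural's tokenizers.

```javascript
// Hypothetical toy lexicon; a real list would be far larger.
const PHRASAL_VERBS = new Set(['take off', 'give up', 'look after']);

// Naive whitespace tokenizer (a stand-in, not natural's tokenizer).
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Second stage: merge adjacent token pairs that appear in the lexicon.
function mergePhrasalVerbs(tokens, lexicon = PHRASAL_VERBS) {
  const out = [];
  for (let i = 0; i < tokens.length; i++) {
    const pair = i + 1 < tokens.length ? `${tokens[i]} ${tokens[i + 1]}` : null;
    if (pair !== null && lexicon.has(pair)) {
      out.push(pair);
      i++; // skip the particle that was just consumed
    } else {
      out.push(tokens[i]);
    }
  }
  return out;
}

const tokens = tokenize("The flight take off at three o'clock");
console.log(mergePhrasalVerbs(tokens));
// [ 'the', 'flight', 'take off', 'at', 'three', "o'clock" ]
```

A lookup like this only matches exact surface forms, so inflected variants such as "takes off" or "took off" would need stemming or lemmatization first, and it cannot tell a phrasal verb from a verb that merely happens to be followed by the same word.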

ayman-ibrahim commented 5 years ago

@Hugo-ter-Doest OK, do you know if there's a way to combine phrasal verbs in the natural library?

Hugo-ter-Doest commented 5 years ago

It's not in natural yet, but I'm working on it so that it can be used for named entity recognition. You can preview the CYK and Earley parsers in this branch: https://github.com/Hugo-ter-Doest/natural/tree/NER/

The parsers are in lib/natural/parsers; a chunker based on the Earley parser is in lib/natural/NER.

Feel free to use it already, but it may still change.

ayman-ibrahim commented 5 years ago

Cool, I'll have a look. Thanks.

lazharichir commented 5 years ago

You could tokenize your sentence, tag each token's part of speech, and then look for tag patterns such as VERB + DET or VERB + PREPOSITION. I use this approach to find noun phrases (JJ|NN+).
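A rough sketch of that idea, assuming the tokens have already been tagged with Penn Treebank-style tags. The tagged input below is written by hand rather than produced by a tagger, and `mergeVerbParticle`, `VERB_TAGS`, and `PARTICLE_TAGS` are hypothetical names. Note one deliberate deviation: the sketch merges a verb only with a following particle (`RP`) instead of any preposition (`IN`), because VERB + `IN` would also merge ordinary sequences such as "arrives at". Adding `'IN'` to `PARTICLE_TAGS` gives the looser pattern.

```javascript
// Penn Treebank-style tags are assumed throughout.
const VERB_TAGS = new Set(['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']);
const PARTICLE_TAGS = new Set(['RP']); // add 'IN' for the looser VERB + PREPOSITION pattern

// Merge each verb with an immediately following particle.
function mergeVerbParticle(tagged) {
  const out = [];
  for (let i = 0; i < tagged.length; i++) {
    const cur = tagged[i];
    const next = tagged[i + 1];
    if (next && VERB_TAGS.has(cur.tag) && PARTICLE_TAGS.has(next.tag)) {
      // Keep the verb's tag on the merged token.
      out.push({ token: `${cur.token} ${next.token}`, tag: cur.tag });
      i++; // skip the particle
    } else {
      out.push(cur);
    }
  }
  return out;
}

// Hand-tagged example input (an assumption, not tagger output).
const taggedSentence = [
  { token: 'the', tag: 'DT' },
  { token: 'flight', tag: 'NN' },
  { token: 'take', tag: 'VBP' },
  { token: 'off', tag: 'RP' },
  { token: 'at', tag: 'IN' },
  { token: 'three', tag: 'CD' },
  { token: "o'clock", tag: 'RB' },
];

console.log(mergeVerbParticle(taggedSentence).map((t) => t.token));
// [ 'the', 'flight', 'take off', 'at', 'three', "o'clock" ]
```

The quality of the result depends entirely on the tagger: if "off" is tagged as a preposition rather than a particle, this stricter pattern will not fire.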

privateOmega commented 5 years ago

@Hugo-ter-Doest Do you have a set timeline as to when you would be able to integrate the code into Natural's codebase?

lazharichir commented 5 years ago

For now, you can implement that with some form of pattern matching, similar to what spaCy offers: walk the array of tokens and look for whatever patterns you need (e.g. NOUN followed by PREP, or any number of NOUN/ADJ tokens followed by PREP).

You can look at spaCy's matcher code (Python) and port it to Node and natural's token structure: https://github.com/explosion/spaCy/tree/master/spacy/matcher
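As a starting point, walking the token array could look like the sketch below. It is not a port of spaCy's matcher and does not use natural's API; `matchAt`, `findMatches`, the `{ tags, oneOrMore }` pattern format, and the hand-tagged input are all assumptions made for illustration. The matcher is greedy and does not backtrack, so a repeated step whose tags overlap with the next step's tags can cause a match to be missed.

```javascript
// Try to match `pattern` starting at index `start`.
// Returns the exclusive end index of the match, or -1 if there is none.
function matchAt(tagged, start, pattern) {
  let pos = start;
  for (const step of pattern) {
    let count = 0;
    while (pos < tagged.length && step.tags.includes(tagged[pos].tag)) {
      pos++;
      count++;
      if (!step.oneOrMore) break; // a plain step consumes exactly one token
    }
    if (count === 0) return -1; // this step matched nothing
  }
  return pos;
}

// Collect non-overlapping [start, end) spans, scanning left to right.
function findMatches(tagged, pattern) {
  const spans = [];
  let i = 0;
  while (i < tagged.length) {
    const end = matchAt(tagged, i, pattern);
    if (end > i) {
      spans.push([i, end]);
      i = end;
    } else {
      i++;
    }
  }
  return spans;
}

// "Any number of NOUN/ADJ tokens followed by PREP", in Penn Treebank-style tags.
const nounsThenPrep = [
  { tags: ['JJ', 'NN', 'NNS'], oneOrMore: true },
  { tags: ['IN'] },
];

// Hand-tagged input (an assumption, not tagger output).
const taggedTokens = [
  { token: 'the', tag: 'DT' },
  { token: 'early', tag: 'JJ' },
  { token: 'morning', tag: 'NN' },
  { token: 'flight', tag: 'NN' },
  { token: 'to', tag: 'IN' },
  { token: 'Cairo', tag: 'NNP' },
];

for (const [start, end] of findMatches(taggedTokens, nounsThenPrep)) {
  console.log(taggedTokens.slice(start, end).map((t) => t.token).join(' '));
}
// early morning flight to
```

Returning index spans rather than merged strings leaves it to the caller to decide whether to merge the tokens, label them, or simply report them.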