TimSchopf / KeyphraseVectorizers

Set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix.
https://arxiv.org/abs/2210.05245
BSD 3-Clause "New" or "Revised" License

Regex from the paper? #32

Open turian opened 1 year ago

turian commented 1 year ago

Can you please include, at least in the documentation, the regex from the paper?

In this code, the documented default is "to only select keyphrases that have 0 or more adjectives, followed by 1 or more nouns."

In the paper, however, the POS pattern is described as "arbitrary parts-of-speech separated by a hyphen, followed by zero or more nouns OR zero or one verb (gerund or present or past participle), followed by zero or more adjectives, followed by one or more nouns".
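
To make the difference concrete, here is a rough, self-contained sketch of how the two descriptions diverge on a tagged token sequence. It does not use the library. `extract_phrases` is a hypothetical helper written for this example, the tags are assumed to be Penn Treebank style, and `<J.*>*<N.*>+` is assumed to be the default `pos_pattern`. The second pattern is only my reading of the last three clauses of the paper's sentence (optional participle verb, adjectives, nouns); it leaves out the hyphen clause and is not the regex from the paper.

```python
import re

def extract_phrases(tagged_tokens, tag_pattern):
    """Return token spans whose tag sequence matches a <TAG>-style pattern.

    Hypothetical helper for illustration only; not the library's code.
    """
    # Encode the tag sequence as one string, e.g. "<JJ><NN><NNS>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    # Treat each <...> as one unit so quantifiers apply to the whole tag,
    # and keep '.' from matching across tag boundaries.
    regex = (
        tag_pattern.replace("<", "(?:<(?:")
        .replace(">", ")>)")
        .replace(".", "[^<>]")
    )
    phrases = []
    for match in re.finditer(regex, tag_string):
        start = tag_string[: match.start()].count("<")  # index of first token
        end = start + match.group().count("<")          # one '<' per matched tag
        phrases.append(" ".join(tok for tok, _ in tagged_tokens[start:end]))
    return phrases

# Assumed default: zero or more adjectives, then one or more nouns.
DEFAULT_PATTERN = "<J.*>*<N.*>+"
# Assumption: a partial reading of the paper's description, without the hyphen clause.
PARTIAL_PAPER_PATTERN = "<VBG|VBN>?<J.*>*<N.*>+"

tagged = [("running", "VBG"), ("large", "JJ"), ("language", "NN"), ("models", "NNS")]
print(extract_phrases(tagged, DEFAULT_PATTERN))
print(extract_phrases(tagged, PARTIAL_PAPER_PATTERN))
```

Under these assumptions the first call yields only the adjective-noun span, while the second also pulls in the leading participle, which is why having the exact regex from the paper in the documentation would help.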