chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Updating triples.py to rely on spacy-wordnet and sense tagging #370

Open dzitney1 opened 1 year ago

dzitney1 commented 1 year ago

Context

Thank you for all of the work you have put into this library; it has helped me immensely. I am inexperienced at submitting pull requests on major repos and blissfully ignorant of what goes into compatibility testing and proper generation of documentation, but I did not want to let that deter me from submitting something.

For quote attribution, triples.py currently relies on constants.REPORTING_VERBS. The comment on line 201 of triples.py expresses interest in implementing a model to perform this functionality.
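For context, the current lemma-matching check can be sketched roughly like this. This is a minimal stand-in sketch, not textacy's actual code: `FakeToken` is a mock of spaCy's `Token`, and `_reporting_verbs` here is a small illustrative subset of constants.REPORTING_VERBS.

```python
from dataclasses import dataclass

VERB = 100  # stand-in for spacy.symbols.VERB

@dataclass
class FakeToken:
    # Mock of a spaCy Token; only the attributes the check needs
    pos: int
    lemma_: str

# Illustrative subset of textacy's constants.REPORTING_VERBS
_reporting_verbs = {"say", "tell", "report", "claim", "argue"}

def is_reporting_verb(tok) -> bool:
    # Current approach: exact lemma membership test against a static list
    return tok.pos == VERB and tok.lemma_ in _reporting_verbs

print(is_reporting_verb(FakeToken(VERB, "say")))      # True
print(is_reporting_verb(FakeToken(VERB, "whisper")))  # False: lemma not in the list
```

The limitation this issue targets is visible here: any reporting verb absent from the static list, such as "whisper", is missed.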

Proposed solution

This solution would rely on an additional dependency, spacy-wordnet, which in turn relies on nltk. Instead of string-matching lemmas against a list of reporting verbs, it may be possible to use the sense tags from the WordNet corpus.

Beyond the additional support required for the new dependencies, I have found the following solution (which requires two changes) to work for me.

  1. When calling make_spacy_doc

    # en_core_web_trf is not required here; this also works with en_core_web_sm
    nlp = spacy.load('en_core_web_trf')

The following could possibly be implemented with some sort of config option in core.py, or added by the user in their own function:

    nlp.add_pipe("spacy_wordnet", after='tagger')
    doc = textacy.make_spacy_doc(text, lang=nlp)

2. Updating lines 253 and 254 of triples.py
From

    tok.pos == VERB and tok.lemma_ in _reporting_verbs

To

    tok.pos == VERB and tok._.wordnet.lemmas() and tok._.wordnet.lemmas()[0].synset().lexname() == 'verb.communication'
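The sense-based condition could be wrapped in a small predicate. Below is a pure-Python sketch of just that logic: the `Fake*` classes are hypothetical mocks standing in for the nltk WordNet lemma/synset objects that spacy-wordnet exposes via `tok._.wordnet.lemmas()`, so the example runs without spaCy or nltk installed.

```python
from dataclasses import dataclass
from typing import List

VERB = 100  # stand-in for spacy.symbols.VERB

@dataclass
class FakeSynset:
    # Mock of an nltk Synset; lexname() is the real nltk accessor name
    _name: str
    def lexname(self) -> str:
        return self._name

@dataclass
class FakeLemma:
    # Mock of an nltk Lemma; synset() is the real nltk accessor name
    _syn: FakeSynset
    def synset(self) -> FakeSynset:
        return self._syn

class FakeWordnetExt:
    # Mock of the spacy-wordnet token extension (tok._.wordnet)
    def __init__(self, lemmas: List[FakeLemma]):
        self._lemmas = lemmas
    def lemmas(self) -> List[FakeLemma]:
        return self._lemmas

def is_reporting_verb(pos: int, wordnet: FakeWordnetExt) -> bool:
    # Proposed approach: a verb whose first WordNet sense belongs to
    # the 'verb.communication' lexicographer file counts as reporting
    lemmas = wordnet.lemmas()
    return (pos == VERB
            and bool(lemmas)
            and lemmas[0].synset().lexname() == "verb.communication")

whisper = FakeWordnetExt([FakeLemma(FakeSynset("verb.communication"))])
run = FakeWordnetExt([FakeLemma(FakeSynset("verb.motion"))])
print(is_reporting_verb(VERB, whisper))  # True
print(is_reporting_verb(VERB, run))      # False
```

The upside over the static list is coverage: verbs like "whisper" that never appear in REPORTING_VERBS but are communication verbs in WordNet would be picked up automatically. The empty-lemmas guard matters because tokens with no WordNet entry would otherwise raise an IndexError.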