Abbreviation detection not working where short form contains a space followed by digits

ICLRandD / Blackstone

A spaCy pipeline and model for NLP on unstructured legal text.

https://research.iclr.co.uk

Apache License 2.0


Abbreviation detection not working where short form contains a space followed by digits #4

Closed ICLRandD closed 4 years ago

ICLRandD commented 5 years ago

The current implementation of AbbreviationDetector() does not handle abbreviations in which the short form is followed by a space and then a number.

For example, in this scenario:

The Proceeds of Crime Act 2002 ("PoCA 2002")

The abbreviation is not matched.

The original implementation in scispaCy does not appear to have been built to handle instances in which the short form is bounded by quote marks.
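To make the two complications concrete (quote marks around the short form, and a space followed by digits inside it), here is a minimal sketch of a pre-processing step that pulls short-form candidates out of parentheses. It is not code from Blackstone or scispaCy; the helper name `short_form_candidates` and the `QUOTE_CHARS` constant are hypothetical, and it only illustrates one possible way of normalising the candidate before any matching is attempted.

```python
import re

# Hypothetical constant: straight and curly quote marks that may wrap a short form.
QUOTE_CHARS = "\"'“”‘’"


def short_form_candidates(text):
    """Return the contents of each parenthesised span, with surrounding quote marks removed.

    Internal spaces and digits are kept, so 'PoCA 2002' survives as one candidate.
    """
    candidates = []
    for match in re.finditer(r"\(([^()]+)\)", text):
        # Trim whitespace, then any wrapping quote marks, then whitespace again.
        candidate = match.group(1).strip().strip(QUOTE_CHARS).strip()
        if candidate:
            candidates.append(candidate)
    return candidates


print(short_form_candidates('The Proceeds of Crime Act 2002 ("PoCA 2002")'))
# → ['PoCA 2002']
```

A detector that receives the candidate in this form no longer has to treat the quote marks as part of the short form.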

philgooch commented 4 years ago

You might be interested in an alternative Python implementation of Schwartz-Hearst which handles this scenario.

https://github.com/philgooch/abbreviation-extraction

E.g.

pip install abbreviations

In [1]: from abbreviations import schwartz_hearst

In [2]: schwartz_hearst.extract_abbreviation_definition_pairs(doc_text='The Proceeds of Crime Act 2002 ("PoCA 2002")')
Out[2]: {'PoCA 2002': 'Proceeds of Crime Act 2002'}
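For readers unfamiliar with Schwartz-Hearst, the following is a rough sketch of the right-to-left long-form search at the core of the published algorithm. It is not the source of the `abbreviations` package, and the function name `find_best_long_form` is only an illustrative choice. Non-alphanumeric characters in the short form (such as the space) are skipped, and digits are matched like letters, which is why a short form such as "PoCA 2002" can be aligned with its definition.

```python
def find_best_long_form(short_form, long_form):
    """Align short_form against the end of long_form, scanning right to left.

    Returns the shortest suffix of long_form (starting at a word boundary)
    that contains every alphanumeric character of short_form in order,
    or None if no such alignment exists.
    """
    l_index = len(long_form) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        curr = short_form[s_index].lower()
        if not curr.isalnum():
            continue  # spaces and punctuation in the short form are ignored
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and long_form[l_index].lower() != curr) or (
            s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Start the long form at the beginning of the word that was matched last.
    start = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[start:]


print(find_best_long_form("PoCA 2002", "The Proceeds of Crime Act 2002"))
# → Proceeds of Crime Act 2002
```

A complete implementation also has to extract the candidate pair from the sentence and validate it (for example, by length), which this sketch leaves out.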