chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Improve quote detection by finding start and end quotation marks #380

Open afriedman412 opened 1 year ago

afriedman412 commented 1 year ago

The current code for detecting quotes is pretty unsophisticated. It just sequentially pairs anything that token.is_quote deems a quotation mark and assumes those indexes are the quote boundaries. If there is an odd number of quotation marks, it throws an error. I've been doing quote detection on some unreliably formatted text lately, which has things like "»" used as bullet points and lots of unpredictable stray characters, so I came up with a workaround.

My version only matches specific types of quotation marks with their accepted counterparts. For example, if there is an opening quotation mark of a given type, it will pair it with the next mark that is an accepted closing counterpart. If it never finds a match, it ignores the singleton quotation mark. I did this by making a list of 11 approved tuples of Unicode code points (which live in .constants).

For example:

Bill told me I "shouldn‘t wear those pants" but I will.

In the current version, running quote detection here would raise an error because there are three quotation mark-like tokens in the sentence. Even if it didn't, it would return "shouldn" as a quote because textacy assumes sequential quotation marks are quote boundaries.
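To make the failure mode concrete, here is a rough sketch of the sequential pairing described above. It is not textacy's actual code: it works on raw characters instead of spaCy tokens, and `QUOTE_CHARS`, `naive_quote_spans`, and the `strict` flag are hypothetical names standing in for whatever token.is_quote accepts and however the library raises its error.

```python
# Hypothetical stand-in for the characters that token.is_quote would accept.
QUOTE_CHARS = {'"', "'", "\u2018", "\u2019", "\u201c", "\u201d", "\u00ab", "\u00bb"}


def naive_quote_spans(text, strict=True):
    """Pair quotation marks strictly in order of appearance (1st with 2nd, 3rd with 4th, ...)."""
    idxs = [i for i, ch in enumerate(text) if ch in QUOTE_CHARS]
    if strict and len(idxs) % 2 != 0:
        # Mirrors the behavior described in the issue: an odd count is an error.
        raise ValueError(f"odd number of quotation marks: {len(idxs)}")
    # zip() silently drops a trailing unpaired mark when strict is False.
    return [text[start + 1 : end] for start, end in zip(idxs[::2], idxs[1::2])]


sentence = 'Bill told me I "shouldn\u2018t wear those pants" but I will.'
try:
    naive_quote_spans(sentence)
except ValueError as err:
    print("error:", err)
print(naive_quote_spans(sentence, strict=False))
```

With the odd-count check disabled, the first two marks are paired regardless of type, so the only "quote" returned is the fragment before the stray ‘.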

My version takes the first quotation mark (q) and iterates through all the later quotation marks until it finds one (q_) where (ord(q.text), ord(q_.text)) is in the list of acceptable pairs. Here, that pair would be (34, 34) because a double quotation mark is Unicode code point 34.
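A minimal sketch of that matching loop, under a few assumptions: it operates on characters, not spaCy tokens, `QUOTE_PAIRS` is a hypothetical subset (the 11 tuples in .constants are not listed here), and it assumes that once a pair is found the search resumes after the closing mark. The function name `paired_quote_spans` is also made up for illustration.

```python
# Hypothetical subset of accepted (opening, closing) code point pairs.
QUOTE_PAIRS = {
    (34, 34),      # straight double quotes
    (8220, 8221),  # curly double quotes
    (8216, 8217),  # curly single quotes
    (171, 187),    # guillemets
}
QUOTE_CHARS = {chr(cp) for pair in QUOTE_PAIRS for cp in pair}


def paired_quote_spans(text):
    """Pair each quotation mark with the next mark that forms an accepted pair."""
    marks = [i for i, ch in enumerate(text) if ch in QUOTE_CHARS]
    spans = []
    pos = 0
    while pos < len(marks):
        start = marks[pos]
        match = None
        # Scan the later marks for the first accepted counterpart.
        for later in range(pos + 1, len(marks)):
            if (ord(text[start]), ord(text[marks[later]])) in QUOTE_PAIRS:
                match = later
                break
        if match is None:
            pos += 1  # singleton mark: ignore it and move on
        else:
            spans.append(text[start + 1 : marks[match]])
            pos = match + 1  # assumption: resume after the closing mark
    return spans


sentence = 'Bill told me I "shouldn\u2018t wear those pants" but I will.'
print(paired_quote_spans(sentence))
```

In this sketch the stray ‘ is skipped because (34, 8216) is not an accepted pair, and the opening mark is matched with the later " via (34, 34). A leading "»" used as a bullet is ignored the same way, since nothing after it completes a pair.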

bdewilde commented 1 year ago

Hi @afriedman412 , thanks for submitting these changes! I'm pretty sure I follow the updated logic, but it would be particularly helpful to have new tests (or just new test cases for an existing test) that confirm what we should expect to be handled by it. Would you be able to add some in here?

bdewilde commented 1 year ago

hey @afriedman412 , does the newer PR (https://github.com/chartbeat-labs/textacy/pull/382) supersede this one?