Living-with-machines / T-Res

A Toponym Resolution Pipeline for Digitised Historical Newspapers
https://living-with-machines.github.io/T-Res/

Bug: API Error on • character #176

Open lukehare opened 1 year ago

lukehare commented 1 year ago

The • character appears relatively frequently in our newspaper data, and the toponym resolution pipeline doesn't know how to handle it. This causes the API to return an error.

E.g. Input:

{'sentence': ' • - ST G pOllO-P• FERRIS - • - , i '}

Output:

<Response [500]>

kallewesterling commented 1 year ago

Couldn't there be a regex search-and-replace for something like this? I think that's what Defoe does with some of this stuff...

[a-zA-Z]+$

Obviously, you might want to include - in there still...
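
A minimal sketch of the kind of search-and-replace suggested here, using Python's `re` module. The helper name `clean_sentence` and the whitelist of characters are assumptions made for illustration; they are not taken from T-Res or Defoe. Anything outside the whitelist (such as •) is replaced with a space, and `-` is kept.

```python
import re

# Whitelist (an assumption for illustration): ASCII letters, digits,
# whitespace and a little basic punctuation, including "-".
_DISALLOWED = re.compile(r"[^a-zA-Z0-9\s.,;:'!?()&-]")


def clean_sentence(sentence: str) -> str:
    """Replace disallowed characters with a space, then collapse whitespace."""
    replaced = _DISALLOWED.sub(" ", sentence)
    return re.sub(r"\s+", " ", replaced).strip()


print(clean_sentence(" • - ST G pOllO-P• FERRIS - • - , i "))
# prints: - ST G pOllO-P FERRIS - - , i
```

Note that an ASCII-only whitelist like this one would also strip accented letters, so it would probably need widening for real place names.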

fedenanni commented 1 year ago

I'll look into it to understand exactly where in the pipeline this happens, as it might be the NER, DeezyMatch or REL that is crashing. Based on that, we can decide how to handle it. But I agree with @kallewesterling that we can then quickly fix it with a regex.

kasparvonbeelen commented 1 year ago

@fedenanni @lukehare @kallewesterling I'd guess this is caused by the tokenizer. In this case, it should be straightforward to add special tokens.

fedenanni commented 1 year ago

@lukehare I'm looking into it (see the work-in-progress PR: https://github.com/Living-with-machines/toponym-resolution/pull/177), but from a first test the bug does not seem to be in the pipeline. I have just added this test, and the text goes through the entire pipeline without an issue.

fedenanni commented 1 year ago

Can you check if the issue is on the API side?

lukehare commented 1 year ago

I am still seeing the error, unfortunately. My logs suggest it is coming from DeezyMatch / the candidate_ranker. I have tried it both via the API and running locally, and I get the same result. Interestingly, though, it doesn't appear to be caused specifically by the • character: I have been able to get it to work by slightly changing the input text (deleting some characters) while leaving that character in.

See logs:

>>> resolved = geoparser.run_text(
...         " • - ST G pOllO-P• FERRIS - • - , i ",
...     )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lukehare/toponym-resolution/geoparser/pipeline.py", line 226, in run_text
    sentence_dataset = self.run_sentence(
  File "/home/lukehare/toponym-resolution/geoparser/pipeline.py", line 149, in run_sentence
    wk_cands, self.myranker.already_collected_cands = self.myranker.find_candidates(mentions)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 372, in find_candidates
    cands, self.already_collected_cands = self.run(queries)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 349, in run
    return self.deezy_on_the_fly(queries)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 287, in deezy_on_the_fly
    candidates = candidate_ranker(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/candidateRanker.py", line 327, in candidate_ranker
    tmp_dirname = query_vector_gen(query, model, train_vocab, dl_inputs, verbose)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/utils_candidate_ranker.py", line 60, in query_vector_gen
    test_model(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/rnn_networks.py", line 594, in test_model
    pred = model(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/rnn_networks.py", line 878, in forward
    x1_embs_not_packed = self.emb(x1_seq)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Whereas this works...

>>> resolved = geoparser.run_text("• - ST G pOllO-P• FERR")
>>> resolved
[{'mention': 'G', 'candidates': {'Q133083': 0.985, None: 0.316}, 'ner_score': 0.579, 'pos': 7, 'sent_idx': 0, 'end_pos': 8, 'tag': 'LOC', 'sentence': '• - ST G pOllO-P• FERR', 'prediction': 'Q133083', 'ed_score': 0.985, 'latlon': [-26.0, 28.0], 'wkdt_class': 'Q191093'}]

Other examples that failed:

{'sentence': ' BY HER LETTERS W PATENT, corner of Deansgate, and B • RatNo 1, BLACKFRIARSeSTREET, Agents to the Corporation', 'place': 'Manchester, Greater Manchester, England'}
 output: <Response [500]>

{'sentence': ' - experience, Who wil! take every precaution tO promote theitealth tte View to tako plaee o •ednes•lay, July sth', 'place': 'Manchester, Greater Manchester, England'}
 output: <Response [500]>

{'sentence': " 5, N • ' Buildings, Market-street", 'place': 'Manchester, Greater Manchester, England'}
 output: <Response [500]>

{'sentence': ' built expressly for the Liverpo a • - York trade • al , is equ • JODWOODS T A K E S, JULY 26TH, 1848', 'place': 'Manchester, Greater Manchester, England'}
 output: <Response [500]>
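
Since deleting a few characters is enough to make the error go away, one way to narrow down the trigger is to shrink a failing input automatically. The sketch below is a simple greedy reducer, not part of T-Res. It assumes a hypothetical callable `fails` that returns True when the pipeline call raises (for example a small wrapper around `geoparser.run_text` with a try/except). The `demo_fails` predicate is only a stand-in so that the example runs on its own.

```python
from typing import Callable


def shrink_failing_input(text: str, fails: Callable[[str], bool]) -> str:
    """Greedily delete chunks of `text` for as long as `fails` stays True.

    The result still fails, and deleting any single character from it
    makes the failure disappear.
    """
    if not fails(text):
        raise ValueError("the starting input does not fail")
    chunk = max(len(text) // 2, 1)
    while True:
        i = 0
        shrunk = False
        while i < len(text):
            candidate = text[:i] + text[i + chunk:]
            if fails(candidate):
                text = candidate  # keep the deletion, retry at the same offset
                shrunk = True
            else:
                i += chunk  # this chunk is needed, move on
        if chunk > 1:
            chunk //= 2  # try finer-grained deletions next
        elif not shrunk:
            return text  # no single character can be removed any more


def demo_fails(s: str) -> bool:
    # Stand-in predicate (an assumption for this example only): pretend the
    # failure needs both a bullet and the token "FERRIS" to be present.
    return "•" in s and "FERRIS" in s


print(repr(shrink_failing_input(" • - ST G pOllO-P• FERRIS - • - , i ", demo_fails)))
```

Each call to `fails` runs the whole pipeline, so on real inputs this can be slow. It is meant for one-off debugging of a handful of sentences like the ones above.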

lukehare commented 1 year ago

Update: We have identified that the bug occurs if the character is passed to the candidate ranker in DeezyMatch. We think this is caused by an incorrect OCR model (w2v_ocr) used in the API deployment. We're looking into where this model came from, and assuming it is out-of-date, we will redeploy the API with the correct model asap.
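
For context, `IndexError: index out of range in self` is what PyTorch's embedding lookup raises when an input index does not correspond to a row in the embedding table, which would be consistent with a query being encoded against a vocabulary that does not match the deployed model. The thread does not show how DeezyMatch maps characters to indices, so the following is only a diagnostic sketch under that assumption: it lists the characters in a query that are missing from a given vocabulary. Both `unknown_characters` and `demo_vocab` are hypothetical names, not part of DeezyMatch or T-Res.

```python
from typing import Iterable, List


def unknown_characters(query: str, vocab: Iterable[str]) -> List[str]:
    """Return the characters of `query` that are not in `vocab`, in first-seen order."""
    known = set(vocab)
    missing: List[str] = []
    for ch in query:
        if ch not in known and ch not in missing:
            missing.append(ch)
    return missing


# Hypothetical vocabulary for the demo: ASCII letters, digits, space and hyphen.
demo_vocab = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -"

for query in ["FERRIS", "pOllO-P•", "o •ednes•lay"]:
    missing = unknown_characters(query, demo_vocab)
    status = "ok" if not missing else f"unseen characters: {missing}"
    print(f"{query!r}: {status}")
```

Running a check like this against the vocabulary that belongs to the deployed model, before calling the candidate ranker, could help confirm whether the model is the wrong one.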

fedenanni commented 1 year ago

Regarding this, @mcollardanuy suggests it might be because you created a "test" OCR model. Its name should be different from the one I have (it should be _test, see here), but maybe due to a bug this is not the case.