Open lukehare opened 1 year ago
Couldn't there be a regex search-and-replace for something like this? I think that's what Defoe does with some of this stuff...
[a-zA-Z]+$
Obviously, you might want to include - in there still...
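A minimal sketch of that regex idea, assuming the goal is to blank out any character outside a small allow-list before the text reaches the pipeline. The allow-list and the `clean_text` name are illustrative assumptions, not something taken from the repository or from Defoe. Each disallowed character is replaced one-for-one with a space so that character offsets such as `pos` and `end_pos` stay aligned with the original text.

```python
import re

# Hypothetical allow-list: ASCII letters, digits, whitespace and some basic
# punctuation, including the hyphen. Anything else (e.g. "•") is disallowed.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,;:'\-]")


def clean_text(text: str) -> str:
    """Replace every disallowed character with a single space.

    The replacement is one-for-one, so the output has the same length as
    the input and character offsets are preserved.
    """
    return _DISALLOWED.sub(" ", text)


sample = " • - ST G pOllO-P• FERRIS - • - , i "
cleaned = clean_text(sample)
print(repr(cleaned))
```

Whether to drop, replace, or keep such characters is a design choice: replacing with a space keeps offsets stable, while deleting them would shift the positions reported for each mention.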
I'll look into it to understand exactly at which point of the pipeline this happens, as it might be that it's either the ner or deezymatch or REL crashing, and based on that we can decide how to handle it. But I agree with @kallewesterling that we can then quickly fix it with a regex.
@fedenanni @lukehare @kallewesterling I'd guess this is caused by the tokenizer. In this case, it should be straightforward to add special tokens.
@lukehare I'm looking into it (see the work in progress PR: https://github.com/Living-with-machines/toponym-resolution/pull/177) but, from a first test, the bug does not seem to be in the pipeline. I have just added this test and the text goes through the entire pipeline without an issue.
Can you check if the issue is on the API side?
I am still seeing the error, unfortunately. It looks from my logs that it is coming from DeezyMatch / the candidate_ranker. I have tried it via the API and running locally and I get the same result. Interestingly though it doesn't appear to specifically be because of the • character, as I have been able to get it to work by slightly changing the input text (deleting some characters) but leaving that character in.
See logs:
>>> resolved = geoparser.run_text(
...     " • - ST G pOllO-P• FERRIS - • - , i ",
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lukehare/toponym-resolution/geoparser/pipeline.py", line 226, in run_text
    sentence_dataset = self.run_sentence(
  File "/home/lukehare/toponym-resolution/geoparser/pipeline.py", line 149, in run_sentence
    wk_cands, self.myranker.already_collected_cands = self.myranker.find_candidates(mentions)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 372, in find_candidates
    cands, self.already_collected_cands = self.run(queries)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 349, in run
    return self.deezy_on_the_fly(queries)
  File "/home/lukehare/toponym-resolution/geoparser/ranking.py", line 287, in deezy_on_the_fly
    candidates = candidate_ranker(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/candidateRanker.py", line 327, in candidate_ranker
    tmp_dirname = query_vector_gen(query, model, train_vocab, dl_inputs, verbose)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/utils_candidate_ranker.py", line 60, in query_vector_gen
    test_model(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/rnn_networks.py", line 594, in test_model
    pred = model(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/DeezyMatch/rnn_networks.py", line 878, in forward
    x1_embs_not_packed = self.emb(x1_seq)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/home/lukehare/.cache/pypoetry/virtualenvs/resolution-jsCpgHtO-py3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
Whereas this works...
>>> resolved = geoparser.run_text("• - ST G pOllO-P• FERR")
>>> resolved
[{'mention': 'G', 'candidates': {'Q133083': 0.985, None: 0.316}, 'ner_score': 0.579, 'pos': 7, 'sent_idx': 0, 'end_pos': 8, 'tag': 'LOC', 'sentence': '• - ST G pOllO-P• FERR', 'prediction': 'Q133083', 'ed_score': 0.985, 'latlon': [-26.0, 28.0], 'wkdt_class': 'Q191093'}]
Other examples that failed:
{'sentence': ' BY HER LETTERS W PATENT, corner of Deansgate, and B • RatNo 1, BLACKFRIARSeSTREET, Agents to the Corporation', 'place': 'Manchester, Greater Manchester, England'}
output: <Response [500]>
{'sentence': ' - experience, Who wil! take every precaution tO promote theitealth tte View to tako plaee o •ednes•lay, July sth', 'place': 'Manchester, Greater Manchester, England'}
output: <Response [500]>
{'sentence': " 5, N • ' Buildings, Market-street", 'place': 'Manchester, Greater Manchester, England'}
output: <Response [500]>
{'sentence': ' built expressly for the Liverpo a • - York trade • al , is equ • JODWOODS T A K E S, JULY 26TH, 1848', 'place': 'Manchester, Greater Manchester, England'}
output: <Response [500]>
Update: We have identified that the bug occurs if the • character is passed to the candidate ranker in DeezyMatch. We think this is caused by an incorrect OCR model (w2v_ocr) used in the API deployment. We're looking into where this model came from, and assuming it is out-of-date, we will redeploy the API with the correct model asap.
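Until the model is redeployed, one possible guard is to check each query against the characters the model was trained on before it reaches the candidate ranker. The sketch below is pure Python and only illustrates the idea: `TRAIN_VOCAB`, `unknown_chars`, and `split_queries` are hypothetical names, and the vocabulary shown is an assumption, not the contents of the real DeezyMatch vocabulary file.

```python
import string

# Hypothetical training vocabulary. In practice this would be loaded from
# the vocabulary that ships with the deployed model.
TRAIN_VOCAB = set(string.ascii_letters + string.digits + " -.,'")


def unknown_chars(query, vocab):
    """Return the sorted list of characters in `query` that are not in `vocab`."""
    return sorted({ch for ch in query if ch not in vocab})


def split_queries(queries, vocab):
    """Separate queries into those safe to rank and those with unseen characters."""
    safe, skipped = [], {}
    for query in queries:
        missing = unknown_chars(query, vocab)
        if missing:
            skipped[query] = missing  # record which characters caused the skip
        else:
            safe.append(query)
    return safe, skipped


safe, skipped = split_queries(["FERRIS", "pOllO-P•", "Market-street"], TRAIN_VOCAB)
print(safe)
print(skipped)
```

Skipped queries could be logged or returned with an empty candidate list instead of producing a 500 response. This works around the symptom only; it does not replace deploying the correct model.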
The • character appears relatively frequently in our newspaper data, and the toponym resolution pipeline doesn't know how to handle it. This causes the API to return an error.
E.g. Input:
Output: