Open thobson88 opened 1 week ago
Related: the Pipeline class contains some duplicated code.
Specifically, the code inside run_sentence, from the comment:
# If the linking method is "reldisamb", rank and format candidates,
# and produce a prediction:
to the end of the method, is almost identical to the code in the run_disambiguation method from the same comment onwards.
The only difference (apart from post-processing being optional) is, in run_sentence:
# Run entity linking per mention:
selected_cand = self.mylinker.run(
    {
        "candidates": wk_cands[mention["mention"]],
        "place_wqid": place_wqid,
    }
)
versus, in run_disambiguation:
# Run entity linking per mention:
selected_cand = self.mylinker.run(
    {
        "candidates": wk_cands[mention["mention"]],
        "place_wqid": "",
    }
)
(Note that this difference only affects the "mostpopular" & "bydistance" linker methods, not the "reldisamb" method.)
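Since the two call sites differ only in the value passed as place_wqid, one way to remove the duplication would be a single shared helper with place_wqid defaulting to the empty string. The following is a minimal sketch, not the existing code: link_mentions and StubLinker are hypothetical names introduced for illustration, and the stub only echoes its input so that the sketch runs on its own.

```python
from typing import Any, Dict, List


class StubLinker:
    """Hypothetical stand-in for the real Linker, so the sketch is runnable."""

    def run(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        # Echo the payload so the effect of place_wqid is visible.
        return payload


def link_mentions(
    linker: Any,
    mentions: List[Dict[str, Any]],
    wk_cands: Dict[str, Any],
    place_wqid: str = "",
) -> List[Any]:
    """Run entity linking per mention (hypothetical shared helper).

    run_sentence would pass its place_wqid; run_disambiguation would rely
    on the default "", matching the two snippets quoted above.
    """
    results = []
    for mention in mentions:
        selected_cand = linker.run(
            {
                "candidates": wk_cands[mention["mention"]],
                "place_wqid": place_wqid,
            }
        )
        results.append(selected_cand)
    return results
```

With a helper like this, both methods would reduce to one call each, and the only remaining difference between them would be visible as an argument.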
Similarly, the methods run_text and run_text_recognition contain some duplicated logic. Both start with this:
# Split the text into its sentences:
sentences = split_text_into_sentences(text, language="en")
document_dataset = []
for idx, sentence in enumerate(sentences):
    # Get context (prev and next sentence)
    context = ["", ""]
    if idx - 1 >= 0:
        context[0] = sentences[idx - 1]
    if idx + 1 < len(sentences):
        context[1] = sentences[idx + 1]
Then run_text_recognition makes a call to run_sentence_recognition, whereas run_text calls run_sentence. But run_sentence itself begins with a call to run_sentence_recognition. So the logic is semi-duplicated, with some (potentially important) differences.
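The shared opening loop could be factored into a small generator that both run_text and run_text_recognition iterate over. This is only a sketch: iter_sentences_with_context is a hypothetical name, and it takes an already-split list of sentences so that it does not depend on split_text_into_sentences.

```python
from typing import Iterator, List, Tuple


def iter_sentences_with_context(
    sentences: List[str],
) -> Iterator[Tuple[int, str, List[str]]]:
    """Yield (index, sentence, [previous sentence, next sentence]).

    Missing neighbours are represented by "", as in the duplicated loop.
    """
    for idx, sentence in enumerate(sentences):
        # Get context (prev and next sentence)
        context = ["", ""]
        if idx - 1 >= 0:
            context[0] = sentences[idx - 1]
        if idx + 1 < len(sentences):
            context[1] = sentences[idx + 1]
        yield idx, sentence, context
```

Each caller would then differ only in what it does with the yielded sentence and context, which makes the remaining differences between the two methods explicit.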
Apart from avoiding code duplication, the aim in refactoring the Pipeline will be to make it a thin wrapper around the Ranker and Linker, rather than containing its own "business logic" as it currently does.
It should be possible to run the components of the pipeline separately by calling methods on the Ranker and Linker, not the Pipeline, and get exactly the same results as running the full pipeline. Currently that is not possible.
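As a rough illustration of the "thin wrapper" goal, the sketch below shows a pipeline whose run method only delegates, so that calling the components directly gives the same result. Everything here is assumed for illustration: StubRanker, StubLinker, ThinPipeline, their method signatures and the candidate format are hypothetical and are not the real Ranker and Linker APIs.

```python
from typing import Dict, List


class StubRanker:
    """Hypothetical ranker: maps each mention to scored candidates."""

    def run(self, mentions: List[str]) -> Dict[str, Dict[str, float]]:
        return {m: {m.upper(): 1.0, m.lower(): 0.5} for m in mentions}


class StubLinker:
    """Hypothetical linker: picks the highest-scoring candidate."""

    def run(self, payload: dict) -> str:
        candidates = payload["candidates"]
        return max(candidates, key=candidates.get)


class ThinPipeline:
    """Delegates to the ranker and linker; holds no logic of its own."""

    def __init__(self, ranker: StubRanker, linker: StubLinker) -> None:
        self.ranker = ranker
        self.linker = linker

    def run(self, mentions: List[str], place_wqid: str = "") -> Dict[str, str]:
        wk_cands = self.ranker.run(mentions)
        return {
            m: self.linker.run(
                {"candidates": wk_cands[m], "place_wqid": place_wqid}
            )
            for m in mentions
        }
```

Under this design, a test can assert that the full pipeline and the step-by-step calls agree, which is the property described above.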
We should consider implementing the different linking algorithms in subclasses of Linker. This would avoid redundant logic that branches on string parameters such as the linking method name.
Instead, the run method would be implemented differently in two subclasses: MostPopularLinker and ByDistanceLinker.
Also, the train_load_model method, which only makes sense if the method is reldisamb, would then live in (another) subclass, RelLinker.
Ideally we would then improve the conditional logic in the run_disambiguation method in pipeline.py, so that the three linker methods (reldisamb, mostpopular and bydistance) are treated consistently.
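A minimal sketch of the proposed hierarchy might look like the following. The class names come from the proposal above; everything else is an assumption made for illustration, including the candidate format (a dict from Wikidata ID to a popularity score), the injected distance_fn, and the selection rules, none of which are taken from the existing implementation.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Optional


class Linker(ABC):
    """Base class: each linking method overrides run()."""

    @abstractmethod
    def run(self, payload: dict) -> Optional[str]:
        ...


class MostPopularLinker(Linker):
    def run(self, payload: dict) -> Optional[str]:
        # Assumed format: {wqid: popularity score}.
        candidates: Dict[str, float] = payload["candidates"]
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


class ByDistanceLinker(Linker):
    def __init__(self, distance_fn: Callable[[str, str], float]) -> None:
        # Hypothetical: distance_fn(wqid, place_wqid) -> distance.
        self.distance_fn = distance_fn

    def run(self, payload: dict) -> Optional[str]:
        candidates: Dict[str, float] = payload["candidates"]
        if not candidates:
            return None
        place = payload["place_wqid"]
        return min(candidates, key=lambda wqid: self.distance_fn(wqid, place))


class RelLinker(Linker):
    def train_load_model(self) -> None:
        # Only this subclass needs a trained model; body omitted in this sketch.
        raise NotImplementedError

    def run(self, payload: dict) -> Optional[str]:
        raise NotImplementedError
```

The pipeline could then call self.mylinker.run(...) without checking which method is in use, since the choice is made once, when the linker object is constructed.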