Living-with-machines / T-Res

A Toponym Resolution Pipeline for Digitised Historical Newspapers

Refactor the Linker #276

Open thobson88 opened 1 week ago

thobson88 commented 1 week ago

We should consider implementing the different linking algorithms in subclasses of Linker.

This would avoid redundant dispatch logic based on string parameters, like:

def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
    if self.method == "mostpopular":
        return self.most_popular(dict_mention)

    if self.method == "bydistance":
        return self.by_distance(dict_mention)

    raise SyntaxError(f"Unknown method provided: {self.method}")

Instead, the run method would be implemented differently in two subclasses: MostPopularLinker and ByDistanceLinker.
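A minimal sketch of what that subclassing could look like (the class and method names follow the snippet above, but the bodies are toy stand-ins, not the project's actual linking logic):

```python
from abc import ABC, abstractmethod
from typing import Tuple


class Linker(ABC):
    """Each linking algorithm lives in its own subclass, so run()
    needs no string dispatch on self.method."""

    @abstractmethod
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        ...


class MostPopularLinker(Linker):
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        # Toy stand-in for the real most_popular() logic: pick the
        # candidate with the highest (hypothetical) popularity score.
        candidates = dict_mention["candidates"]
        best = max(candidates, key=candidates.get)
        return best, candidates[best], candidates


class ByDistanceLinker(Linker):
    def run(self, dict_mention: dict) -> Tuple[str, float, dict]:
        # Toy stand-in for the real by_distance() logic: pick the
        # candidate with the lowest (hypothetical) distance score.
        candidates = dict_mention["candidates"]
        best = min(candidates, key=candidates.get)
        return best, candidates[best], candidates
```

Each subclass can then hold only the state its own algorithm needs.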

Also, the train_load_model method, which only makes sense when the method is reldisamb, would then live in another subclass, RelLinker.

Ideally we would then improve the conditional logic in the run_disambiguation method in pipeline.py, so that the three linker methods (reldisamb, mostpopular and bydistance) are treated consistently.
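One possible way to make that conditional logic in pipeline.py uniform is a small registry mapping the existing method names to Linker subclasses. This is only a sketch: the stub classes below stand in for the real implementations, and RelLinker is the hypothetical subclass proposed above.

```python
# Stub subclasses standing in for the real implementations:
class MostPopularLinker: ...
class ByDistanceLinker: ...
class RelLinker: ...  # would also own train_load_model()


# Registry mapping the existing string method names to subclasses,
# so all three linker methods are treated consistently:
LINKER_CLASSES = {
    "mostpopular": MostPopularLinker,
    "bydistance": ByDistanceLinker,
    "reldisamb": RelLinker,
}


def make_linker(method: str):
    """Instantiate the right Linker subclass, failing early on typos."""
    try:
        return LINKER_CLASSES[method]()
    except KeyError:
        raise ValueError(f"Unknown method provided: {method}") from None
```

With this shape, the string comparison happens exactly once, at construction time, instead of on every call to run.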

thobson88 commented 6 days ago

Related: in the Pipeline class there is some duplicated code.

Specifically, the code inside run_sentence from the comment:

        # If the linking method is "reldisamb", rank and format candidates,
        # and produce a prediction:

to the end of the method, is almost identical to that in the run_disambiguation method from the same comment onwards.

The only difference (apart from post-processing being optional) is, in run_sentence:

                # Run entity linking per mention:
                selected_cand = self.mylinker.run(
                    {
                        "candidates": wk_cands[mention["mention"]],
                        "place_wqid": place_wqid,
                    }
                )

versus, in run_disambiguation:

                # Run entity linking per mention:
                selected_cand = self.mylinker.run(
                    {
                        "candidates": wk_cands[mention["mention"]],
                        "place_wqid": "",
                    }
                )

(Note that this difference only affects the "mostpopular" & "bydistance" linker methods, not the "reldisamb" method.)
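Since the two blocks differ only in the value passed as place_wqid, they could collapse into one shared helper that takes it as a parameter, with run_sentence passing the real value and run_disambiguation passing "". A sketch (the helper name is invented, not part of the current API):

```python
def link_mentions(mylinker, wk_cands, mentions, place_wqid=""):
    """Run entity linking per mention.

    place_wqid defaults to "" for the run_disambiguation case, where
    no place of publication is available.
    """
    selected = []
    for mention in mentions:
        selected.append(
            mylinker.run(
                {
                    "candidates": wk_cands[mention["mention"]],
                    "place_wqid": place_wqid,
                }
            )
        )
    return selected
```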


Similarly, the methods run_text and run_text_recognition contain some duplicated logic. Both start with this:

        # Split the text into its sentences:
        sentences = split_text_into_sentences(text, language="en")

        document_dataset = []
        for idx, sentence in enumerate(sentences):
            # Get context (prev and next sentence)
            context = ["", ""]
            if idx - 1 >= 0:
                context[0] = sentences[idx - 1]
            if idx + 1 < len(sentences):
                context[1] = sentences[idx + 1]

Then run_text_recognition calls run_sentence_recognition, whereas run_text calls run_sentence. But run_sentence itself begins with a call to run_sentence_recognition, so the logic is semi-duplicated, with some (potentially important) differences.
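The shared prefix of run_text and run_text_recognition could likewise be factored into a single helper that pairs each sentence with its (previous, next) context, mirroring the duplicated loop above (the helper name is illustrative):

```python
def sentences_with_context(sentences):
    """Yield (idx, sentence, [prev, next]) triples, using "" at the
    document boundaries, exactly as the duplicated loop does."""
    for idx, sentence in enumerate(sentences):
        context = ["", ""]
        if idx - 1 >= 0:
            context[0] = sentences[idx - 1]
        if idx + 1 < len(sentences):
            context[1] = sentences[idx + 1]
        yield idx, sentence, context
```

Both run_text and run_text_recognition would then iterate over this generator and differ only in what they do per sentence.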

thobson88 commented 6 days ago

Apart from avoiding code duplication, the aim in refactoring the Pipeline will be to make it a thin wrapper around the Ranker and Linker, rather than containing its own "business logic" as it currently does.

It should be possible to run the components of the pipeline separately by calling methods on the Ranker and Linker, not the Pipeline, and get exactly the same results as running the full pipeline. Currently that is not possible.
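In that design the Pipeline reduces to pure composition. A hedged sketch of the target shape (the Ranker/Linker interfaces here are assumptions for illustration, not the current API):

```python
class Pipeline:
    """Thin wrapper: no business logic of its own, so calling the
    Ranker and Linker directly must give identical results."""

    def __init__(self, myranker, mylinker):
        self.myranker = myranker
        self.mylinker = mylinker

    def run(self, mentions):
        # Just compose the components, nothing else:
        candidates = self.myranker.run(mentions)
        return [self.mylinker.run(cand) for cand in candidates]
```

The property described above then holds by construction: Pipeline.run(x) is exactly mylinker.run applied to each element of myranker.run(x).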

thobson88 commented 3 days ago

There may be a similar issue relating to candidate ranking: currently the Ranker class is used when ranking with DeezyMatch, but when the linker method is reldisamb an additional ranking step is applied in the Pipeline class (here and here).
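If the Ranker were refactored the same way, that extra step could move out of the Pipeline into a Ranker subclass. A purely illustrative sketch (the class names and the toy top-k step are assumptions, not the project's actual reldisamb ranking):

```python
class Ranker:
    def run(self, candidates):
        # Base ranking (e.g. by match score), best first:
        return sorted(candidates, key=candidates.get, reverse=True)


class RelDisambRanker(Ranker):
    """Owns the additional step currently applied in pipeline.py."""

    def __init__(self, top_k=3):
        self.top_k = top_k

    def run(self, candidates):
        # Apply the base ranking, then the reldisamb-specific step
        # (illustrated here as truncating to the top-k candidates):
        return super().run(candidates)[: self.top_k]
```

The Pipeline would then call ranker.run once, with no method-specific branches.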