allenai / mmda

multimodal document analysis
Apache License 2.0

Include symbols btwn disjoints #277

geli-gel closed 10 months ago

geli-gel commented 10 months ago

Implements stringify.py's optional include_symbols_between_disjoint_spans. With the option enabled, matched_words are found from the overall start and end of the spangroup rather than from the words that directly overlap its spans. This will help with https://github.com/allenai/scholar/issues/36976, where we need mention text whose disjoint spans lie within body text, and the stringified result should include the missing "in-between" characters.
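The difference between the two matching strategies can be sketched in plain Python. This is only an illustration under assumptions, not the code in this PR: `Span`, `tokenize`, `words_overlapping_spans`, `words_within_envelope`, and this `stringify` are hypothetical stand-ins and do not use mmda's classes or its actual `stringify.py` signature. Words are modeled as character-offset spans, and a space is emitted only where the source text has a gap between consecutive words.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Hypothetical character span: start inclusive, end exclusive."""
    start: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def tokenize(text: str) -> List[Span]:
    # Toy tokenizer (an assumption): abbreviations like "al." stay whole,
    # other punctuation becomes its own word.
    pattern = r"[A-Za-z]+\.|\w+|[^\w\s]"
    return [Span(m.start(), m.end()) for m in re.finditer(pattern, text)]


def words_overlapping_spans(words: List[Span], spans: List[Span]) -> List[Span]:
    # Direct overlap: keep only words that touch one of the spans.
    return [w for w in words if any(overlaps(w, s) for s in spans)]


def words_within_envelope(words: List[Span], spans: List[Span]) -> List[Span]:
    # Start/end based: keep every word between the first span's start and
    # the last span's end, including symbols in the gaps.
    envelope = Span(min(s.start for s in spans), max(s.end for s in spans))
    return [w for w in words if overlaps(w, envelope)]


def stringify(text: str, words: List[Span]) -> str:
    # Join words, adding a space only where the source text has a gap.
    pieces: List[str] = []
    prev_end = None
    for w in sorted(words, key=lambda w: w.start):
        if prev_end is not None and w.start > prev_end:
            pieces.append(" ")
        pieces.append(text[w.start:w.end])
        prev_end = w.end
    return "".join(pieces)


def span_of(text: str, sub: str) -> Span:
    start = text.index(sub)
    return Span(start, start + len(sub))


text = "DOC (Takase et al., 2018), and"
words = tokenize(text)
# A citation mention made of two disjoint spans; the comma lies between them.
spans = [span_of(text, "Takase et al."), span_of(text, "2018")]

print(stringify(text, words_overlapping_spans(words, spans)))  # Takase et al. 2018
print(stringify(text, words_within_envelope(words, spans)))    # Takase et al., 2018
```

For a spangroup whose spans are contiguous, such as a sentence, both strategies select the same words, which is consistent with the identical sentence outputs in the results below.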

example results:

stringified citation_mention spangroup:
Takase et al. 2018
stringified including in-between symbols citation_mention spangroup:
Takase et al., 2018

its sentence: 
stringified sentence spangroup:
We evaluate our method on the current state of the art model, DOC (Takase et al., 2018), and the previous state of the art model, MoS (Yang et al., 2018), on the Penn Treebank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) language modeling datasets.
stringified including in-between symbols sentence spangroup:
We evaluate our method on the current state of the art model, DOC (Takase et al., 2018), and the previous state of the art model, MoS (Yang et al., 2018), on the Penn Treebank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) language modeling datasets.
stringified citation_mention spangroup:
Yang et al. 2018
stringified including in-between symbols citation_mention spangroup:
Yang et al., 2018

its sentence: 
stringified sentence spangroup:
We evaluate our method on the current state of the art model, DOC (Takase et al., 2018), and the previous state of the art model, MoS (Yang et al., 2018), on the Penn Treebank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) language modeling datasets.
stringified including in-between symbols sentence spangroup:
We evaluate our method on the current state of the art model, DOC (Takase et al., 2018), and the previous state of the art model, MoS (Yang et al., 2018), on the Penn Treebank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) language modeling datasets.
stringified citation_mention spangroup:
Marcus et al. 1993
stringified including in-between symbols citation_mention spangroup:
Marcus et al., 1993

its sentence: 
stringified sentence spangroup:
We evaluate our method on the current state of the art model, DOC (Takase et al., 2018), and the previous state of the art model, MoS (Yang et al., 2018), on the Penn Treebank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) language modeling datasets.
stringified including in-between symbols sentence spangroup:
We evaluate our method on the current state of the art model, DOC (Takase et al., 2018), and the previous state of the art model, MoS (Yang et al., 2018), on the Penn Treebank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) language modeling datasets.
stringified citation_mention spangroup:
Merity et al. 2017
stringified including in-between symbols citation_mention spangroup:
Merity et al., 2017

its sentence: 
stringified sentence spangroup:
We evaluate our method on the current state of the art model, DOC (Takase et al., 2018), and the previous state of the art model, MoS (Yang et al., 2018), on the Penn Treebank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) language modeling datasets.
stringified including in-between symbols sentence spangroup:
We evaluate our method on the current state of the art model, DOC (Takase et al., 2018), and the previous state of the art model, MoS (Yang et al., 2018), on the Penn Treebank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) language modeling datasets.
stringified citation_mention spangroup:
Merity et al. 2018
stringified including in-between symbols citation_mention spangroup:
Merity et al., 2018

its sentence: 
stringified sentence spangroup:
In addition, we present results for finetuned (Merity et al., 2018) models, with and without the Partial Shuffle.
stringified including in-between symbols sentence spangroup:
In addition, we present results for finetuned (Merity et al., 2018) models, with and without the Partial Shuffle.