Open sindhuattaiger opened 3 years ago
For grouping words into entities, I am not sure whether a model exists for this, but you can use a rule-based approach that groups entities sitting close together.
A code example is shown in this inference script. An example image of the output is here. Both are sourced from this Medium article.
This is a good start, but I recommend tweaking it to use something more robust, such as a depth-first search (DFS), to make sure no entity groups overlap. Try replacing those lines from GitHub with something like the following:
# filtered_words, height, width, and merged_taggings are assumed to be defined by the surrounding inference script
df_words = filtered_words.copy()
v_threshold = int(.01 * height)  # vertical tolerance: 1% of the page height
h_threshold = int(.08 * width)   # horizontal tolerance: 8% of the page width
visited = set()

def dfs(i, merged):
    visited.add(i)
    merged.append(df_words[i])
    for j in range(len(df_words)):
        if j not in visited:
            w1 = df_words[i]['words'][0]
            w2 = df_words[j]['words'][0]
            # merge when the two entries are vertically aligned, share a label, and are horizontally close
            if (abs(w1['box'][1] - w2['box'][1]) < v_threshold or abs(w1['box'][-1] - w2['box'][-1]) < v_threshold) \
                    and (df_words[i]['label'] == df_words[j]['label']) \
                    and (abs(w1['box'][0] - w2['box'][0]) < h_threshold or abs(w1['box'][-2] - w2['box'][-2]) < h_threshold):
                dfs(j, merged)
    return merged

for i in range(len(df_words)):
    if i not in visited:
        merged_taggings.append(dfs(i, []))
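If you want to try the idea outside the inference script, here is a minimal, self-contained sketch of the same grouping rule. It is not the code from the article: it swaps the recursion for an explicit stack, since Python's default recursion limit (typically 1000) could be reached on pages with very long chains of words. The function names `is_close` and `group_entities`, the `text` key, and the sample page are all hypothetical and only there for illustration. Boxes are assumed to be `[x0, y0, x1, y1]`.

```python
from typing import Dict, List


def is_close(a: Dict, b: Dict, v_threshold: int, h_threshold: int) -> bool:
    """Same rule as the snippet above: same label, vertically aligned, horizontally near."""
    box_a = a["words"][0]["box"]  # assumed layout: [x0, y0, x1, y1]
    box_b = b["words"][0]["box"]
    aligned = (abs(box_a[1] - box_b[1]) < v_threshold
               or abs(box_a[3] - box_b[3]) < v_threshold)
    near = (abs(box_a[0] - box_b[0]) < h_threshold
            or abs(box_a[2] - box_b[2]) < h_threshold)
    return a["label"] == b["label"] and aligned and near


def group_entities(entries: List[Dict], height: int, width: int) -> List[List[Dict]]:
    """Group entries into connected components with an iterative depth-first search."""
    v_threshold = int(0.01 * height)
    h_threshold = int(0.08 * width)
    visited = set()
    groups = []
    for start in range(len(entries)):
        if start in visited:
            continue
        visited.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(entries[i])
            for j in range(len(entries)):
                if j not in visited and is_close(entries[i], entries[j],
                                                 v_threshold, h_threshold):
                    visited.add(j)  # mark on push so each entry joins exactly one group
                    stack.append(j)
        groups.append(group)
    return groups


# Hypothetical sample: a 1000 x 1000 page with four labelled words.
sample = [
    {"label": "QUESTION", "words": [{"text": "Invoice", "box": [100, 100, 150, 120]}]},
    {"label": "QUESTION", "words": [{"text": "Number", "box": [160, 101, 230, 121]}]},
    {"label": "ANSWER", "words": [{"text": "12345", "box": [240, 100, 300, 120]}]},
    {"label": "QUESTION", "words": [{"text": "Date", "box": [100, 500, 150, 520]}]},
]

groups = group_entities(sample, height=1000, width=1000)
for group in groups:
    print(group[0]["label"], [entry["words"][0]["text"] for entry in group])
```

Because every index is marked as visited when it is pushed onto the stack, an entry can only end up in one group, which is what prevents overlapping entity groups. The 1% and 8% thresholds are the ones from the snippet above and will probably need tuning for your documents.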
The LayoutLM model is able to capture the entity class at word level. How do we group words based on entity?