NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License
9.15k stars 1.42k forks

Word Grouping for Entities #10

Open sindhuattaiger opened 3 years ago

sindhuattaiger commented 3 years ago

The LayoutLM model is able to capture the entity class at word level. How do we group words based on entity?

jamcdon4 commented 1 year ago

I am not sure about a model for word entity grouping, but you can use a rule-based approach that groups same-label entities that sit close together.

A code example is shown in this inference script. An example image of the output is here. Both are sourced from this Medium article.

This is a good start, but I recommend tweaking it to use something more robust, such as a depth-first search (DFS), to make sure no entity groups overlap. Try replacing those lines from GitHub with something like the following:

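# filtered_words, height and width come from the linked inference script.
df_words = filtered_words.copy()
v_threshold = int(.01 * height)  # vertical tolerance: 1% of the page height
h_threshold = int(.08 * width)   # horizontal tolerance: 8% of the page width
visited = set()
merged_taggings = []  # assumed not to be defined earlier in the script

def dfs(i, merged):
    # Collect every entry reachable from i through same-label neighbours.
    visited.add(i)
    merged.append(df_words[i])

    for j in range(len(df_words)):
        if j not in visited:
            w1 = df_words[i]['words'][0]
            w2 = df_words[j]['words'][0]

            # Neighbours share a label, have top or bottom edges within
            # v_threshold, and have left or right edges within h_threshold.
            if (abs(w1['box'][1] - w2['box'][1]) < v_threshold or abs(w1['box'][-1] - w2['box'][-1]) < v_threshold) \
                and (df_words[i]['label'] == df_words[j]['label']) \
                and (abs(w1['box'][0] - w2['box'][0]) < h_threshold or abs(w1['box'][-2] - w2['box'][-2]) < h_threshold):
                dfs(j, merged)
    return merged

# Every entry that has not been visited yet starts a new group.
for i in range(len(df_words)):
    if i not in visited:
        merged_taggings.append(dfs(i, []))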
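
If you want to try the idea outside that script, here is a minimal self-contained sketch of the same grouping rule. It is not the code from the linked script: `group_entities`, `make_entry`, and the sample data are hypothetical names made up for illustration, and it assumes each box is `[x0, y0, x1, y1]` in pixels. It also swaps the recursion for an explicit stack, so a page with many words cannot hit Python's recursion limit.

```python
def group_entities(entries, v_threshold, h_threshold):
    """Group same-label entries whose boxes are close (iterative DFS)."""

    def is_neighbour(a, b):
        # Assumed box layout: [x0, y0, x1, y1].
        b1, b2 = a['words'][0]['box'], b['words'][0]['box']
        same_line = (abs(b1[1] - b2[1]) < v_threshold
                     or abs(b1[3] - b2[3]) < v_threshold)
        close = (abs(b1[0] - b2[0]) < h_threshold
                 or abs(b1[2] - b2[2]) < h_threshold)
        return a['label'] == b['label'] and same_line and close

    visited = set()
    groups = []
    for start in range(len(entries)):
        if start in visited:
            continue
        visited.add(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            for j in range(len(entries)):
                if j not in visited and is_neighbour(entries[i], entries[j]):
                    visited.add(j)  # mark on push so no index is queued twice
                    stack.append(j)
        # Return members in input order rather than traversal order.
        groups.append([entries[i] for i in sorted(group)])
    return groups


def make_entry(text, label, box):
    # Hypothetical helper that mimics the entry structure used above.
    return {'label': label, 'words': [{'text': text, 'box': box}]}


# Made-up sample data on an assumed 1000 x 1000 pixel page.
sample = [
    make_entry('Invoice', 'HEADER', [10, 10, 60, 20]),
    make_entry('Number', 'HEADER', [65, 10, 110, 20]),
    make_entry('Total', 'QUESTION', [10, 200, 40, 210]),
    make_entry('42.00', 'ANSWER', [45, 200, 80, 210]),
]
height, width = 1000, 1000
groups = group_entities(sample,
                        v_threshold=int(.01 * height),
                        h_threshold=int(.08 * width))
for g in groups:
    print(g[0]['label'], [e['words'][0]['text'] for e in g])
```

Because every index is marked as visited exactly once, each entry ends up in exactly one group, which is what keeps the groups from overlapping. The two thresholds are the same page-relative values as above and will probably need tuning per document type.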