amazon-science / tanl

Structured Prediction as Translation between Augmented Natural Languages
Apache License 2.0
130 stars 25 forks source link

Bug in augment_sentence function #4

Closed xiangc2 closed 3 years ago

xiangc2 commented 3 years ago

Hi, there seems to be a small bug in augment_sentence function in utils.py. When the root of the entity tree is an entity with tags, those tags won't be augmented onto the output. For example, when I run the code above:

from utils import augment_sentence

tokens = ['Tolkien', 'was', 'born', 'here']
augmentations =  [
        ([('person',), ('born in', 'here')], 0, 1),
        ([('location',)], 3, 4),
    ]

# example from the test set of conll03 NER
tokens = ['Premier', 'league']
augmentations = [([('miscellaneous',)], 0, 2)]

begin_entity_token = "["
sep_token = "|"
relation_sep_token = "="
end_entity_token = "]"

augmented_output = augment_sentence(tokens, augmentations, begin_entity_token, sep_token, relation_sep_token, end_entity_token)
print(augmented_output)

It prints out Premier league instead of [ Premier league | miscellaneous ]. This happened because in line 124 (utils.py), the value of the root in entity tree is reset to an empty list. My quick fix of this is initializing the start index of the root as -1. That is changing line 103 in utils.py to

root = (None, -1, len(tokens))   # this node represents the entire sentence

It would be great if someone could let me know if I am correct on this. Thanks!

giove91 commented 3 years ago

Hi, you are right. Thanks for spotting the problem and providing a simple solution!

xiangc2 commented 3 years ago

Hi,

I just found that my solution above is incomplete. There also have to be changes made in expand_tokens to recover the index from -1 to 0. To be more specific, I add

i = max(i, 0)

above line 39 and

start = max(start, 0)

before line 45 in utils.py. I am sorry for not fully testing my solution before posting.

Best, Xiang

giove91 commented 3 years ago

Thanks for pointing this out!

I re-wrote augment_sentence and expand_tokens indexing the nodes by integers (their positions in the augmentations list) instead of spans. The root is now -1. This implementation should be cleaner, as it clearly distinguishes between the root and an entity that spans the whole sentence.