arkilpatel / SVAMP

NAACL 2021: Are NLP Models really able to Solve Simple Math Word Problems?
MIT License

How to generate group_nums? #3

Closed RishabhMaheshwary closed 3 years ago

RishabhMaheshwary commented 3 years ago
  1. How are the group_nums generated for SVAMP datasets?
  2. What is the procedure to generate group_nums for new examples at test time?
arkilpatel commented 3 years ago

We followed the procedure from the Graph2Tree paper (see Section 3.1.2), which describes how to obtain the components of the Quantity Cells; these are what "group_nums" represents.

Note that their code repository does not show how to obtain these "group_nums" either. Moreover, if you look at the group_nums for the MAWPS dataset used by them (and by us, for the sake of consistency), you can see that the indices are basically just a window of size 3 around each number in the sentence, plus the last 3 words. For SVAMP we follow this, i.e. we include the window around each number and the last three words, but we also try to implement the steps mentioned in the paper, i.e. obtaining the associated nouns, adjectives, verbs, etc. via dependency parsing. This can be seen in the code snippet below (we use the Stanza library to parse the sentences):

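import re

import stanza
from nltk.tokenize import word_tokenize  # assumed: word_tokenize is NLTK's tokenizer

# Stanza pipeline with the processors needed for dependency parsing
nlp_stanza = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')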

def add_group_nums(sent):
    sent = re.sub(r"-", r"", sent)  # drop hyphens
    sent = re.sub(r"mrs\.", r"mrs", sent)  # drop the period after "mrs"
    sent_nums = re.findall(r'\d*\.?\d+', sent)  # integer and decimal literals
    doc = nlp_stanza(sent)
    sent = word_tokenize(sent)

    final_ids = []
    assoc_nouns = []
    adjectives = []
    assoc_verbs = []
    rates = []

    offset = 0

    for s in doc.sentences:
        last_id = 0
        for word in s.words:
            if word.text in sent_nums:
                final_ids.append(offset + word.id-1)
                if offset + (word.id-1) - 1 >= 0 and sent[offset + (word.id-1) - 1] not in [',', '.', ';']:
                    final_ids.append(offset + (word.id-1) - 1)
                if offset + (word.id-1) + 1 < len(sent) and sent[offset + (word.id-1) + 1] not in [',', '.', ';']:
                    final_ids.append(offset + (word.id-1) + 1)
                if word.deprel in ['nummod', 'nmod']:
                    assoc_nouns.append(s.words[word.head-1].text)
                    final_ids.append(offset + word.head-1)
            if word.text in ['each', 'every', 'per']:
                rates.append(word.text)
                final_ids.append(offset + word.id-1)
            last_id = word.id
        offset += last_id

    offset = 0

    for s in doc.sentences:
        last_id = 0
        for word in s.words:
            if word.deprel == 'amod':
                if s.words[word.head-1].text in assoc_nouns:
                    adjectives.append(word.text)
                    # sentence-local ids are shifted by offset, as in the first pass
                    final_ids.append(offset + word.id-1)
            if word.text in assoc_nouns and word.deprel in ['obj', 'nsubj']:
                assoc_verbs.append(s.words[word.head-1].text)
                final_ids.append(offset + word.head-1)
            last_id = word.id
        offset += last_id

    if len(sent)-4 >= 0 and sent[len(sent)-4] not in [',', '.', ';']:
        final_ids.append(len(sent)-4)
    if len(sent)-3 >= 0 and sent[len(sent)-3] not in [',', '.', ';']:
        final_ids.append(len(sent)-3)
    if len(sent)-2 >= 0 and sent[len(sent)-2] not in [',', '.', ';']:
        final_ids.append(len(sent)-2)

    return list(set(final_ids))

The function add_group_nums(sent) takes the MWP sentence (already pre-processed so that number words like "thirty-seven" are converted to numeric values like "37") as input and returns the list of group_nums. You can also use it for new examples at test time.

RishabhMaheshwary commented 3 years ago

Thank you very much!

TrieuLe0801 commented 1 year ago

Hello, I have a particular case: the last 3 tokens are the same as the window of size 3 around the last number, so after processing, the group_nums list is too short. Have you ever seen this case?
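
The case described here can be reproduced with plain index arithmetic. The following is a minimal sketch under stated assumptions: the helper names window_ids and tail_ids are hypothetical, punctuation filtering is ignored, and the tail covers positions len-4 to len-2 as in the snippet earlier in the thread.

```python
def window_ids(num_pos, n_tokens):
    """Hypothetical helper: the number's index and its two neighbours."""
    return {i for i in (num_pos - 1, num_pos, num_pos + 1) if 0 <= i < n_tokens}


def tail_ids(n_tokens):
    """Hypothetical helper: the three positions before the final token."""
    return {i for i in (n_tokens - 4, n_tokens - 3, n_tokens - 2) if i >= 0}


# A 10-token question whose last number sits at position 7
window = window_ids(7, 10)   # {6, 7, 8}
tail = tail_ids(10)          # {6, 7, 8}
print(len(window | tail))    # 3 distinct indices instead of 6
```

Because add_group_nums returns list(set(final_ids)), overlapping indices are stored only once, so the list is shorter whenever the last number sits close to the end of the question.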