Closed RishabhMaheshwary closed 3 years ago
We followed the procedure from the Graph2Tree paper (see section 3.1.2), which describes how to obtain the components of the Quantity Cells; these are what "group_nums" represents.
Note that their code repository does not provide a way to obtain these "group_nums" either. Moreover, if you look at the group_nums for the MAWPS dataset used by them (and by us, for the sake of consistency), you can see that the indexes are basically just a window of size 3 around each number in the sentence, plus the last 3 words. For SVAMP, we follow this, i.e. we include the window around each number and the last three words, but we also try to implement the steps mentioned in the paper, i.e. obtaining the associated nouns, adjectives, verbs etc. via dependency parsing with the Stanza library. This can be seen in the code snippet below:
import re

import stanza
from nltk.tokenize import word_tokenize

nlp_stanza = stanza.Pipeline(lang='en', processors='tokenize, pos, lemma, depparse')

def add_group_nums(sent):
    # Light normalisation before parsing
    sent = re.sub(r"-", r"", sent)
    sent = re.sub(r"mrs\.", r"mrs", sent)
    sent_nums = re.findall(r'\d*\.?\d+', sent)
    doc = nlp_stanza(sent)
    sent = word_tokenize(sent)
    final_ids = []
    assoc_nouns = []
    adjectives = []
    assoc_verbs = []
    rates = []
    # First pass: numbers, their neighbours, associated nouns and rate words
    offset = 0
    for s in doc.sentences:
        last_id = 0
        for word in s.words:
            if word.text in sent_nums:
                final_ids.append(offset + word.id-1)
                if offset + (word.id-1) - 1 >= 0 and sent[offset + (word.id-1) - 1] not in [',', '.', ';']:
                    final_ids.append(offset + (word.id-1) - 1)
                if offset + (word.id-1) + 1 < len(sent) and sent[offset + (word.id-1) + 1] not in [',', '.', ';']:
                    final_ids.append(offset + (word.id-1) + 1)
                # NOTE: 'nmode' is kept as in the original; the UD relation name is 'nmod'
                if word.deprel in ['nummod', 'nmode']:
                    assoc_nouns.append(s.words[word.head-1].text)
                    final_ids.append(offset + word.head-1)
            if word.text in ['each', 'every', 'per']:
                rates.append(word.text)
                final_ids.append(offset + word.id-1)
            last_id = word.id
        offset += last_id
    # Second pass: adjectives and verbs attached to the associated nouns
    # NOTE: offset is tracked but not added to the indices in this loop, as in the original
    offset = 0
    for s in doc.sentences:
        last_id = 0
        for word in s.words:
            if word.deprel == 'amod':
                if s.words[word.head-1].text in assoc_nouns:
                    adjectives.append(word.text)
                    final_ids.append(word.id-1)
            if word.text in assoc_nouns and word.deprel in ['obj', 'nsubj']:
                assoc_verbs.append(s.words[word.head-1].text)
                final_ids.append(word.head-1)
            last_id = word.id
        offset += last_id
    # The three tokens before the final token, skipping punctuation
    if len(sent)-4 >= 0 and sent[len(sent)-4] not in [',', '.', ';']:
        final_ids.append(len(sent)-4)
    if len(sent)-3 >= 0 and sent[len(sent)-3] not in [',', '.', ';']:
        final_ids.append(len(sent)-3)
    if len(sent)-2 >= 0 and sent[len(sent)-2] not in [',', '.', ';']:
        final_ids.append(len(sent)-2)
    return list(set(final_ids))
The function add_group_nums(sent) takes the MWP sentence (which has already been pre-processed to convert all number words like "thirty-seven" to numeric values like "37") as input and outputs the list of group_nums. You can also use this for new examples at test time.
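As a rough illustration of the window heuristic on its own (without any dependency parsing), the sketch below collects the indices around each number plus the three tokens before the final token. The helper name `window_group_nums` and the whitespace tokenization are assumptions made for this example only; they are not part of the function above or of the Graph2Tree code.

```python
import re

PUNCT = {',', '.', ';'}

def window_group_nums(tokens):
    """Hypothetical helper: size-3 window around each numeric token,
    plus the three tokens preceding the final token."""
    ids = set()
    for i, tok in enumerate(tokens):
        if re.fullmatch(r'\d*\.?\d+', tok):
            # the number itself and its left/right neighbours
            for j in (i - 1, i, i + 1):
                if 0 <= j < len(tokens) and tokens[j] not in PUNCT:
                    ids.add(j)
    # the three tokens before the final token, skipping punctuation
    for j in (len(tokens) - 4, len(tokens) - 3, len(tokens) - 2):
        if j >= 0 and tokens[j] not in PUNCT:
            ids.add(j)
    return sorted(ids)

tokens = "dan has 5 apples and 3 pears . how many fruits does he have ?".split()
print(window_group_nums(tokens))  # [1, 2, 3, 4, 5, 6, 11, 12, 13]
```

The real function additionally adds the heads found by the parser, so its output can contain more indices than this sketch.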
Thank you very much!!
Hello, I have a particular case: the last 3 tokens coincide with the window of size 3 around the last number, so after processing, the group_nums list is shorter than expected. Have you ever seen this case?
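The overlap can be reproduced with plain index arithmetic. Because add_group_nums returns list(set(final_ids)), an index contributed by both the number window and the end-of-sentence window is kept only once. The following is a minimal sketch with a made-up token list; bounds and punctuation checks are left out for brevity, which is an assumption that holds only for this particular example.

```python
tokens = "tom had 9 pens and lost 4 pens ?".split()

# Positions of numeric tokens (indices 2 and 6 here)
number_positions = [i for i, t in enumerate(tokens) if t.isdigit()]

final_ids = []
for i in number_positions:
    # size-3 window around each number
    final_ids.extend([i - 1, i, i + 1])

# the three tokens before the final token
n = len(tokens)
final_ids.extend([n - 4, n - 3, n - 2])

# deduplicate, as the original function does with list(set(...))
unique_ids = sorted(set(final_ids))
print(len(final_ids), len(unique_ids))  # 9 6
print(unique_ids)                       # [1, 2, 3, 5, 6, 7]
```

Here the window around the last number (5, 6, 7) is identical to the end-of-sentence window, so three of the nine collected indices disappear after deduplication.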