The task of solving Math Word Problems (MWPs) has received significant research attention in recent years. An MWP consists of a short natural language narrative that describes a state of the world and poses a question about some unknown quantities (see Table 1 for examples).
In this work, we show deficiencies in two benchmark datasets: ASDiv-A and MAWPS. We first show that existing models achieve reasonably high accuracies on these datasets even after the "question" part of the MWP is removed at test time. We further show that a simple model without any word-order information can also solve a majority of the MWPs in these datasets. Our experiments indicate that existing models rely on shallow heuristics to achieve high performance on benchmark MWP datasets.
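The two probes described above can be illustrated with a minimal sketch. This is not the code used in our experiments: it assumes that the question is the final sentence of the problem text, and the helper names strip_question and bag_of_words, as well as the example problem, are made up for illustration.

```python
import re
from collections import Counter


def strip_question(problem: str) -> str:
    """Drop the final sentence, assumed here to be the question."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    return " ".join(sentences[:-1])


def bag_of_words(problem: str) -> Counter:
    """Count lowercased tokens, discarding all word-order information."""
    return Counter(re.findall(r"[a-z0-9]+", problem.lower()))


# Hypothetical problem text, not taken from any of the datasets.
example = "Jack had 8 pens. Mary gave him 5 more pens. How many pens does Jack have now?"
print(strip_question(example))
# Two orderings of the same words give identical bag-of-words representations.
print(bag_of_words("Jack had 8 pens.") == bag_of_words("8 pens had Jack."))
```

A model that still predicts the right equation from the output of strip_question, or from a representation like bag_of_words, cannot be relying on the question or on word order.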
Our experiments show that the benchmark datasets are unreliable for measuring model performance. To enable more robust evaluation of automatic MWP solvers, we created a challenge set called "SVAMP". The examples in SVAMP test a model across different aspects of solving MWPs. Table 1 provides three examples from SVAMP that test whether a model is Question-sensitive, has robust reasoning ability, or is invariant to structural alterations, respectively.
The required packages are listed in SVAMP/code/requirements.txt
Install VirtualEnv using the following (optional):
$ [sudo] pip install virtualenv
Create and activate your virtual environment (optional):
$ virtualenv -p python3 venv
$ source venv/bin/activate
Install all the required packages. At SVAMP/code:
$ pip install -r requirements.txt
To create the relevant directories, run the following command in the directory of the model you want to use.
For example, at SVAMP/code/graph2tree:
$ sh setup.sh
Then transfer all the data folders to the data subdirectory of that model. For example, copy the MAWPS data directory, i.e. cv_mawps, from SVAMP/data to SVAMP/code/graph2tree/data/.
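If you prefer to script the copy step, the following is a minimal sketch. The helper copy_dataset is hypothetical and not part of the repository; it only assumes the directory layout described above.

```python
import shutil
from pathlib import Path


def copy_dataset(name: str, data_root: Path, model_dir: Path) -> Path:
    """Copy data_root/name into model_dir/data/name and return the new path."""
    source = data_root / name
    target = model_dir / "data" / name
    if not source.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {source}")
    # dirs_exist_ok (Python 3.8+) lets the copy be re-run without failing.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target


# Usage, from the directory that contains the SVAMP checkout:
# copy_dataset("cv_mawps", Path("SVAMP/data"), Path("SVAMP/code/graph2tree"))
```

The same call works for any other data folder and any other model directory.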
The current repository includes implementations of 5 models:
SVAMP/code/rnn_seq2seq
SVAMP/code/transformer_seq2seq
SVAMP/code/gts
SVAMP/code/graph2tree
SVAMP/code/constrained
We work with the following datasets:
mawps
asdiv-a
svamp
SVAMP/SVAMP.json
Data Size: 1000
A description of the individual data files in the SVAMP/data directory is given below:
SVAMP/data/cv_asdiv-a
SVAMP/data/cv_asdiv-a_without_questions
SVAMP/data/cv_mawps
SVAMP/data/cv_mawps_without_questions
SVAMP/data/mawps-asdiv-a_svamp
SVAMP/data/mawps-asdiv-a_svamp_without_questions
SVAMP/data/cv_svamp_augmented
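As a quick sanity check on the complete challenge set in SVAMP/SVAMP.json, the file can be loaded and counted. The sketch below assumes that the top level of the JSON file is a list with one entry per problem; the helper load_examples is hypothetical.

```python
import json
from pathlib import Path


def load_examples(path: Path) -> list:
    """Read a JSON file whose top level is assumed to be a list of problems."""
    with open(path, encoding="utf-8") as handle:
        examples = json.load(handle)
    if not isinstance(examples, list):
        raise ValueError(f"expected a JSON list in {path}")
    return examples


# Usage, from the directory that contains the SVAMP checkout:
# examples = load_examples(Path("SVAMP/SVAMP.json"))
# print(len(examples))  # the data size listed above is 1000
```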
The set of command line arguments available can be seen in the respective args.py file. Here, we illustrate running the experiment for cross validation of the ASDiv-A dataset using the Seq2Seq model. Follow the same methodology for running any experiment over any model.
If the folders for the 5 folds are kept as subdirectories inside the directory ../data/cv_asdiv-a (for example, the fold0 directory will have ../data/cv_asdiv-a/fold0/train.csv and ../data/cv_asdiv-a/fold0/dev.csv), then, at SVAMP/code/rnn_seq2seq:
$ python -m src.main -mode train -gpu 0 -embedding roberta -emb_name roberta-base -emb1_size 768 -hidden_size 256 -depth 2 -lr 0.0002 -emb_lr 8e-6 -batch_size 4 -epochs 50 -dataset cv_asdiv-a -full_cv -run_name run_cv_asdiv-a
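Before launching a long cross-validation run, it can help to confirm that every fold has its train.csv and dev.csv in place. The sketch below is not part of the repository: the helper missing_fold_files is hypothetical, and it assumes that the 5 folds are named fold0 through fold4, following the fold0 example above.

```python
from pathlib import Path


def missing_fold_files(cv_dir: Path, num_folds: int = 5) -> list:
    """Return the train/dev CSV paths that are absent under cv_dir."""
    missing = []
    for fold in range(num_folds):
        for filename in ("train.csv", "dev.csv"):
            # Assumed layout: cv_dir/fold<k>/train.csv and cv_dir/fold<k>/dev.csv
            path = cv_dir / f"fold{fold}" / filename
            if not path.is_file():
                missing.append(path)
    return missing


# Usage, at SVAMP/code/rnn_seq2seq:
# for path in missing_fold_files(Path("../data/cv_asdiv-a")):
#     print("missing:", path)
```

An empty result means the layout matches what the training command expects under these assumptions.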
If you use our data or code, please cite our work:
@inproceedings{patel-etal-2021-nlp,
title = "Are {NLP} Models really able to Solve Simple Math Word Problems?",
author = "Patel, Arkil and
Bhattamishra, Satwik and
Goyal, Navin",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.168",
doi = "10.18653/v1/2021.naacl-main.168",
pages = "2080--2094",
abstract = "The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered {``}solved{''} with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.",
}
For any clarification, comments, or suggestions please contact Arkil or Satwik.