
ShapeWorld

Getting started

git clone --recursive https://github.com/AlexKuhnle/ShapeWorld.git
pip3 install -e .  # optional: .[full] or .[full-gpu]

Table of contents

- About ShapeWorld
- Example data
- Integration into Python code
- Stand-alone data generation
- Loading extracted data
- CLEVR and NLVR interface
- Evaluation and example models

About ShapeWorld

ShapeWorld is a framework for specifying generators of abstract, visually grounded language data (or purely visual data).

The main motivation behind ShapeWorld is to provide a new testbed and evaluation methodology for visually grounded language understanding, particularly aimed at deep learning models. It differs from standard evaluation datasets in two ways: Firstly, data is randomly sampled during training and evaluation according to constraints specified by the experimenter. Secondly, its focus of evaluation is on linguistic understanding capabilities of the type investigated by formal semantics. In this context, the ShapeWorld tasks can be thought of as unit-testing multimodal models for specific linguistic generalization abilities -- similar to, for instance, the bAbI tasks of Weston et al. (2015) for text-only understanding.

The code is written in Python 3. The data can either be obtained as NumPy arrays within a Python module, and hence integrated into deep learning projects based on common frameworks like TensorFlow, PyTorch or Theano, or it can be extracted into separate files. Both options are described further below. Language generation requires the Python package pydmrs (Copestake et al., 2016).

I am interested in hearing about any applications you plan to use the ShapeWorld data for. In particular, let me know if you have an idea you would like to investigate with such abstract data but which the current setup does not support -- I am happy to find a way to make it happen collaboratively.

Contact: alexkuhnle (at) t-online.de

If you use ShapeWorld in your work, please cite:

ShapeWorld: A new test methodology for multimodal language understanding (arXiv)
Alexander Kuhnle and Ann Copestake (April 2017)

Example data

Command lines for generation can be found here.

Caption agreement datasets

Classification datasets

Integration into Python code

The easiest way to use the ShapeWorld data in your Python project is to call it directly from the code. Whenever a batch of training/evaluation instances is required, dataset.generate(...) is called with the respective arguments, so generation happens in parallel with training/testing. Below is an example of how to generate a batch of 128 training instances. See also the example models below.

from shapeworld import Dataset

dataset = Dataset.create(dtype='agreement', name='existential')
generated = dataset.generate(n=128, mode='train', include_model=True)

print('world shape:', dataset.world_shape())
print('caption shape:', dataset.vector_shape(value_name='caption'))
print('vocabulary size:', dataset.vocabulary_size(value_type='language'))
print('vocabulary:', dataset.vocabulary)

# caption surface forms
print('first few captions:')
print('\n'.join(dataset.to_surface(value_type='language', word_ids=generated['caption'][:5])))

# given to the image caption agreement model
batch = (generated['world'], generated['caption'], generated['caption_length'], generated['agreement'])

# can be used for more specific evaluation
world_model = generated['world_model']
caption_model = generated['caption_model']

Alternatively, dataset.iterate(...) returns a batch generator, with the optional argument iterations specifying a fixed number of iterations:

from shapeworld import Dataset

dataset = Dataset.create(dtype='agreement', name='existential')
# the iterations argument is optional
for batch in dataset.iterate(n=64, mode='train', include_model=True, iterations=5):
    print(len(batch['world']))

The agreement datasets offer a parameter worlds_per_instance or captions_per_instance (mutually exclusive) to generate multiple worlds/captions per instance. To actually retrieve these alternatives, the alternatives flag has to be set:

from shapeworld import Dataset

dataset = Dataset.create(dtype='agreement', name='existential', captions_per_instance=3)

generated = dataset.generate(n=1)
print('caption:', generated['caption'][0])  # one caption

generated = dataset.generate(n=1, alternatives=True)
print('world:', type(generated['world'][0]))  # one world
print('captions:', ', '.join(str(caption) for caption in generated['caption'][0]))  # three captions
print('agreements:', ', '.join(str(agreement) for agreement in generated['agreement'][0]))  # three agreement values

Stand-alone data generation

The shapeworld/generate.py module provides options to generate ShapeWorld data as separate files via the command line. Use cases include:

The following command line arguments are available:

When creating larger amounts of ShapeWorld data, it is advisable to store the data in a compressed archive (for example, -a tar:bzip2). For instance, the following command line generates one million training instances (100 shards of 10k instances each) of the existential dataset:

python generate.py -d [DIRECTORY] -a tar:bzip2 -t agreement -n existential -m train -s 100 -i 10k -M

For the purpose of this introduction, we generate a smaller number of training (both as TensorFlow records and raw), validation and test instances, using the default configuration of the dataset:

python generate.py -d examples/readme -a tar:bzip2 -t agreement -n existential -v readme -s 3,2,1 -i 100 -M -T

Loading extracted data

Extracted data can be loaded and accessed through the same Dataset interface as before; simply set the config argument to the data directory [DIRECTORY]:

from shapeworld import Dataset

dataset = Dataset.create(dtype='agreement', name='existential', variant='readme', config='examples/readme')
generated = dataset.generate(n=128, mode='train')

In addition to the batch generator dataset.iterate(...), loaded datasets offer an epoch batch generator via dataset.epoch(...), which terminates after one pass over the entire dataset (with the last batch potentially being smaller than the specified n):

from shapeworld import Dataset

dataset = Dataset.create(dtype='agreement', name='existential', variant='readme', config='examples/readme')
for batch in dataset.epoch(n=64, mode='train', include_model=True):
    print(len(batch['world']))  # 64, 64, 64, 64, 44 (300 overall)

Loading the data in Python and then feeding it to a model is comparatively slow. By using TensorFlow (TF) records (see above for how to generate them) and hence the ability to load data within TensorFlow itself, models can be trained significantly faster. ShapeWorld provides utilities to access TF records in the same way as generated/loaded data:

from shapeworld import Dataset, tf_util

dataset = Dataset.create(dtype='agreement', name='existential', variant='readme', config='examples/readme')
generated = tf_util.batch_records(dataset=dataset, mode='train', batch_size=128)

The generated tensor cannot be evaluated immediately, as the TF queue runners have to be started first:

import tensorflow as tf

with tf.Session() as session:
    coordinator = tf.train.Coordinator()
    queue_threads = tf.train.start_queue_runners(sess=session, coord=coordinator)

    # session calls, for instance:
    batch = session.run(fetches=generated)

    coordinator.request_stop()
    coordinator.join(threads=queue_threads)

CLEVR and NLVR interface

CLEVR can be obtained as follows (alternatively, replace CLEVR_v1.0 with CLEVR_CoGenT_v1.0 for the CLEVR CoGenT dataset):

wget https://s3-us-west-1.amazonaws.com/clevr/CLEVR_v1.0.zip
unzip CLEVR_v1.0.zip
rm CLEVR_v1.0.zip

ShapeWorld then provides a basic interface to load the CLEVR instances in order of their appearance in the dataset. It is hence recommended to 'pre-generate' the entire dataset (70k training, 15k validation and 15k test instances) once through the ShapeWorld interface, either as clevr_classification or clevr_answering dataset type, and subsequently access it as you would load other pre-generated ShapeWorld datasets:

python generate.py -d [SHAPEWORLD_DIRECTORY] -a tar:bzip2 -t clevr_classification -n clevr -s 140,30,30 -i 500 -M -T --config-values --directory CLEVR_v1.0
rm -r CLEVR_v1.0

Accordingly, in the case of CLEVR CoGenT:

python generate.py -d [SHAPEWORLD_DIRECTORY] -a tar:bzip2 -t clevr_classification -n clevr -s 140,30,30 -i 500 -M -T --config-values --directory CLEVR_CoGenT_v1.0 --parts '["A", "A", "A"]'
python generate.py -d [SHAPEWORLD_DIRECTORY] -a tar:bzip2 -t clevr_classification -n clevr -s 0,30,30 -i 500 -M -T --config-values --directory CLEVR_CoGenT_v1.0 --parts '["A", "B", "B"]'
rm -r CLEVR_CoGenT_v1.0
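
The pre-generated data can then be loaded through the same Dataset interface as other extracted ShapeWorld datasets. A minimal sketch, assuming the instances above were written to the hypothetical directory examples/clevr:

from shapeworld import Dataset

# load the pre-generated CLEVR data; 'examples/clevr' stands in for the
# [SHAPEWORLD_DIRECTORY] passed to generate.py above
dataset = Dataset.create(dtype='clevr_classification', name='clevr', config='examples/clevr')
generated = dataset.generate(n=128, mode='train')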

As a clevr_classification dataset, it provides:

As a clevr_answering dataset, the last value is replaced by:

Equivalently, NLVR can be obtained via:

git clone https://github.com/cornell-lic/nlvr.git

Again, one should 'pre-generate' the entire dataset (75k training, 6k validation and 6k test instances) as nlvr_agreement dataset type, and subsequently access it via the ShapeWorld load interface:

python generate.py -d [SHAPEWORLD_DIRECTORY] -a tar:bzip2 -t nlvr_agreement -n nlvr -s 25,2,2 -i 3k -M -T --config-values --directory nlvr
rm -r nlvr
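
The pre-generated NLVR data is then accessed via the same load interface. A minimal sketch, assuming the hypothetical target directory examples/nlvr:

from shapeworld import Dataset

# load the pre-generated NLVR data; 'examples/nlvr' stands in for the
# [SHAPEWORLD_DIRECTORY] passed to generate.py above
dataset = Dataset.create(dtype='nlvr_agreement', name='nlvr', config='examples/nlvr')
generated = dataset.generate(n=128, mode='train')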

The dataset provides:

Evaluation and example models

The models/ directory contains a few example models based on TFMacros, my collection of TensorFlow macros. The scripts train.py and evaluate.py provide the following command line arguments to train and evaluate these models ((t): train-only, (e): evaluate-only):

For instance, the following command line trains an image caption agreement system on the existential dataset:

python train.py -t agreement -n existential -m cnn_bow -i 5k

The previously generated data (here: TF records) can be loaded in the same way as described above for loading data in Python code:

python train.py -t agreement -n existential -v readme -c examples/readme -m cnn_bow -i 10 -T --model-dir [MODEL_DIRECTORY]
python evaluate.py -t agreement -n existential -v readme -c examples/readme -m cnn_bow -i 10 --model-dir [MODEL_DIRECTORY]