allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

Let datasets know how to load themselves #378

Closed: DeNeutoy closed this 7 years ago

DeNeutoy commented 7 years ago

The idea behind this PR is to make the loading of datasets more flexible.

At the moment, the only way to load data is to have an instance contained within a line of a single file. This is fine for some types of data, but is very inefficient and clunky for others, such as data which naturally spans multiple files (e.g. the Wikipedia bio dataset) or sequence tagging datasets which use windows of text for training (e.g. language modeling).

This change requires subclassing Dataset for each type of instance, but allows 3 new things:

Eventually, we could also remove Instance.read_from_line, replacing it with Dataset subclasses which know how to read that particular instance (or variants of it) from a directory. We only actually use the Dataset.read_from_file method in one place in Trainer, so we can just pass the dataset class in the parameter file and call the respective subclass's method to load data.
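For concreteness, here is a minimal sketch of what such a Dataset subclass could look like - the names (MultiFileDataset, read_from_directory, the file names) are hypothetical and just illustrate the idea of building instances from more than one file:

```python
import os
from typing import List, NamedTuple


class PairInstance(NamedTuple):
    sentence: str
    annotation: str


class Dataset:
    def __init__(self, instances: List[PairInstance]):
        self.instances = instances


class MultiFileDataset(Dataset):
    """Builds its instances from two line-aligned files in one directory,
    instead of requiring every instance to live on a single line of one file."""

    @classmethod
    def read_from_directory(cls, directory: str) -> "MultiFileDataset":
        with open(os.path.join(directory, "sentences.txt")) as sentence_file:
            sentences = [line.rstrip("\n") for line in sentence_file]
        with open(os.path.join(directory, "annotations.txt")) as annotation_file:
            annotations = [line.rstrip("\n") for line in annotation_file]
        return cls([PairInstance(s, a) for s, a in zip(sentences, annotations)])
```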

Comments/things I inevitably haven't thought of appreciated!

DeNeutoy commented 7 years ago

Thinking a bit more about this: if we go with this idea of having configurable datasets, we should make read_from_file non-static and have users instantiate the Dataset, which then fills itself, rather than having the method return a new instance of the class. It's a bit confusing to have a dataset with attributes which are used via Params in read_from_file before they are set in __init__.
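A small sketch of that non-static version (the names are made up for illustration): configuration happens in __init__, and read_from_file fills the already-constructed object rather than returning a new one.

```python
class WindowedTextDataset:
    def __init__(self, window_size: int = 35):
        # Parameters are set here, so read_from_file never touches
        # attributes before they have been initialised.
        self.window_size = window_size
        self.instances = []

    def read_from_file(self, path: str) -> None:
        # Fills this already-constructed object in place, rather than
        # returning a brand-new Dataset.
        with open(path) as text_file:
            tokens = text_file.read().split()
        for start in range(0, len(tokens) - self.window_size + 1, self.window_size):
            self.instances.append(tokens[start:start + self.window_size])
```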

matt-gardner commented 7 years ago

This is related to #328. And, actually, I think having DatasetReaders is a better solution than having the Dataset class read the data directly. The reason for this is that you can have two datasets that perform the same task, but have different formats. You don't want the model code to have to know if it needs a SquadDataset, a WdwDataset, or a TriviaQaDataset object - that would get really messy, and require you to change modeling code whenever you add a new dataset. If you have a separate dataset reader script that puts each particular dataset into a common format, the modeling code can just know it needs a CharacterSpanInstance, and point to files that are formatted correctly.
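To make the contrast concrete, here is a rough sketch of the kind of dataset reader script being described - the tab-separated output layout is invented for illustration and is not deep_qa's actual CharacterSpanInstance format:

```python
import json


def squad_to_instance_lines(squad_json_path: str, output_path: str) -> None:
    """Converts SQuAD's nested json into one (made-up) tab-separated instance
    per line, so the model code only ever sees the common format."""
    with open(squad_json_path) as squad_file:
        squad = json.load(squad_file)
    with open(output_path, "w") as out:
        for article in squad["data"]:
            for paragraph in article["paragraphs"]:
                passage = paragraph["context"].replace("\t", " ").replace("\n", " ")
                for qa in paragraph["qas"]:
                    answer = qa["answers"][0]
                    span_start = answer["answer_start"]
                    span_end = span_start + len(answer["text"])
                    out.write(f"{qa['question']}\t{passage}\t{span_start},{span_end}\n")
```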

You could argue that you could put the dataset instance selection code into Trainer or TextTrainer, so the model itself doesn't have to know about the particulars of datasets, but then you still have to specify the class name in the json parameter file, and you have to allow for third-party users to have a way of adding their dataset class into that code, which is messy. It's a lot cleaner to have a third-party user just write their own dataset reader script to put their instances into a common format.

Addressing your three points:

  1. The code actually already permits this. That's the whole point of the BackgroundInstance, and why the API requires you to specify a list of files. Maybe this could be cleaner, and we could have done this for the SentenceSelectionInstances also, but it's already modular.

  2. The scala DatasetReader code is already parameterized, and the python reader code can be too, so nothing is lost on this point.

  3. The common intermediate format is actually important, not a bug, as described above. I think the barrier you're seeing to using the library is that the dataset reader code is in scala, not contained in this repo. I agree with you on that point.

I think the right way forward is to take the classes you've added here and convert them into DatasetReaders in the readers module. What do you think?

DeNeutoy commented 7 years ago

I see what you mean about the problem of having the Trainer know about the dataset; I hadn't thought of that issue. We do still have this problem with Models anyway - could we solve it in the same way, by allowing users to pass a Dataset class to run_model if they want to use one which is not supported? I think this is more a question of how to support json file specifications - clearly, if you want to use this code as a library and add models/classes, at some point you will not be able to specify json parameters directly, unless you either modify them after loading with actual classes or add to our dicts of allowed types for different things.

In terms of BackgroundInstance, that's true, it lets you do this, but it also breaks from the API of how every other instance is read from files - i.e. they are added to a Dataset after reading the main instance type, not through calling read_from_file.

The problem I see with having DatasetReaders (presuming they stay as scripts) is that a user can't reuse a dataset with slightly different parameters, such as window_span for LM, without re-running their script and writing a new dataset file. If I want to do language modelling on all of Wikipedia, I really don't want to have to keep three copies of Wikipedia with different window span lengths written into text files (overblown example, but you get what I mean).

matt-gardner commented 7 years ago

> or you add to our dicts of allowed types for different things

Yeah, this was how I envisioned someone adding, e.g., a tuple matcher in third-party code. You import the dictionary, add a class, and then use methods from run.py. You could also do this with concrete models, if you wanted to.
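In code, the pattern might look roughly like this - everything here is a self-contained, illustrative stand-in rather than deep_qa's actual module layout or dictionary names:

```python
# Library side: a registry mapping the strings allowed in the json parameter
# file to concrete classes.
concrete_datasets = {}


def dataset_from_params(params: dict):
    dataset_type = params.pop("type", "text")
    return concrete_datasets[dataset_type](**params)


# Third-party side: import the dictionary, add your class, then drive things
# through the library's run methods as usual.
class MyDataset:
    def __init__(self, window_size: int = 35):
        self.window_size = window_size


concrete_datasets["my_dataset"] = MyDataset
dataset = dataset_from_params({"type": "my_dataset", "window_size": 50})
```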

> In terms of BackgroundInstance, that's true, it lets you do this, but it also breaks from the API of how every other instance is read from files

Yeah, it's not the cleanest thing, but I don't think it breaks the API much: https://github.com/allenai/deep_qa/blob/master/deep_qa/models/memory_networks/memory_network.py#L140-L142.

> three copies of Wikipedia

This is a good point, and a use case I hadn't thought of. I don't think it's a huge deal to have several copies of SQuAD around, for instance, in part because it's hard to imagine another set of parameters, except in modifying SQuAD for use in a different task, like answer sentence selection.

Also, for language modeling, you probably don't want to load the whole dataset into memory anyway (because it wouldn't fit), you want to scan across a directory in a data generator, so the whole data processing code would need to be rethought, or just bypassed, for the language modeling case.
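A sketch of the kind of generator meant here (hypothetical code, not something that exists in the repo): rather than materialising a Dataset, scan the files in a directory and yield fixed-size windows lazily.

```python
import os
from typing import Iterator, List


def window_generator(directory: str, window_size: int) -> Iterator[List[str]]:
    """Yields fixed-size token windows one at a time, scanning across every
    file in the directory, so the full corpus is never held in memory."""
    for filename in sorted(os.listdir(directory)):
        with open(os.path.join(directory, filename)) as text_file:
            tokens = text_file.read().split()
        for start in range(0, len(tokens) - window_size + 1, window_size):
            yield tokens[start:start + window_size]
```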

Another consideration is that for quick iteration time in running test experiments, you typically want to have as little data processing done in your experiment code as possible. Hence #352, which would cut out ~10 minutes of waiting when trying to debug something on the full SQuAD dataset. If we move to loading the data as you're suggesting, we add processing time on top of what we already have.

So I think for all tasks we have except language modeling, the dataset reader approach is sufficient, and preferable to having Dataset classes for each dataset format. Are there other tasks that are problematic for dataset readers?

DeNeutoy commented 7 years ago

@tusharkhot here is the discussion following chatting about loading in json graphs.

DeNeutoy commented 7 years ago

I agree that in general this is not a problem. However, I could imagine that this would make manipulating datasets really easy for weird and specific formulations which you want to try - imagine if you want to know how Bidaf performs on "What" questions. If the dataset can be passed parameters in the json file, you could just run an evaluation passing a what_questions_only flag to the dataset or something.
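As a sketch of that - the flag name comes from the example above, the filter itself is invented, and instances are assumed to be simple dicts - the json parameter file would just carry one more key on the dataset, something like "what_questions_only": true, which the dataset applies after loading:

```python
def filter_instances(instances, what_questions_only: bool = False):
    # With configurable datasets, an evaluation-time filter is just another
    # parameter the dataset applies after loading.
    if not what_questions_only:
        return instances
    return [instance for instance in instances
            if instance["question"].lower().startswith("what")]
```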

I don't quite see why this is incompatible with serialising pre-indexed and tokenised data, though. Part of that serialisation would be the dataset parameters, which are just loaded along with the pickled instances / padded text data file. I agree that it wouldn't be compatible with different dataset parameters, but if you are changing these, you're probably running longer jobs where the amount of time spent padding/indexing is inconsequential.

matt-gardner commented 7 years ago

What's the context on "loading in json graphs"?

Ok, so, in the interest of doing highest-priority work first, what are the actual problems with the current data loading code that we want to solve right now? It looks like you're thinking of doing your seq2seq stuff in DeepQA, is that right?

tusharkhot commented 7 years ago

I think this PR is talking about a broader problem than what I was talking to Mark about. I was referring to the problem that currently the code assumes that every instance should fit in one line. With tuples/graphs encoded using JSON, we could still have a single instance in one line, at the cost of readability. So this is not too much of a problem for me currently, but it can be a problem, e.g. what if my strings have newlines and I can't strip/replace them, since my model uses these newlines?
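For reference, a small sketch of that json-per-line encoding (the graph fields are made up): json.dumps escapes embedded newlines, so even strings containing them stay on a single line and round-trip back.

```python
import json

instance = {
    "nodes": ["plants", "sunlight", "energy"],
    "edges": [["plants", "absorb", "sunlight"], ["sunlight", "becomes", "energy"]],
    "evidence": "Plants absorb sunlight.\nThey convert it to energy.",
}
line = json.dumps(instance)   # a single line: the embedded newline is escaped as \n
assert "\n" not in line
restored = json.loads(line)   # the newline inside the string comes back intact
assert restored["evidence"] == instance["evidence"]
```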

So, I was wondering if you need the restriction of one line == one instance. Based on the discussion here, it seems that there is no such restriction; it is just not the common-case scenario and is harder to do. (Right?)

DeNeutoy commented 7 years ago

This wasn't really with the intention of doing seq2seq stuff; I was more looking at getting the OntoNotes data for SRL in as an Instance. It is already formatted as a table, but with the annotations separate from the sentences which they annotate. This is a reasonable format which there is no need to force into another one, apart from the fact that I can't read from multiple files using Instance.read_from_line.

Also, I personally find the data loading part confusing, and people trying to add new data instances to DeepQA have found it difficult - or maybe not, and that is just my impression.

Concrete problems:

matt-gardner commented 7 years ago

Ok - doing an SNLI loader and talking about the WikiBio dataset made me think you were going for the seq2seq stuff. Having now thought about this a bit more, I think these two options aren't mutually exclusive. For datasets with different representations on disk, we still need to convert them into a common Instance type, so those aren't going away. The only question is how to read the data from disk. We have two proposed solutions:

  1. the current way, which requires that every Instance type be read from a single line, and that we have DatasetReader scripts, which convert datasets into the common single-line format. This is limiting in the kinds of processing you can do, but it simplifies the code that reads in the data - you don't have to know about particular dataset readers in model code (it also simplifies some data creation pipeline stuff, but that's all in the scala code, anyway).
  2. Allow Dataset subclasses to read in their instances from a filesystem path, likely bypassing the read_from_line methods on Instances entirely. This allows for more flexibility in how datasets are stored and loaded, and removes the need for DatasetReader scripts (largely containing the same code, just putting it in a different place), but it requires additional complexity in the modeling code.

Both of these seem reasonable, and it's pretty easy to allow both - just add a key to the json file that specifies the dataset type, with TextDataset being the default, and let the dataset class load stuff however it wants, defaulting to option (1) above. It'd also be pretty easy to add DatasetReader scripts that just call the Dataset.read_from_path methods and save the instances in a one-line-per-instance format.
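A sketch of that last point (as_line and the surrounding names are assumptions, not existing API): a DatasetReader script then becomes a thin wrapper that loads via read_from_path and writes out the one-line-per-instance file.

```python
def convert_to_lines(dataset, output_path: str) -> None:
    # `dataset` is any Dataset whose instances know how to render themselves
    # as a single line (an `as_line` method is assumed here for illustration).
    with open(output_path, "w") as out:
        for instance in dataset.instances:
            out.write(instance.as_line() + "\n")


# Usage, assuming a SquadDataset along the lines sketched earlier:
#   dataset = SquadDataset(params)
#   dataset.read_from_path("/path/to/squad/")
#   convert_to_lines(dataset, "squad_train.tsv")
```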

So, yes, feel free to do it this way. Sorry for being ornery about it, and thanks for pushing back on my orneriness. This is a fine solution to the problem. But don't put all of the Dataset objects in the same file - that would get really big really fast. Not sure if it'd be overkill to use the same format we have for instances and models, organizing things by task and then by dataset, but it'd at least be consistent.

DeNeutoy commented 7 years ago

Ah yes, that's true, let's do that. Yeah, I only put them in the same file as I wasn't sure if you would go for this, so it was minimal effort. I was thinking of putting them in the same files as their respective instances, so that all the code you need to read in order to understand how an instance can be loaded for a given type is in one file? I guess also renaming them from instances to ... data_types or something?

matt-gardner commented 7 years ago

Either we could group them with the instances they create, or we could group them by task and have files with each dataset name. The latter is probably better, because a single dataset could be used to produce different kinds of instances (e.g., SQuAD for QA, answer sentence selection, or question generation), but all need the same loading code. Also the latter would be easier for someone browsing the code to see if we can load a dataset they're thinking of. This also argues against grouping them by task, actually, because a single dataset can be used for more than one task. So maybe just a module under data called datasets, with files like squad.py, snli.py, etc.? The trouble is that we already have datasets.py, so I'm not sure what to do there...