Thinking a bit more about this: if we go with the idea of having configurable datasets, we should make `read_from_file` non-static and have users instantiate a `Dataset` which then fills itself, rather than having it return a new instance of the class. It's a bit confusing to have a dataset with attributes which are used via `Params` in `read_from_file` before they are set in `__init__`.
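To make that concrete, here is a rough sketch of the two styles, using a plain dict to stand in for `Params`; this is illustrative, not the actual deep_qa code.

```python
# A rough sketch of the two styles, using a plain dict to stand in for Params;
# this is illustrative, not the actual deep_qa code.

class StaticStyleDataset:
    """Current style: a class method builds and returns a new Dataset."""
    def __init__(self, instances):
        self.instances = instances

    @classmethod
    def read_from_file(cls, filename, params):
        # Parameters (e.g. lowercasing) are consumed here, before any instance
        # of the class exists - this is the part that feels confusing.
        lowercase = params.get("lowercase", False)
        with open(filename) as data_file:
            lines = [line.strip() for line in data_file]
        return cls([line.lower() if lowercase else line for line in lines])


class ConfigurableDataset:
    """Proposed style: configure in __init__, then fill the instance."""
    def __init__(self, params):
        self.lowercase = params.get("lowercase", False)
        self.instances = []

    def read_from_file(self, filename):
        # By the time we get here, all attributes are already set.
        with open(filename) as data_file:
            for line in data_file:
                line = line.strip()
                self.instances.append(line.lower() if self.lowercase else line)
```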
This is related to #328. And, actually, I think having `DatasetReaders` is a better solution than having the `Dataset` class read the data directly. The reason for this is that you can have two datasets that perform the same task, but have different formats. You don't want the model code to have to know if it needs a `SquadDataset`, a `WdwDataset`, or a `TriviaQaDataset` object - that would get really messy, and require you to change modeling code whenever you add a new dataset. If you have a separate dataset reader script that puts each particular dataset into a common format, the modeling code can just know it needs a `CharacterSpanInstance`, and point to files that are formatted correctly.
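As a rough, hypothetical sketch of what such a reader script might look like (the tab-separated output format here is made up; only the SQuAD JSON structure is real):

```python
# Hypothetical reader script: it knows about SQuAD's own JSON layout and writes
# the common one-line-per-instance format. The tab-separated output format is
# made up for illustration; only the SQuAD JSON structure is real.
import json

def squad_to_common_format(squad_json_path: str, output_path: str) -> None:
    with open(squad_json_path) as squad_file:
        data = json.load(squad_file)
    with open(output_path, "w") as out_file:
        for article in data["data"]:
            for paragraph in article["paragraphs"]:
                passage = paragraph["context"].replace("\n", " ")
                for qa in paragraph["qas"]:
                    answer = qa["answers"][0]
                    span_start = answer["answer_start"]
                    span_end = span_start + len(answer["text"])
                    # One instance per line: everything a character-span
                    # instance needs, regardless of the original dataset.
                    out_file.write(f"{qa['question']}\t{passage}\t{span_start},{span_end}\n")
```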
You could argue that you could put the dataset instance selection code into `Trainer` or `TextTrainer`, so the model itself doesn't have to know about the particulars of datasets, but then you still have to specify the class name in the json parameter file, and you have to allow for third-party users to have a way of adding their dataset class into that code, which is messy. It's a lot cleaner to have a third-party user just write their own dataset reader script to put their instances into a common format.
Addressing your three points:
1. The code actually already permits this. That's the whole point of the `BackgroundInstance`, and why the API requires you to specify a list of files. Maybe this could be cleaner, and we could have done this for the `SentenceSelectionInstances` also, but it's already modular.
2. The Scala `DatasetReader` code is already parameterized, and the Python reader code can be too, so nothing is lost on this point.
3. The common intermediate format is actually important, not a bug, as described above. I think the barrier you're seeing to using the library is that the dataset reader code is in Scala, not contained in this repo. I agree with you on that point.

I think the right way forward is to take the classes you've added here and convert them into `DatasetReaders` in the readers module. What do you think?
I see what you mean about the problem of having the `Trainer` know about the dataset; I hadn't thought of that issue. We do still have this problem with `Models` anyway - could we solve it in the same way, by allowing users to pass a `Dataset` class to `run_model` if they want to use one which is not supported? I think this is more of a question of how to support json file specifications - clearly, if you want to use this code as a library and add models/classes, at some point you will not be able to specify json parameters directly, unless you modify them after loading with actual classes, or you add to our dicts of allowed types for different things.
In terms of `BackgroundInstance`, that's true, it lets you do this, but it also breaks from the API of how every other instance is read from files - i.e. they are added to a `Dataset` after reading the main instance type, not through calling `read_from_file`.
The problem I see with having `DatasetReaders` (presuming they stay as scripts) is that a user can't reuse a dataset with slightly different parameters, such as `window_span` for LM, without re-running their script and writing a new dataset file. If I want to do language modelling on all of Wikipedia, I really don't want to have to have 3 copies of Wikipedia with different window span lengths written into text files (overblown example, but you get what I mean).
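For concreteness, something like the following is what I have in mind - the class name and parameters are hypothetical, not existing deep_qa classes:

```python
# Hypothetical Dataset subclass doing the window splitting on the fly, so one
# copy of the corpus on disk can serve any span length. The class name and
# parameters are made up for illustration.

class WindowedLanguageModellingDataset:
    def __init__(self, span_length: int, stride: int = None):
        self.span_length = span_length
        self.stride = stride or span_length
        self.instances = []

    def read_from_file(self, filename: str) -> None:
        # Assumes whitespace-tokenised text; real code would share the library's tokeniser.
        with open(filename) as data_file:
            tokens = data_file.read().split()
        last_start = max(len(tokens) - self.span_length, 0)
        for start in range(0, last_start + 1, self.stride):
            window = tokens[start:start + self.span_length]
            self.instances.append(" ".join(window))
```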
> or you add to our dicts of allowed types for different things
Yeah, this was how I envisioned someone adding, e.g., a tuple matcher in third-party code. You import the dictionary, add a class, and then use methods from `run.py`. You could also do this with concrete models, if you wanted to.
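Roughly like this - the registry and class names below are stand-ins rather than the real deep_qa names, but the pattern is the same:

```python
# The registry and class names below are stand-ins, not the real deep_qa names;
# the pattern is just: import the dict of allowed types, add your class, then
# drive everything from the usual parameter file / run methods.

concrete_tuple_matchers = {}  # stand-in for the library's dict of allowed types


class MyTupleMatcher:
    """Third-party implementation living outside the deep_qa repo."""
    def __init__(self, params):
        self.params = params


# Register the class under a name that can be referenced from the json file.
concrete_tuple_matchers["my_tuple_matcher"] = MyTupleMatcher

# The parameter file (shown here as a dict) can now name the new type.
params = {"tuple_matcher": {"type": "my_tuple_matcher"}}
matcher_class = concrete_tuple_matchers[params["tuple_matcher"]["type"]]
matcher = matcher_class(params["tuple_matcher"])
```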
> In terms of `BackgroundInstance`, that's true, it lets you do this, but it also breaks from the API of how every other instance is read from files
Yeah, it's not the cleanest thing, but I don't think it breaks the API much: https://github.com/allenai/deep_qa/blob/master/deep_qa/models/memory_networks/memory_network.py#L140-L142.
> three copies of wikipedia
This is a good point, and a use case I hadn't thought of. I don't think it's a huge deal to have several copies of SQuAD around, for instance, in part because it's hard to imagine another set of parameters, except in modifying SQuAD for use in a different task, like answer sentence selection.
Also, for language modeling, you probably don't want to load the whole dataset into memory anyway (because it wouldn't fit); you want to scan across a directory in a data generator, so the whole data processing code would need to be rethought, or just bypassed, for the language modeling case.
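Something like the following, where the file layout and batching details are just assumptions for illustration:

```python
# Sketch of the generator alternative: stream batches of raw text from a
# directory instead of materialising a Dataset in memory. The file layout and
# batching details are assumptions for illustration.
import os

def text_batch_generator(data_directory: str, batch_size: int = 32):
    """Yield lists of lines, reading files lazily so the corpus never sits in memory."""
    batch = []
    for filename in sorted(os.listdir(data_directory)):
        with open(os.path.join(data_directory, filename)) as data_file:
            for line in data_file:
                batch.append(line.strip())
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch
```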
Another consideration is that for quick iteration time in running test experiments, you typically want to have as little data processing done in your experiment code as possible. Hence #352, which would cut out ~10 minutes of waiting when trying to debug something on the full SQuAD dataset. If we move to loading the data as you're suggesting, we add processing time on top of what we already have.
So I think for all tasks we have except language modeling, the dataset reader approach is sufficient, and preferable to having `Dataset` classes for each dataset format. Are there other tasks that are problematic for dataset readers?
@tusharkhot here is the discussion that followed our chat about loading in json graphs.
I agree that in general this is not a problem. However, I could imagine that this would make manipulating datasets really easy for weird and specific formulations which you want to try - imagine you want to know how BiDAF performs on "What" questions. If the dataset can be passed parameters in the json file, you could just run an evaluation passing a `what_questions_only` flag to the dataset or something.
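E.g., a hypothetical sketch - the flag and the tab-separated format are made up, but this is the kind of thing a parameterised dataset would make trivial:

```python
# Hypothetical sketch - the flag and the tab-separated format are made up, but
# this is the kind of thing a parameterised dataset would make trivial.

class FilterableSquadDataset:
    def __init__(self, what_questions_only: bool = False):
        self.what_questions_only = what_questions_only
        self.instances = []

    def read_from_file(self, filename: str) -> None:
        # Assumes the common question<TAB>passage<TAB>span format.
        with open(filename) as data_file:
            for line in data_file:
                question, passage, span = line.strip().split("\t")
                if self.what_questions_only and not question.lower().startswith("what"):
                    continue
                self.instances.append((question, passage, span))
```

The json file would then just carry `"what_questions_only": true` alongside the rest of the dataset parameters.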
I don't quite see why this is incompatible with serialising pre-indexed and tokenised data, though. Part of that serialisation would be the dataset parameters, which are just loaded along with the pickled instances / padded text data file. I agree that it wouldn't be compatible with different dataset parameters, but if you are changing these, you're probably running longer jobs where the amount of time spent padding/indexing is inconsequential.
What's the context on "loading in json graphs"?
Ok, so, in the interest of doing highest-priority work first, what are the actual problems with the current data loading code that we want to solve right now? It looks like you're thinking of doing your seq2seq stuff in DeepQA, is that right?
I think this PR is talking about a broader problem than what I was talking to Mark about. I was referring to the problem that currently the code assumes that every instance should fit on one line. With tuples/graphs encoded using JSON, we could still have a single instance on one line, at the cost of readability. So this is not too much of a problem for me currently, but it can be a problem - e.g. what if my strings have newlines and I can't strip/replace them, since my model uses these newlines?
So, I was wondering if you need the restriction of one line == one instance. Based on the discussion here, it seems that there is no such restriction; it is just not the common-case scenario and is harder to do. (Right?)
This wasn't really with the intention of doing seq2seq stuff; I was more looking at getting the OntoNotes data for SRL in as an `Instance`, and it is already formatted in a table, but with the annotations separate from the sentences which they annotate. This is a reasonable format which there is no need to force into another one, apart from the fact that I can't read from multiple files using `Instance.read_from_line`.
Also, I personally find the data loading part confusing, and people trying to add new data instances to DeepQA have found it difficult - or maybe not, and that is just my impression.
Concrete problems:
Ok - doing an SNLI loader and talking about the WikiBio dataset made me think you were going for the seq2seq stuff. Having now thought about this a bit more, I think these two options aren't mutually exclusive. For datasets with different representations on disk, we still need to convert them into a common `Instance` type, so those aren't going away. The only question is how to read the data from disk. We have two proposed solutions:

1. Require that each `Instance` type be read from a single line, and have `DatasetReader` scripts which convert datasets into the common single-line format. This is limiting in the kinds of processing you can do, but it simplifies the code that reads in the data - you don't have to know about particular dataset readers in model code (it also simplifies some data creation pipeline stuff, but that's all in the Scala code, anyway).
2. Allow `Dataset` subclasses to read in their instances from a filesystem path, likely bypassing the `read_from_line` methods on `Instances` entirely. This allows for more flexibility in how datasets are stored and loaded, and removes the need for `DatasetReader` scripts (largely containing the same code, just putting it in a different place), but it requires additional complexity in the modeling code.

Both of these seem reasonable, and it's pretty easy to allow both - just add a key to the json file that specifies the dataset type, with `TextDataset` being the default, and let the dataset class load stuff however it wants, defaulting to option (1) above. It'd also be pretty easy to add `DatasetReader` scripts that just call the `Dataset.read_from_path` methods and save the instances in a one-line-per-instance format; see the sketch below.
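A rough sketch of that compromise - apart from `TextDataset` and `read_from_path`, which are mentioned above, the names and the registry here are stand-ins rather than the real deep_qa API:

```python
# Rough sketch of the compromise: a "dataset_type" key in the json parameters
# picks the Dataset class, defaulting to the existing line-based behaviour.
# Apart from TextDataset and read_from_path, the names here are stand-ins.
import json


class TextDataset:
    """Default behaviour: one instance per line (option 1 above)."""
    def __init__(self, instances):
        self.instances = instances

    @classmethod
    def read_from_path(cls, path, params):
        with open(path) as data_file:
            return cls([line.strip() for line in data_file])


class SquadDataset(TextDataset):
    """A dataset that overrides loading entirely (option 2 above)."""
    @classmethod
    def read_from_path(cls, path, params):
        with open(path) as data_file:
            data = json.load(data_file)
        instances = []  # ...convert SQuAD's nested JSON into instances here...
        return cls(instances)


dataset_types = {"text": TextDataset, "squad": SquadDataset}


def dataset_from_params(params: dict):
    # "dataset_type" in the json parameters picks the class; the default keeps
    # the existing line-based behaviour.
    dataset_class = dataset_types[params.pop("dataset_type", "text")]
    return dataset_class.read_from_path(params.pop("train_file"), params)
```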
So, yes, feel free to do it this way. Sorry for being ornery about it, and thanks for pushing back on my orneriness. This is a fine solution to the problem. But don't put all of the `Dataset` objects in the same file - that would get really big really fast. Not sure if it'd be overkill to use the same format we have for instances and models, organizing things by task and then by dataset, but it'd at least be consistent.
Ah yes, that's true, let's do that. Yeah, I only put them in the same one as I wasn't sure if you would go for this, so it was minimal effort - I was thinking of putting them in the same files as their respective instances, so that all the code you need to read in order to understand how an instance can be loaded for a given type is in one file? I guess also renaming them from `instances` to ... `data_types` or something?
Either we could group them with the instances they create, or we could group them by task and have files with each dataset name. The latter is probably better, because a single dataset could be used to produce different kinds of instances (e.g., SQuAD for QA, answer sentence selection, or question generation), but all need the same loading code. Also, the latter would be easier for someone browsing the code to see if we can load a dataset they're thinking of. This also argues against grouping them by task, actually, because a single dataset can be used for more than one task. So maybe just a module under `data` called `datasets`, with files like `squad.py`, `snli.py`, etc.? The trouble is that we already have `datasets.py`, so I'm not sure what to do there...
The idea behind this PR is to make the loading of datasets more flexible.
At the moment, the only way to load data is to have an `Instance` contained within a line of a single file. This is ok for some types of data, but is very inefficient and clunky for others, such as data which naturally spans multiple files (such as the Wikipedia bio dataset) or sequence tagging datasets which use windows of text for training (e.g. language modeling).

This change requires subclassing `Dataset` for each type of instance, but allows 3 new things:

- Passing a `span_length` parameter to `read_from_file` to split a large text file up into 20 window span lengths on the fly (or approximate spans if the data is not already tokenised), reading these window spans into instances directly.
- Reading directly from new datasets in whatever format they provide, rather than munging them into our specific line formats (`index###question###answer` etc). This removes a big barrier to using the library with a new dataset (in my opinion).

Eventually, we could also remove `Instance.read_from_line`, replacing it with `Dataset` subclasses which know how to read that particular instance (or variants of it) from a directory. We only actually use the `Dataset.read_from_file` method in one place in `Trainer`, so we can just pass the dataset class in the parameter file and call the respective subclass's method to load data.

Comments/things I inevitably haven't thought of appreciated!