[Discussion] Dataset Abstraction

GabrielKP commented 2 years ago

When working with multiple documents, it possibly would be great to include some sort of abstraction for multiple documents forming a Dataset.

The Dataset would then allow:

Caching of Datasets after doing transformations on them (which I am currently requiring), e.g. in a Pipelined workflow
Saving and loading Datasets efficiently.

@ArneBinder and I were discussing using HuggingFace Datasets as base class, which would bring the hole power of HuggingFace Datasets to pytorch-ie.

ArneBinder commented 2 years ago

@GabrielKP Thanks for opening this. This is related to or depends on #120.

ChristophAlt commented 2 years ago

@GabrielKP @ArneBinder Thanks for kicking of the discussion.

@GabrielKP : What kind of transformations do you currently require? Are these applied before you turn the HF dataset into a collection of PIE documents?

ChristophAlt commented 2 years ago

I've also investigated the use of Huggingface's Dataset class as the base class before going on vacation. The main challenge is how to represent a document, specifically the annotations, as an arrow table.

Let's assume that we have a document with text, an id, metadata, and an arbitrary number of annotations of the types described in #120.

The main questions are:

How to represent the annotation types as arrow schemas / types?
Is it necessary to support mapping of these annotations (when represented as an arrow table)?
How to represent the difference between groundtruth and annotation / prediction?
How does this affect the operations currently supported by HF Dataset?
What kind of operations do we want to support?
How does this work with the inference / evaluation pipeline? E.g. how costly / feasible is it to iteratively add predictions to an instance during inference.

ChristophAlt commented 2 years ago

After spending some thoughts on this and related topics, I think the initial step would be to create a dataset for development that reflects all the aspects we want to implement initially. The dataset should probably contain 3-5 example documents with relevant annotations in a common data format. Based on that we can go through all the steps necessary to go from loading a dataset to training, evaluation, and inference.

Each document should:

contain text (at least two with multiple sentences)
contain a document id
contain some arbitrary metadata
sentence annotations
named entity annotations (where applicable)
relation annotations (where applicable)

Important edge cases:

sentences without entity annotations (and thus without relation annotations)
sentences without relation annotations
sentences without relation annotations but more than two entities
single sentence per document
single sentence per document without annotations
annotations that span multiple sentences

Other considerations:

the file format should probably be JSON (maybe the dygiepp format)
the documents should be as short as possible

The steps to be prototyped are:

loading the dataset as a HF dataset / dataset dict
converting the HF dataset / dataset dict to PIE dataset / dataset dict (document abstraction)
preparing the model inputs (dict of tensors and other non-tensor metadata)
- consider the special case of splitting a document into multiple model inputs
- consider the special case of windowing
prepare targets
caching of model inputs and targets (?)
creating and running a pipeline (one stage and two stage) (that adds predictions to a document)
- with gold labels (groundtruth) required to create an input (e.g. relation extraction)
- with predictions to create an input (e.g. relation extraction as the second stage)

ArneBinder commented 2 years ago

Thanks a lot for the write up! Some remarks for the dummy dataset:

Can we also go with a predifined test split?
additional edge cases:
- relation that spans two sentences

I'm quite curious about the pipeline setup. Do we have to distinguish between an evaluation and a prediction pipeline? I don't think it is useful to answer that now, just collecting further directions of thinking.

Regarding the data format: https://github.com/dwadden/dygiepp looks to be already tokenized. I think this is not favorable. However, we should keep also this case in mind.

ChristophAlt commented 2 years ago

I added annotations span multiple sentences as an edge case.

Could you elaborate on the predefined test split?
I'm honestly not sure yet if we have to distinguish evaluation and prediction pipeline -- I agree to postpone that part and see as we progress.

You're right regarding the data format, tokenized is not what we want. I'd suggest to use a dataset of the following simple format then (the file contains a list of these json objects):

[
{
"text": "A works at B.",
"id": "ABC1234",
"metadata": { "key1": "value1", "key2": 12345 },
"sentences": [ {"start": 0, "end": 13} ]
"entities": [ {"start": 0, "end": 1, "label": "PER"}, {"start": 11, "end": 12, "label": "ORG"} ],
"relations": [ {"head": 0, "tail": 1, "label": "per:employee_of"} ]
},
...
]

ArneBinder commented 2 years ago

This looks quite good!

Could you elaborate on the predefined test split?

I think it is a common case that a dataset contains a test set and often also a validation set. The idea was just to model this in our dummy data, e.g. by having a single document dedicated to it, to increase visibility of that and to recognize any edge cases related to having multiple splits early.

ChristophAlt commented 2 years ago

Please have a look at the dataset.

I added examples for all of the usecases (hopefully).
I added two validation examples to the training data, which can be used when simulating splitting the train dataset into train and validation split.
I added a dedicated validation as well.
I added a test set.

ArneBinder commented 2 years ago

I created https://github.com/ChristophAlt/pytorch-ie/pull/129 to document our decisions and to visualize how individual concepts are related and connected. @ChristophAlt please have a look.

ArneBinder commented 2 years ago

Please have a look at the dataset.

I created a PR for that: https://github.com/ChristophAlt/pytorch-ie/pull/130

GabrielKP commented 2 years ago

What kind of transformations do you currently require? Are these applied before you turn the HF dataset into a collection of PIE documents

I am marking my dataset with NER-tags and dependency trees, e.g. with Stanza. After that I mark specific pairings of certain entities found beforehand.

Although this could be done before turning the dataset into a PIE documents, I implemented it (and think it may be better) to happen after the conversion, as I can imagine setting up the project so that I can later enhance these annotations without deleting the old ones by using a more sophisticated method.

Sorry if I am not completely following you on the details of this, but you can ask me about my opinion anytime!

ChristophAlt commented 2 years ago

I have an initial version of a Document that is JSON-serializable and de-serializable, and thus can be persisted (and cached) as an Arrow table. See here.

It still looks a little rough around the edges but the changes can be summarized as follows:

users can now define their own Documents (and annotations) in a dataclass-style fashion
annotations are now dataclasses to simplify things like hash computation and enforcing frozen representations
a document can be serialized as a dictionary of primitive types, references are replaced by object hashes
a document can be deserialized from such a dictionary, the dependency chain of annotations is resolved too, e.g. entities must be deserialized before relations, because the latter depend on the former (this is inferred from the target graph, i.e. the graph that contains the dependencies between annotations, e.g. entities --> text, sentences --> text, relations --> entities).

There are still a few things missing:

Get datasets to transparently convert from Document to dict and vice versa (this is possible, HF datasets does this for Image and Audio features, however the current implementation doesn't allow for an easy extension)
The current assumption that annotation fields are always lists of annotations (e.g. a list of spans) but there may be fields that only contain a single annotation, e.g. a document-level label.
How to continue with the groundtruth / predictions concept for an annotation field? This only concerns evaluation and inference (when operating directly on documents), but has no impact on training.

ChristophAlt commented 2 years ago

I have made further changes to the initial version. See here.

The changes can be summarized as follows:

All Annotations derive from AnnotationBase, which has a single target attribute that references the annotation it targets (e.g. entities -> text) or None if no target is specified.
List[SomeAnnotation] has been replaced with an AnnotationList[SomeAnnotation], which is essentially a wrapper around a list and is responsible for setting the target attribute of an annotation when added to a document.
The two changes allow to conveniently retrieve information of the targeted annotation, e.g. the text of a Span

Missing changes:

add Annotation[SomeAnnotation], which is the equivalent of AnnotationList but for single annotations. The dataclass annotation is then document_level_annotation: Annotation[Label] = annotation_field(target=...) or document_level_annotation: Annotation[Label] if no target is required.
Annotation and AnnotationList have a prediction or predictions field that holds the corresponding predictions. This would neatly tie the groundtruth annotations and predictions and simplify evaluation logic that operates on annotations directly instead of some derived targets, e.g. as done during the training stage.
Allow document attributes to be accessed in dictionary style, e.g. document["entities"], throw an exception if the key is missing. This would simplify accessing annotations in taskmodules as we typically specify the fields as strings in the constructor.

GabrielKP commented 2 years ago

This looks extraordinary!

I am going to be that guy: If you want your framework to be used by people you can already think of how and where to leave helpful comments to understand what to do. I know it is pretty early, but it might alleviate a lot of pain later.

The reasoning behind this is that you are using quite elegant and sophisticated methods (e.g. dataclassed, fields): these things are bit like magic if you do not know how they work and I would definitely assume that many of the people you are targeting to use this tool will not know what they are. Although not hard to learn, it definitely is a psychological barrier (I am noticing it with myself, could be that I am exception though).

I am especially a bit confused about this as it is unclear at first what target is and does, but (as a user) I will need to use that in my definitions.

ChristophAlt commented 2 years ago

Thanks for the valuable feedback! @GabrielKP

You're absolutely correct in your assessment. The whole concept needs much more explanation and it's best to provide examples of how to use it. The current status of the implementation is sub optimal, to say the least. 😄 I just wanted to make sure that it actually works, because there are so many things, in particular in HF datasets, that had to be adapted and extended to fit our needs. I'm extremely happy with the current proof-of-concept version --if we ignore the fact that the code is all in one file and looks crazy bad. But it does everything that is needed without making any changes to HF datasets or any external dependencies. I'm currently in the progress of implementing this properly. The most important features are:

It let's us define Documents and Annotations of any type, which can be serialized to dictionaries of primitive types. Importantly, there are no restrictions on annotations, users can define their own, if necessary.
The ability to serialize Documents allows us to store it as an Apache Arrow table and use all the functionality of a HF datasets Dataset and DatasetDict (although some of it doesn't make much sense when operating on Documents).
There is a Dataset implementation that transparently serializes and deserializes Documents from the underlying Arrow table. This means that accessing a Dataset returns Document instances, e.g. dataset[0], dataset[0:2], and dataset[[0, 1, 2]] return the deserialized documents. Also dataset.map(...) gives you a document, or a list of documents if batched=True. The result of all maps are cached, even when making changes to the actual documents!
Finally, there now exists a GeneratorBasedBuilder similar to the one in HF datasets that has an additional method _generate_document, which is responsible for converting the original datasets feature dictionary into a document, and then returns a document-based Dataset. If you want to implement a dataset that doesn't exist in HF datasets, you have to implement all the abstract functions, e.g. _generate_examples; if there exists a HF dataset, you only have to implement _generate_document and specify the base dataset path, e.g. "conll2003", everything is then taken care of.
The coolest part is that the GeneratorBasedBuilder script file can be uploaded to the HF hub or a Git repository and be loaded similar to the ones from HF datasets. That means, we can load our document-based datasets just like normal HF datasets with load_dataset("pie/conll2003") and it returns a DatasetDict with document-based Datasets.

ArneBinder commented 2 years ago

This looks really promising! But I still need to read all the content thoroughly to give more detailed feedback.

@GabrielKP Just a short comment regarding

I am especially a bit confused about this as it is unclear at first what target is and does, but (as a user) I will need to use that in my definitions.

target in this case means the target of annotation when imaging annotation layers as a graph. E.g. a span annotation is targeting text by assigning a label to a certain part of it. In the same way, a relation annotation is targeting entities by assigning a label to an ordered pair of them.

We may think about renaming this to sth like base since target is already a bit overloaded in the context of encode_target in the taskmodule. However, after thinking about this a bit, it looks like a better fit here and we should rethink to use target in the taskmodule context (encode_target etc).

This just comes to my mind: Is there any usecase/need to have multiple targets/bases for one annotation?

ChristophAlt commented 2 years ago

Many of the points discussed here are now implemented by #130.

We now have a first-class integration of HF datasets and datasets / dataset loading scripts uploaded to the model hub. It's now possible to use upload and use loading scripts that directly return Documents, see here for an example with the CoNLL03 dataset. The corresponding dataset loading script is stored on HF dataset hub, see here.

ArneBinder / pytorch-ie

[Discussion] Dataset Abstraction #123