@GabrielKP Thanks for opening this. This is related to or depends on #120.
@GabrielKP @ArneBinder Thanks for kicking off the discussion.
@GabrielKP : What kind of transformations do you currently require? Are these applied before you turn the HF dataset into a collection of PIE documents?
I've also investigated the use of Huggingface's Dataset class as the base class before going on vacation. The main challenge is how to represent a document, specifically the annotations, as an Arrow table.
Let's assume that we have a document with text, an id, metadata, and an arbitrary number of annotations of the types described in #120.
The main questions are:
After giving this and related topics some thought, I think the initial step would be to create a development dataset that reflects all the aspects we want to implement initially. The dataset should probably contain 3-5 example documents with relevant annotations in a common data format. Based on that, we can go through all the steps necessary to get from loading a dataset to training, evaluation, and inference.
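To make the serialization question a bit more concrete, here is a rough sketch of how such documents could be stored with plain HF datasets (only an illustration of the nested Arrow schema, not a proposed implementation; `Dataset.from_list` needs a reasonably recent `datasets` version):

```python
from datasets import Dataset

# One toy document with nested annotation lists; Arrow infers a schema of
# lists of structs for the "entities" and "relations" columns.
docs = [
    {
        "text": "A works at B.",
        "id": "ABC1234",
        "entities": [
            {"start": 0, "end": 1, "label": "PER"},
            {"start": 11, "end": 12, "label": "ORG"},
        ],
        "relations": [{"head": 0, "tail": 1, "label": "per:employee_of"}],
    }
]

dataset = Dataset.from_list(docs)
print(dataset.features)  # shows the inferred nested Arrow schema
print(dataset[0])        # round-trips back to a plain dict
```

The open question is then how to get from such nested dictionaries back to proper annotation objects.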
Each document should:
Important edge cases:
Other considerations:
The steps to be prototyped are:
Thanks a lot for the write-up! Some remarks on the dummy dataset:
I'm quite curious about the pipeline setup. Do we have to distinguish between an evaluation and a prediction pipeline? I don't think it is useful to answer that now, just collecting further directions of thinking.
Regarding the data format: https://github.com/dwadden/dygiepp looks to be already tokenized. I think this is not favorable. However, we should also keep this case in mind.
I added annotations spanning multiple sentences as an edge case.
[
{
"text": "A works at B.",
"id": "ABC1234",
"metadata": { "key1": "value1", "key2": 12345 },
"sentences": [ {"start": 0, "end": 13} ]
"entities": [ {"start": 0, "end": 1, "label": "PER"}, {"start": 11, "end": 12, "label": "ORG"} ],
"relations": [ {"head": 0, "tail": 1, "label": "per:employee_of"} ]
},
...
]
This looks quite good!
Could you elaborate on the predefined test split?
I think it is a common case that a dataset contains a test set and often also a validation set. The idea was just to model this in our dummy data, e.g. by having a single document dedicated to it, to make this visible and to recognize any edge cases related to having multiple splits early on.
Please have a look at the dataset.
I created https://github.com/ChristophAlt/pytorch-ie/pull/129 to document our decisions and to visualize how individual concepts are related and connected. @ChristophAlt please have a look.
Please have a look at the dataset.
I created a PR for that: https://github.com/ChristophAlt/pytorch-ie/pull/130
What kind of transformations do you currently require? Are these applied before you turn the HF dataset into a collection of PIE documents?
I am marking my dataset with NER tags and dependency trees, e.g. with Stanza. After that, I mark specific pairings of certain entities found beforehand.
Although this could be done before turning the dataset into PIE documents, I implemented it to happen after the conversion (and think this may be better), as I can imagine setting up the project so that I can later enhance these annotations with a more sophisticated method without deleting the old ones.
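Roughly what I mean, as a sketch (not my actual code; it assumes Stanza is installed and its English models are downloaded, and the pairing step is only illustrative):

```python
from itertools import combinations

import stanza

# pre-annotate raw text with Stanza NER, then build candidate entity pairs
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

text = "A works at B."
doc = nlp(text)

# Stanza exposes character offsets, so the spans can later be attached to a
# PIE document as entity annotations
entities = [
    {"start": ent.start_char, "end": ent.end_char, "label": ent.type}
    for ent in doc.ents
]

# naive pairing of all found entities as relation candidates
candidate_pairs = list(combinations(range(len(entities)), 2))
```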
Sorry if I am not completely following you on the details of this, but you can ask me about my opinion anytime!
I have an initial version of a Document that is JSON-serializable and de-serializable, and thus can be persisted (and cached) as an Arrow table. See here.
It still looks a little rough around the edges but the changes can be summarized as follows:
There are still a few things missing:
I have made further changes to the initial version. See here.
The changes can be summarized as follows:
- A new base class `AnnotationBase`, which has a single `target` attribute that references the annotation it targets (e.g. entities -> text) or `None` if no target is specified.
- `List[SomeAnnotation]` has been replaced with an `AnnotationList[SomeAnnotation]`, which is essentially a wrapper around a list and is responsible for setting the `target` attribute of an annotation when it is added to a document.

Missing changes:

- An `Annotation[SomeAnnotation]` type, which is the equivalent of `AnnotationList` but for single annotations. The dataclass annotation would then be `document_level_annotation: Annotation[Label] = annotation_field(target=...)`, or `document_level_annotation: Annotation[Label]` if no target is required.
- A `prediction` or `predictions` field that holds the corresponding predictions. This would neatly tie the ground-truth annotations to the predictions and simplify evaluation logic that operates on annotations directly instead of on some derived targets, e.g. as done during the training stage.

This looks extraordinary!
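To check whether I read the proposal correctly, here is how I picture a document declaration built from these pieces. The classes below are simplified stand-ins I wrote for illustration, not the actual pytorch-ie code:

```python
from dataclasses import dataclass, field
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


# Simplified stand-ins for AnnotationBase / AnnotationList / annotation_field
# as I understand them from the description above; not the real implementation.
@dataclass(eq=True, frozen=True)
class LabeledSpan:
    start: int
    end: int
    label: str


@dataclass(eq=True, frozen=True)
class BinaryRelation:
    head: int  # index into the entities layer, as in the dummy JSON above
    tail: int
    label: str


class AnnotationList(list, Generic[T]):
    """A list of annotations that knows which layer (its target) it annotates."""

    def __init__(self, target: Optional[str] = None):
        super().__init__()
        self.target = target


def annotation_field(target: Optional[str] = None):
    """Dataclass field whose default is an AnnotationList bound to `target`."""
    return field(default_factory=lambda: AnnotationList(target=target))


@dataclass
class MyDocument:
    text: str
    id: Optional[str] = None
    # entities target the raw text; relations target the entity layer
    entities: AnnotationList[LabeledSpan] = annotation_field(target="text")
    relations: AnnotationList[BinaryRelation] = annotation_field(target="entities")


doc = MyDocument(text="A works at B.", id="ABC1234")
doc.entities.append(LabeledSpan(start=0, end=1, label="PER"))
doc.entities.append(LabeledSpan(start=11, end=12, label="ORG"))
doc.relations.append(BinaryRelation(head=0, tail=1, label="per:employee_of"))
```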
I am going to be that guy: if you want your framework to be used by people, you can already think about how and where to leave helpful comments so users understand what to do. I know it is pretty early, but it might alleviate a lot of pain later.
The reasoning behind this is that you are using quite elegant and sophisticated methods (e.g. dataclasses, fields): these things are a bit like magic if you do not know how they work, and I would definitely assume that many of the people you are targeting with this tool will not know what they are. Although not hard to learn, it definitely is a psychological barrier (I am noticing it with myself, though it could be that I am an exception).
I am especially a bit confused about this, as it is unclear at first what `target` is and does, but (as a user) I will need to use it in my definitions.
Thanks for the valuable feedback! @GabrielKP
You're absolutely correct in your assessment. The whole concept needs much more explanation, and it's best to provide examples of how to use it. The current status of the implementation is suboptimal, to say the least. 😄 I just wanted to make sure that it actually works, because there are so many things, in particular in HF datasets, that had to be adapted and extended to fit our needs. I'm extremely happy with the current proof-of-concept version, if we ignore the fact that the code is all in one file and looks crazy bad. But it does everything that is needed without making any changes to HF datasets or adding any external dependencies. I'm currently in the process of implementing this properly. The most important features are:
- `dataset[0]`, `dataset[0:2]`, and `dataset[[0, 1, 2]]` return the deserialized documents. Also, `dataset.map(...)` gives you a document, or a list of documents if `batched=True`. The results of all maps are cached, even when making changes to the actual documents!
- A `GeneratorBasedBuilder` similar to the one in HF datasets that has an additional method `_generate_document`, which is responsible for converting the original dataset's feature dictionary into a document, and then returns a document-based Dataset. If you want to implement a dataset that doesn't exist in HF datasets, you have to implement all the abstract functions, e.g. `_generate_examples`; if a HF dataset already exists, you only have to implement `_generate_document` and specify the base dataset path, e.g. "conll2003", and everything else is taken care of.
- The `GeneratorBasedBuilder` script file can be uploaded to the HF hub or a Git repository and loaded similar to the ones from HF datasets. That means we can load our document-based datasets just like normal HF datasets with `load_dataset("pie/conll2003")`, and it returns a DatasetDict with document-based Datasets.

This looks really promising! But I still need to read all the content thoroughly to give more detailed feedback.
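Just to check my understanding of what `_generate_document` has to do for something like CoNLL03: essentially the conversion sketched below, only handled by the builder instead of by hand (plain HF `datasets`, with naive whitespace detokenization and BIO handling, purely for illustration):

```python
from datasets import load_dataset

# Turn the HF feature dict (tokens + ner_tags) into text plus
# character-offset entity annotations.
conll = load_dataset("conll2003", split="train[:10]")
label_names = conll.features["ner_tags"].feature.names


def example_to_document(example):
    tokens = example["tokens"]
    tags = [label_names[tag_id] for tag_id in example["ner_tags"]]

    # reconstruct the text and character offsets (naive whitespace joining)
    text, offsets, pos = "", [], 0
    for token in tokens:
        offsets.append((pos, pos + len(token)))
        text += token + " "
        pos += len(token) + 1

    # collect BIO spans as {"start", "end", "label"} dicts
    entities, current = [], None
    for (start, end), tag in zip(offsets, tags):
        if tag.startswith("B-"):
            current = {"start": start, "end": end, "label": tag[2:]}
            entities.append(current)
        elif tag.startswith("I-") and current is not None:
            current["end"] = end
        else:
            current = None

    return {"text": text.rstrip(), "entities": entities}


documents = [example_to_document(example) for example in conll]
```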
@GabrielKP Just a short comment regarding
I am especially a bit confused about this as it is unclear at first what target is and does, but (as a user) I will need to use that in my definitions.
`target` in this case means the target of the annotation when imagining annotation layers as a graph. E.g. a span annotation targets the text by assigning a label to a certain part of it. In the same way, a relation annotation targets entities by assigning a label to an ordered pair of them.
We may think about renaming this to sth like `base`, since `target` is already a bit overloaded in the context of `encode_target` in the taskmodule. However, after thinking about this a bit, it looks like a better fit here, and we should rather rethink the use of `target` in the taskmodule context (`encode_target` etc.).
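In other words, for the dummy document above the layer graph is relations -> entities -> text. Reusing the stand-in `doc` from the sketch a few comments up, resolving a relation down to the text it ultimately annotates looks like this:

```python
# relations target the entities layer, entities target the text
layer_targets = {"entities": "text", "relations": "entities"}

relation = doc.relations[0]                     # per:employee_of
head_span = doc.entities[relation.head]         # LabeledSpan(start=0, end=1, label="PER")
print(doc.text[head_span.start:head_span.end])  # -> "A"
```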
This just comes to my mind: is there any use case / need to have multiple targets/bases for one annotation?
Many of the points discussed here are now implemented by #130.
We now have a first-class integration of HF datasets and of datasets / dataset loading scripts uploaded to the hub. It's now possible to upload and use loading scripts that directly return `Documents`; see here for an example with the CoNLL03 dataset. The corresponding dataset loading script is stored on the HF dataset hub, see here.
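For reference, loading would then look roughly like this (a sketch of the described behaviour, not verified by me; the `entities` field name is an assumption based on the discussion above):

```python
from datasets import load_dataset

# the loading script on the hub returns document-based datasets instead of
# plain feature dictionaries (as described above)
dataset_dict = load_dataset("pie/conll2003")

train_docs = dataset_dict["train"]
first_doc = train_docs[0]          # a deserialized Document
print(first_doc.text)
print(list(first_doc.entities))    # assumed annotation field name
```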
When working with multiple documents, it possibly would be great to include some sort of abstraction for multiple documents forming a `Dataset`. The `Dataset` would then allow:

@ArneBinder and I were discussing using HuggingFace Datasets as the base class, which would bring the whole power of HuggingFace Datasets to pytorch-ie.