Annotator training mode

twinkarma commented 2 years ago

Overview

Allows the uploading of additional pre-annotated datasets for training and testing annotators before they can annotate for real. This is a common process used when recruiting annotators, for example https://crowd.cochrane.org/.

Training mode

Shows annotators the gold standard for annotating the dataset
Upload pre-annotated training dataset
Each document has an explanation as to why the label is right/wrong
Only one right answer
Shows answer/explanation immediately when choosing a label
Annotators must complete all documents within this set
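
As a rough illustration of the immediate feedback described above, the check could look something like the sketch below. It assumes the gold-standard layout proposed further down in this issue (a `gold` mapping from label name to `value` and `explanation`). The function name `training_feedback`, the `sentiment` label and the shape of the returned dictionary are hypothetical and not taken from the codebase.

```python
from typing import Any


def training_feedback(document: dict[str, Any], label_name: str, chosen: Any) -> dict[str, Any]:
    """Compare an annotator's choice with the gold answer for one label.

    Hypothetical helper: assumes document["gold"][label_name] holds
    {"value": ..., "explanation": ...}.
    """
    gold = document["gold"][label_name]
    expected = gold["value"]
    # Multi-select labels store a list; compare as sets so order is ignored.
    if isinstance(expected, list):
        correct = set(expected) == set(chosen)
    else:
        correct = expected == chosen
    return {
        "correct": correct,
        "expected": expected,
        "explanation": gold.get("explanation", ""),
    }


# Example training document with an invented "sentiment" label.
doc = {
    "id": 1,
    "text": "Document 1",
    "gold": {"sentiment": {"value": "positive", "explanation": "Praises the product."}},
}
print(training_feedback(doc, "sentiment", "negative"))
```

Returning the explanation alongside the verdict keeps the frontend simple: it can display the answer as soon as a label is chosen without a second request.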

Testing mode

Upload pre-annotated testing dataset
Annotators do not see answers or explanations
Only one right answer
Documents are shown to the annotators at random
Provide a summary of score at the end for annotators
Provide a summary to managers on how annotators did
Manager must then accept or reject annotators based on test results
Or this can be automated by setting a minimum score/percentage that they must achieve
Annotators must complete all documents in this set?
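
A minimal sketch of how the end-of-test summary and the automated accept/reject step might fit together. It assumes answers are collected as a mapping from document id to `{label_name: chosen_value}` and that the pass threshold is expressed as a proportion between 0 and 1; both are assumptions, as are the function names `score_test` and `passes_test`.

```python
from typing import Any


def score_test(gold_docs: list[dict[str, Any]], answers: dict[Any, dict[str, Any]]) -> dict[str, Any]:
    """Count how many gold labels the annotator matched.

    `answers` maps document id -> {label_name: chosen_value} (assumed shape).
    Unanswered labels count as wrong.
    """
    total = 0
    correct = 0
    for doc in gold_docs:
        given = answers.get(doc["id"], {})
        for label_name, gold in doc["gold"].items():
            total += 1
            if given.get(label_name) == gold["value"]:
                correct += 1
    proportion = correct / total if total else 0.0
    return {"correct": correct, "total": total, "proportion": proportion}


def passes_test(summary: dict[str, Any], min_pass_proportion: float) -> bool:
    """Automated accept/reject: pass if the score meets the minimum."""
    return summary["proportion"] >= min_pass_proportion


# Two invented test documents; the annotator gets one of them right.
gold_docs = [
    {"id": 1, "text": "Document 1", "gold": {"sentiment": {"value": "positive"}}},
    {"id": 2, "text": "Document 2", "gold": {"sentiment": {"value": "negative"}}},
]
answers = {1: {"sentiment": "positive"}, 2: {"sentiment": "positive"}}
summary = score_test(gold_docs, answers)
print(summary, passes_test(summary, 0.6))
```

The same summary dictionary could serve both the annotator's end-of-test screen and the manager's overview, with the manager's manual decision replacing `passes_test` when no threshold is configured.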

Labelling answers

Documents in testing and training mode will have
- Annotations and explanations provided in the document's JSON. These should be kept separate from the annotation field used for actual annotation; provisionally named "answers" and "explanation"
- Only answers needed for multiple choice labels, not for free text
More information on annotator page of what stage they're on
Subclass Document class to separate training, testing and real annotation task set?
Record whether annotator has completed training, testing or main annotation task, better way of recording projects that annotator has worked on
- Add a data field to User, use it to track project participation?
- Many to many fields to link annotators to projects? (Call it previous_projects (for now))
- user_project( test_count, test_wrong_annotation_count, train_count, annote_count, ...)
- Keep the annotates foreign key field or use this new field to track current participation
- Update when added to project, removed from project, or when annotating something
- Exemption field per user from testing and training?
- exempt per project
- exempt for everything
- have the option to do both?
UI
- Project config, turn on test and training
- Split train, test, annotate documents into three pages

Tasks

Backend - DW
- [x] Add test_documents to Project model
- [x] Add train_document to Project model
- [x] Remove current Project - Annotators (annotators - annotates) many to one link
- [x] Add a new Project - Annotators link through an intermediate table with column {train_score, train_completed: datetime/null, test_score, test_completed:datetime/null, num_annotations, annotations_completed: datetime/null}? (or alternatively just a current_stage variable with enum of what stage the annotator's in)
- [x] Add new Project model properties {annotator_max_train_score, annotator_max_test_score, has_test_stage, has_training_stage, can_annotate_after_passing_test, min_test_pass_threshold}
- [x] Add a function to update the annotator scores and counts, should be called after every time user submits an annotation
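
To make the intermediate-table idea above more concrete, here is a framework-free sketch that uses a plain dataclass in place of a Django model. The column names come from the task list above; the class name `AnnotatorProject`, the `Stage` enum and the `current_stage` function are hypothetical, and the rule for deriving the stage from the completion timestamps is one plausible reading rather than the actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class Stage(Enum):
    TRAINING = "training"
    TEST = "test"
    ANNOTATION = "annotation"
    COMPLETED = "completed"


@dataclass
class AnnotatorProject:
    """One row linking an annotator to a project (stand-in for the intermediate table)."""
    train_score: int = 0
    train_completed: Optional[datetime] = None
    test_score: int = 0
    test_completed: Optional[datetime] = None
    num_annotations: int = 0
    annotations_completed: Optional[datetime] = None


def current_stage(record: AnnotatorProject, has_training_stage: bool, has_test_stage: bool) -> Stage:
    """Derive the stage from the completion timestamps instead of storing it."""
    if has_training_stage and record.train_completed is None:
        return Stage.TRAINING
    if has_test_stage and record.test_completed is None:
        return Stage.TEST
    if record.annotations_completed is None:
        return Stage.ANNOTATION
    return Stage.COMPLETED


record = AnnotatorProject()
print(current_stage(record, has_training_stage=True, has_test_stage=True))
record.train_completed = datetime(2022, 1, 1)
print(current_stage(record, has_training_stage=True, has_test_stage=True))
```

Deriving the stage from the timestamps avoids keeping a separate `current_stage` column in sync, at the cost of a few extra comparisons per lookup.
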
Project Management View (Frontend) - TK
- [x] Modify project configuration to enable training and testing stage
- [x] Clearly separate documents into testing, training and annotation sections, maybe using tabs. Should we have sub-tabs in a single document tab, or put them all in the main tab list, i.e. {"Project Config", "Train Docs", "Test Docs", "Docs", "Annotators", ...}? Decision: use sub-tabs.
- [x] Allows uploading and exporting of testing and training documents
- [x] Do we need annotation statistics for testing and training? I vote for no - make sure stats are only displayed for annotation stage
- [x] Show the stage the annotator's in on the Annotator management page
Annotator view (Frontend) - DW
- [x] Revise the get_annotation_task and related functions to support this new annotation workflow
- [x] Change how Annotation object is serialised so that there's enough information for the Annotation view
- [x] Show document's annotation answers in training mode
- [x] Show what stage the annotator's in
- [x] Colour-code the test and training phases to make them clearer
- [x] fix bug on auto approving annotators #180
Misc
- [x] Fix export tests
- [x] Fix Frontend tests (cypress)

Labelling documents with answers (gold standard)

For test and training documents, answers and explanations should be included in the uploaded JSON/CSV, following the format below:

{
  "id": 1,
  "text": "Document 1",
  "gold": {
    "[labelName]": { "value": str/array, "explanation": str },
    ...
  }
}

In CSV, the columns will therefore be:

id | text | gold.[labelName].value | gold.[labelName].explanation

* Replace [labelName] with the name of the actual label specified in project configuration
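
The correspondence between the nested JSON and the dotted CSV columns can be sketched as below. This is only an illustration: the function name `flatten_gold` and the `sentiment` label are hypothetical, and how array values should be encoded inside a single CSV cell is not specified here, so the sketch passes values through unchanged.

```python
from typing import Any


def flatten_gold(document: dict[str, Any]) -> dict[str, Any]:
    """Turn the nested "gold" mapping into gold.<label>.value / .explanation columns."""
    # Copy every top-level field except "gold" as-is (e.g. id, text).
    row = {key: value for key, value in document.items() if key != "gold"}
    for label_name, gold in document.get("gold", {}).items():
        row[f"gold.{label_name}.value"] = gold.get("value")
        row[f"gold.{label_name}.explanation"] = gold.get("explanation")
    return row


doc = {
    "id": 1,
    "text": "Document 1",
    "gold": {"sentiment": {"value": "positive", "explanation": "Praises the product."}},
}
print(flatten_gold(doc))
```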

twinkarma commented 2 years ago

Is there only ever one right answer?

davidwilby commented 2 years ago

@twinkarma, a few questions already:

In the above we have
- Add test_documents to Project model
- Add train_document to Project model

Hadn't we planned to have separate tables for training and test documents? We may have decided against this and I've forgotten. So far, what I've done (in #169) is to create classes for training and test documents which inherit from a base document class. (Note that to do this, I've moved all the properties and methods from Document() to BaseDocument() and then recreated the former as Document(BaseDocument), since it turns out that you can't overwrite properties in child classes, e.g. the project field.)

I'm not sure what your intention is with the annotator_max_train_score and annotator_max_test_score fields?
For num_annotations - would it be better to compute this on the fly? Or did we decide that this would be too slow and to update this property with each annotation?

twinkarma commented 2 years ago

Can't remember what we decided before, but I've had a re-think and I'm now not sure there's a need to increase the complexity of the app, as we'd then potentially also need three Annotation classes, one for each Document class?
Now that I think about it, it's basically just a count of the documents in the training set and the testing set, so those functions will just return this.test_documents.all().count(), and the same for the training score.
I think we can just calculate this on the fly for now and change it if we are getting slowdowns. It's not something that will get called that often, only to check if a user's completed a certain stage of annotation I guess.

twinkarma commented 2 years ago

For annotator_max_train_score and annotator_max_test_score, I've just changed the field names to num_training_documents and num_test_documents, which probably makes more sense.

davidwilby commented 2 years ago

Are we good to close this issue?

GateNLP / gate-teamware