💫 Proposal: New JSON(L) format for training and improved training commands

Motivation

One of the biggest invonveniences and sources of frustration is spaCy's current JSON format for training. It's weirdly specific, annoying to create outside of the built-in converters and difficult to read. Training the model with incomplete information is pretty unintuitive and inconvenient as well.

To finally fix this, here's my proposal for a new and simplfied training file format that is easier to read, generate and compose.

Example

{
  "text": "Apple Inc. is an American multinational technology company headquartered in Cupertino, California. It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.",
  "ents": [
    {"start": 0, "end": 10, "label": "ORG"},
    {"start": 17, "end": 25, "label": "NORP"},
    {"start": 76, "end": 85, "label": "GPE"},
    {"start": 87, "end": 97, "label": "GPE"},
    {"start": 117, "end": 127, "label": "PERSON"},
    {"start": 129, "end": 142, "label": "PERSON"},
    {"start": 148, "end": 160, "label": "PERSON"},
    {"start": 164, "end": 174, "label": "DATE"}
  ],
  "sents": [
    {"start": 0, "end": 98},
    {"start": 99, "end": 175}
  ],
  "cats": {
    "TECHNOLOGY": true,
    "FINANCE": false,
    "LEGAL": false
  },
  "tokens": [
    {"start": 0, "end": 5, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 1},
    {"start": 6, "end": 10, "pos": "PROPN", "tag": "NNP", "dep": "nsubj", "head": 2},
    {"start": 11, "end": 13, "pos": "VERB", "tag": "VBZ", "dep": "ROOT", "head": 2},
    {"start": 14, "end": 16, "pos": "DET", "tag": "DT", "dep": "det", "head": 7},
    {"start": 17, "end": 25, "pos": "ADJ", "tag": "JJ", "dep": "amod", "head": 7},
    {"start": 26, "end": 39, "pos": "ADJ", "tag": "JJ", "dep": "amod", "head": 6},
    {"start": 40, "end": 50, "pos": "NOUN", "tag": "NN", "dep": "compound", "head": 7},
    {"start": 51, "end": 58, "pos": "NOUN", "tag": "NN", "dep": "attr", "head": 2},
    {"start": 59, "end": 72, "pos": "VERB", "tag": "VBN", "dep": "acl", "head": 7},
    {"start": 73, "end": 75, "pos": "ADP", "tag": "IN", "dep": "prep", "head": 8},
    {"start": 76, "end": 85, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 9},
    {"start": 85, "end": 86, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 10},
    {"start": 87, "end": 97, "pos": "PROPN", "tag": "NNP", "dep": "appos", "head": 10},
    {"start": 97, "end": 98, "pos": "PUNCT", "tag": ".", "dep": "punct", "head": 2},
    {"start": 99, "end": 101, "pos": "PRON", "tag": "PRP", "dep": "nsubjpass", "head": 16},
    {"start": 102, "end": 105, "pos": "VERB", "tag": "VBD", "dep": "auxpass", "head": 16},
    {"start": 106, "end": 113, "pos": "VERB", "tag": "VBN", "dep": "ROOT", "head": 16},
    {"start": 114, "end": 116, "pos": "ADP", "tag": "IN", "dep": "agent", "head": 16},
    {"start": 117, "end": 122, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 19},
    {"start": 123, "end": 127, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 17},
    {"start": 127, "end": 128, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 19},
    {"start": 129, "end": 134, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 22},
    {"start": 135, "end": 142, "pos": "PROPN", "tag": "NNP", "dep": "conj", "head": 19},
    {"start": 142, "end": 143, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 22},
    {"start": 144, "end": 147, "pos": "CCONJ", "tag": "CC", "dep": "cc", "head": 22},
    {"start": 148, "end": 154, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 26},
    {"start": 155, "end": 160, "pos": "PROPN", "tag": "NNP", "dep": "conj", "head": 22},
    {"start": 161, "end": 163, "pos": "ADP", "tag": "IN", "dep": "prep", "head": 16},
    {"start": 164, "end": 169, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 27},
    {"start": 170, "end": 174, "pos": "NUM", "tag": "CD", "dep": "nummod", "head": 28},
    {"start": 174, "end": 175, "pos": "PUNCT", "tag": ".", "dep": "punct", "head": 16}
  ]
}

Notes

Each record contains a "text" and optional "ents" (named entity spans), "sents" (sentence spans), "cats" (text categories) and "tokens" (tokens with offsets into the text and optional attributes).
Offsets into the text are standardised: "start" (start index) and "end" (end index). Other attributes match spaCy's API.
The "tokens" don't have to include all attributes. If an attribute isn't present (e.g. a part-of-speech tag or dependency label), it's treated as a missing value.
The token "head" is the index of the head token, i.e. token.head.i.
The provided gold-standard tokenization can also be used to train the parser to split/merge tokens (coming in v2.1.x). This could be an argument / a flag to set during training.
spaCy v2.1.x (nightly) already includes a spacy.gold.docs2json helper that generates the training format from Doc objects. It's intended to help keep the converters (.conllu etc.) in sync, since they can now all produce Doc objects and call into the same helper to convert to spaCy's format. This would also make the transition to a new format easy, because we'd only have to change the logic in docs2json.

✅ Pros

Easier to read and much closer to how the linguistic annotations are presented in spaCy's data structures.
Easier to mix and match, and compose different types of data. With this format, you could easily omit the "tokens" and only train on the "ents" or update the "sents" to improve the sentence boundary detection.
Easier to generate from other sources and corpora, because there are fewer restrictions around the shape of the text. While the previous format enforced a strict separation of paragraphs and sentences, this format will let you use longer and shorter texts and define sentence boundaries within each example.
Easier to extend. If there are ever new annotations to be trained from, they can be added in a backwards-compatible way. Document-level annotations (spans like sentences or entities) at the root, and token-level annotations (other predicted attributes) within the tokens.

💡 Related ideas

Use a JSON schema to validate the training data format (!!!) and provide helpful feedback if there are problems. For example, imagine an error like: "tokens -> 20 -> start has the wrong format: integer required, received string ("5")".
Speaking of validation: We could also add more in-depth data debugging and warnings (e.g. via an optional flag or command the user can run). For example: "Your data contains a new entity type ANIMAL that currently isn't present in the model. 1) You only have 15 examples of ANIMAL. This likely isn't enough to teach the model anything meaningful about this type. 2) Your data doesn't contain any examples of texts that do not contain an entity. This will make it harder for the model to generalise and learn what's not an entity."
Make spacy train accept data in both .json and .jsonl (newline-delimited JSON). JSONL allows reading the file in line-by-line and doesn't require parsing the entire document.
spacy train should make it much easier to update existing models or, alternatively, we should provide an analogous command with the same / similar arguments that takes the name of an existing model package instead of just the language to initialize. (Basically, if you know Prodigy, we want to provide the same smooth batch training experience natively in spaCy!)
Make it easy for custom components and third-party models to hook into the training format! spaCy already supports begin_training and update methods on components (if you call nlp.update, spaCy will iterate over the components and call their update methods if available – just like nlp.from_disk). So we could, for instance, allow an _ space in the training data, just like in the custom extension attributes, that can contain additional data – think coreference annotations, entity links etc.! Those would then automatically be added to the gold-standard Doc and become available in the custom component's update method.

What do you think? I'd love to hear your feedback in the comments!

Hi Ines,

Makes perfect sense. I am bit confused though with

"cats": { "TECHNOLOGY": true, "FINANCE": false, "LEGAL": false },

Do the classification labels correspond to the entire text? In this case, you have both sentences related to technology, but what if two sentences have different labels?

Do the classification labels correspond to the entire text? In this case, you have both sentences related to technology, but what if two sentences have different labels?

Yes, the "cats" are labels that refer to the whole text, i.e. the whole document – however you define it. This is the target for spaCy to predict as doc.cats.

Sorry if this was confusing from my example, but the JSON object I've outlined above is only one entry in a list of training examples. So you could have thousands of those objects, split into sentences or larger chunks, whatever you're working with.

Speaking of validation: We could also add more in-depth data debugging and warnings (e.g. via an optional flag or command the user can run). For example: "Your data contains a new entity type ANIMAL that currently isn't present in the model. 1) You only have 15 examples of ANIMAL. This likely isn't enough to teach the model anything meaningful about this type. 2) Your data doesn't contain any examples of texts that do not contain an entity. This will make it harder for the model to generalise and learn what's not an entity."

I think this would be a killer addition. Best practices are so hard to come by if you're not an expert or they're not shoved in your face. Please call them out!

This is a very good idea, and would be much simpler.

I still don't get the expected json schema for textcat only ...

After working on this for a bit, there are some annoying problems :(. The problem is that some useful assumptions that make the data easy to work with might not hold true for all corpora. So we have to decide between convenience and fidelity.

spaCy relies on the characters in the tokens matching up with the characters in the raw text. This means we can always say which span of characters in the text corresponds to some token. The tokens can't overlap, and all characters must be either whitespace or within a token.

Not all corpora follow these assumptions. Here are some cases where the gold tokens in a corpus might differ from the raw text:

Spelling or typo normalization
Punctuation canonicalisation (e.g. the text might have ", the token might be ``). Sentence-final periods are also often duplicated in the tokens, e.g. a sentence ending on "inc." will have a token "inc." and a final period.
Fused tokens or contractions might be expanded, e.g. the text might have m while the tokens might have I am.
Disfluencies and speech repairs like "um" and "uh". The token annotations might refer to cleaned up transcripts, and the 'raw' form might be the speech recogniser output.

Issues like this come up quite a lot in syntactic annotations, as to get reasonable trees we want to correct some surface errors. For instance, if a text has "the" when the user meant "they're", we'll get a really weird syntactic structure and POS tag sequence if we don't have two tokens. We can make the tokens something like ["th", "e"]. but this is sort of unnatural...So treebanks will often just have the leaves be ["they", "'re"].

A different type of problem is that corpora often don't provide the raw text. I still don't understand why this was standard, but...Here we are. Often all we get is token annotations, and the token annotations don't allow the raw text to be reconstructed.

Together, the two problems create the following dilemmas:

We can't rely on providing character offsets for token, sentence or entity annotations, because: 1a. There might be no raw text to point into 1b. The annotations might be provided with reference to tokens, and the tokens might not align to the raw text losslessly.
We can't rely on providing token offsets for entity annotations, because the corpus might only provide character offsets, with no gold-standard tokenization.

We therefore have the following choice to make. We can either:

a) Insist that the tokens must match the text, and that the raw text always be provided. If the corpus doesn't obey these constraints already, it must be "fixed" before it can be represented in our format.

b) Allow spans (for tokens, entities, sentences etc) to provide either token offsets or character offsets (or both).

The strict approach a is better for spaCy, because the constraint matches an implementation detail of our tokenizer anyway. The permissive approach b is better for inter-operation and allows us to use the same format in more contexts, at the cost of code that uses the format being more annoying to write. For instance, to get the text of all entities, you have to check whether the offsets refer to tokens or the text, and handle the two conditions differently.

Some initial thoughts:

Internal vs. External Tokenization

You want to be able to train using data where the tokenization does not correspond to spacy's internal tokenization, but you want to insist on spacy doing its own tokenization in all models, so you have to do lots of messy alignments no matter whether token or character offsets are provided.

Converting from Existing Data

You have the issue that you want to support:

Token list-style data (CONLL-like, NER BIO, etc.)
Character offset spans (also NER and other kinds of spans)

The token list people are going to be annoyed at having to do fiddly character offset conversions with data that's not really meant for it (do you do detokenize or just have whitespace separated tokens in the raw text?), but if you don't change the format, the character offset people can't easily provide their data at all. So I guess character offsets are the more inclusive option here.

(It is weird that the current format has the token text in two places and if you create training data where the tokens and raw text don't correspond, you get None words in your training data with no warnings while training. This should probably be in a separate issue, but in any case I would not recommend having the token text duplicated anywhere.)

Character Offsets

Character offsets are an enormous pain to work with if the raw data is not set in stone. If you realize that you need to modify the text at all, you can't quickly edit the training data by hand. Everything basically goes kerplooey. Apparently this isn't a major issue for your typical use cases, but I thought I would mention it anyway.

Even in automatic conversions, it's so easy to get off by a character and not realize it (all those fiddly character offset conversions for token list data) so you'd need to build in some sanity checks that make sure you don't have a lot of tokens covering whitespace and that the non-whitespace characters are covered by the tokens like you expect.

It is too bad that it's hard to include a comment containing the covered character span alongside each span annotation with some kind of verification/lint-type tool, since this is a useful sanity check. I suppose you could add an actual string value to the token, but you'd have to make the name extremely clunky (__auto_covered_text_span__) so people don't get confused and think it's a real orth value.

Representing Document Structure / Logic for Partitioning Data

I think it's going to be difficult to envision every use case here. You could go full TEI-XML for the raw text (wouldn't that be fun! but I'm not entirely joking, see below) and you'd still have cases that aren't covered. (This is related to the discussion about docs vs. paragraphs in #4013 .)

As a kind of compromise, I would suggest adding optional string ID attributes in as many places as possible so that people have the information to reconstruct documents and make the data splits that make sense for their task. I'm imagining d1-p2-s3-t4 kind of IDs down at the token level, just d1 at the doc level, etc., some kind of hash for the sentence text on the sentence level, whatever makes sense. And possibly you'd also want something like metadata / comment for each element instead of mixing this up with IDs. I could imagine a lot of solutions, none particularly elegant, but better than not being able to link the data back to the original corpus.

Document Formatting Anecdote + Alternative

In MERLIN there were a lot of letters, which contained headers with addresses and greetings and footers with closings and signatures. We had used a custom transcription format that used whitespace to suggest the formatting for right-aligned addresses and other parts of the text. When we wanted to render the formatted texts in our online search results, just inserting space characters didn't really get the right results because of font differences. I regretted that we hadn't annotated more of the basic letter format with some TEI-like tags. PAULA XML would have allowed this with no problem, because the raw text could be XML and then you could annotate the base tokens based on the text spans. The remaining annotation layers referred to the tokens, so it was no problem to have extra characters in the raw text.

I don't think you want to base everything on tokens (or you lose the character spans above) and I think it would be a major change to allow ignored characters in the text, but it would be another option that would allow for document structure that doesn't interfere too much with spacy's training (I would imagine a typical case as HTML snippets) and additional (meta)data without relying on custom metadata packed into IDs in various ways. Most people wouldn't want to use it, but those who did might appreciate it.

(There are still some whitespace issues with things like HTML snippets, since things like <p> insert whitespace between sentences that is not necessarily present in the raw text and things like <i> don't, so it's still kind of non-trivial.)

Token IDs

If you don't think it's too much of a hassle (due to enforcing uniqueness), I would suggest making token IDs mandatory and using them instead of array offsets for the dependencies. It makes things easier to read/debug and to edit, especially as texts get longer. If you want to excerpt a single sentence you just have to adjust the character offsets (or replace everything else in the raw text with whitespace, which is totally something I've done while debugging). If you have token IDs, then you can also have the option of representing other spans as lists of tokens rather than character offsets.

To be clear, the IDs should be allowed to be arbitrary. (See: try adding a word to the beginning of a CONLL-X/U sentence.)

Normalization

I think the constraint that the token text needs to correspond to the raw text is probably a good idea to keep things simpler.

I would only recommend formally handling normalizations if spacy has a normalization component as part of the pipeline. (Otherwise, what is the model going to learn? What input is it expected to handle?) If you have meaningful IDs/metadata, you can track down the original raw text or the original normalizations as you need to for other purposes, and keep the data in the format that you intend to handle in the model in the training data.

There are some technical issues that can come up, too. The main one is that inserting multiple tokens at the same position gets tricky if you're only relying on character offsets because there's no foolproof way to represent the order of the insertions.

Format Naming and Versioning

Similar to what's happened with CONLL formats, I think you're going to regret calling everything a "spacy JSON(L) format" and relying on file endings to know how to parse things. I think it might be useful to have an optional version/format value somewhere at the top level in the JSON data.

These are very important issues to address. Adrianeboyd's observations about the value of tokens with ID values is important, especially when the data can change, or when there are complexities with normalization and/or non-alphabetic characters ("punctuation"). This is especially likely to happen when working with a corpus of conversational transcriptions, as we do. In using Spacy to tokenize and tag the Santa Barbara Corpus of Spoken American English, we had to go through many elaborate contortions to get Spacy to pay attention just to the legitimate tokens, while ignoring aspects of the transcription that do not correspond to lexical words (but which are ultimately important for studying prosody, etc.). A token id approach would make this a lot easier, and would allow Spacy to contribute its considerable value to the study of spoken language.

In spacy v3.0 the JSON format will be replaced with training from Example objects, where the reference (gold) annotation is provided directly by a Doc object instead of being read in from an alternate format. A preview of v3.0 is now available through the spacy-nightly release, here are the details for the new Example class: https://nightly.spacy.io/api/example

Any annotation you can add to a Doc can potentially be used by a component in training and prediction, including custom extensions. A corpus of Doc objects can be serialized very compactly as a DocBin, which is much much smaller on disk than the v2 JSON format.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy