catalyst-cooperative / mozilla-sec-eia

Exploratory development for SEC to EIA linkage
MIT License

Handle very long documents with LayoutLM #47

Closed katie-lamb closed 1 month ago

katie-lamb commented 4 months ago

Overview

Like many transformer-based document models, LayoutLM has a maximum input length of 512 tokens. Very long Ex. 21 filings can exceed this limit, and everything past the 512th token is truncated. This problem is discussed in this issue. Truncating documents during training is probably acceptable, but it becomes problematic at inference time, when we want to predict on the whole document.

This notebook provides a helpful example of the solution. It says:

"To deal with this, we will allow one (long) example in our dataset to give several input features, each of length shorter than the maximum length of the model (or the one we set as a hyper-parameter). Also, just in case the answer lies at the point we split a long context, we allow some overlap between the features we generate controlled by the hyper-parameter doc_stride"
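The splitting described in the quote can be sketched in plain Python. This is an illustrative sketch, not the actual processor call: HuggingFace tokenizers do this internally when called with `truncation=True`, `return_overflowing_tokens=True`, and `stride=doc_stride`. The point is the window arithmetic: consecutive chunks share `doc_stride` tokens of overlap so that no entity is lost at a split boundary.

```python
def chunk_tokens(tokens, max_length=512, doc_stride=128):
    """Split a long token sequence into windows of at most `max_length`,
    with `doc_stride` tokens of overlap between consecutive windows."""
    if len(tokens) <= max_length:
        return [tokens]
    chunks = []
    step = max_length - doc_stride  # how far each new window advances
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # final window reached the end of the document
    return chunks

# A stand-in for a long tokenized Ex. 21 filing: 1000 token ids.
tokens = list(range(1000))
chunks = chunk_tokens(tokens, max_length=512, doc_stride=128)
# Three windows: [0, 512), [384, 896), [768, 1000); each adjacent
# pair shares 128 tokens, and no token is dropped.
```

Note that every chunk after the first repeats the last `doc_stride` tokens of the previous one, so at inference time predictions in the overlap region appear twice and need to be de-duplicated when results are stitched back together.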

Additionally, the offset_mapping returned by the tokenizer (requested via return_offsets_mapping=True on a fast tokenizer) records each token's character span, allowing one to map tokens back to the original text.
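To make the idea concrete, here is a minimal sketch of what an offset mapping is, using a whitespace tokenizer rather than the real LayoutLM processor: each token is paired with the `(start, end)` character span it was cut from, so a token-level prediction can always be traced back to the exact substring of the source document.

```python
import re

def tokenize_with_offsets(text):
    """Toy whitespace tokenizer that also returns character offsets,
    mimicking the shape of a fast tokenizer's offset_mapping."""
    tokens, offsets = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        offsets.append((match.start(), match.end()))
    return tokens, offsets

text = "Subsidiary Name State of Incorporation"
tokens, offsets = tokenize_with_offsets(text)
# Recover the original substring for any token via its offsets.
start, end = offsets[1]
assert text[start:end] == "Name"
```

In the real pipeline the same lookup lets us take a subword token the model labeled as part of a subsidiary name and recover its position in the raw filing text.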

Success Criteria

### Next steps
- [x] Follow steps in the linked notebook
- [x] Implement in training and inference steps (might just need to be changed in notebooks where the processor is actually created)
- [x] Experiment with overflow_to_sample_mapping parameter and then encoding.pop('overflow_to_sample_mapping')
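For the last step, a sketch of what overflow_to_sample_mapping represents may help: when each long document in a batch yields several chunks, this list records which original sample each chunk came from, so chunk-level predictions can be regrouped per document. The chunking below is illustrative pure Python; in the HuggingFace encoding the field is returned alongside the chunks and, since the model does not accept it as an input, is popped off before the forward pass (hence `encoding.pop('overflow_to_sample_mapping')`).

```python
from collections import defaultdict

def chunk_batch(docs, max_length=512, doc_stride=128):
    """Chunk every document in a batch, recording for each chunk the
    index of the document it came from."""
    chunks, overflow_to_sample_mapping = [], []
    step = max_length - doc_stride
    for sample_idx, tokens in enumerate(docs):
        start = 0
        while True:
            chunks.append(tokens[start:start + max_length])
            overflow_to_sample_mapping.append(sample_idx)
            if start + max_length >= len(tokens):
                break
            start += step
    return chunks, overflow_to_sample_mapping

# One short document (a single chunk) and one long one (three chunks).
docs = [list(range(300)), list(range(1000))]
chunks, mapping = chunk_batch(docs)  # mapping: [0, 1, 1, 1]

# Regroup chunk-level results by original document.
per_doc = defaultdict(list)
for chunk, sample_idx in zip(chunks, mapping):
    per_doc[sample_idx].append(chunk)
```

This regrouping step is what lets per-chunk LayoutLM predictions be merged back into a single extraction result per filing.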