Understand the breakdown of layouts in the PDF corpus

Clustering VGG16 embeddings (from a general-purpose image model) hasn't been successful: the clusters produced by both k-means and Gaussian mixture models don't separate layouts, even at the column level.

See notebook here.
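
For anyone wanting to reproduce the comparison, here is a minimal sketch of clustering page embeddings with both methods using scikit-learn. It is not the notebook's code: the `cluster_embeddings` helper, the cluster count, the diagonal covariance setting and the random stand-in data are all assumptions for illustration, and real VGG16 embeddings would replace `fake_embeddings`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


def cluster_embeddings(embeddings, n_clusters, seed=0):
    """Cluster an (n_pages, dim) array with k-means and a Gaussian mixture.

    Hypothetical helper; returns (kmeans_labels, gmm_labels).
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    # Diagonal covariances keep the mixture cheap in high dimensions (an assumption).
    gmm = GaussianMixture(
        n_components=n_clusters, covariance_type="diag", random_state=seed
    )
    return kmeans.fit_predict(embeddings), gmm.fit_predict(embeddings)


# Stand-in data: random vectors in place of real per-page VGG16 embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 64))

kmeans_labels, gmm_labels = cluster_embeddings(fake_embeddings, n_clusters=4)
```

Inspecting a sample of pages from each cluster label is then enough to judge whether single-column and multi-column pages end up apart.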

I can't find a fine-tuned document image classification model anywhere, so my next step is to try clustering embeddings from LayoutLMv2, which is pre-trained on documents and includes positional text embeddings. I'll try two variants:

visual embeddings only: the average-pooled initial visual embeddings concatenated with the average-pooled final visual embeddings
visual embeddings plus positional LM embeddings: the concatenation of the final hidden state of the [CLS] token, the average-pooled initial visual embeddings and the average-pooled final visual embeddings. If this proves slow, I'll fall back to LayoutLMv2 with just the visual embeddings.
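
The two variants above could be assembled roughly as follows. This is a sketch of the pooling and concatenation only: it does not call LayoutLMv2, and the `page_embedding` helper, the hidden size of 768 and the 49 visual tokens are assumptions, with random arrays standing in for the model's outputs.

```python
import numpy as np


def page_embedding(cls_final, visual_initial, visual_final, use_text=True):
    """Build one page-level vector from LayoutLMv2-style outputs (hypothetical helper).

    cls_final:      (hidden,) final hidden state of the [CLS] token
    visual_initial: (n_visual, hidden) visual embeddings before the transformer
    visual_final:   (n_visual, hidden) visual token states after the transformer
    """
    # Average-pool over the visual tokens to get one vector per stage.
    parts = [visual_initial.mean(axis=0), visual_final.mean(axis=0)]
    if use_text:
        # Variant 2: prepend the [CLS] state, which carries text and position signal.
        parts.insert(0, cls_final)
    return np.concatenate(parts)


# Stand-in arrays; the sizes are assumed, not read from a real model.
rng = np.random.default_rng(0)
hidden, n_visual = 768, 49
cls_final = rng.normal(size=hidden)
visual_initial = rng.normal(size=(n_visual, hidden))
visual_final = rng.normal(size=(n_visual, hidden))

visual_only = page_embedding(cls_final, visual_initial, visual_final, use_text=False)
visual_plus_text = page_embedding(cls_final, visual_initial, visual_final)
```

With these assumed sizes the visual-only vector has 1536 dimensions and the combined vector has 2304, so both remain small enough to cluster directly.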

climatepolicyradar / navigator #17