climatepolicyradar / navigator

Policy navigator
BSD 3-Clause "New" or "Revised" License

Understand the breakdown of layouts in the PDF corpus #17

Closed. kdutia closed this issue 2 years ago

kdutia commented 2 years ago

How does the corpus break down in terms of:

--

kdutia commented 2 years ago

Clustering embeddings from VGG16 (a general-purpose image model) hasn't been successful: the clusters produced by both k-means and Gaussian mixture models fail to separate layouts, even at a column level.

See notebook here.
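For readers without access to the notebook, here is a minimal sketch of this kind of clustering experiment. It is not the notebook's code: the `cluster_embeddings` helper, the PCA step, the cluster count and the random stand-in array are all assumptions made for illustration. In practice the `embeddings` array would hold one VGG16 feature vector per page image.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def cluster_embeddings(embeddings, n_clusters=4, n_components=50, seed=0):
    """Cluster page embeddings with k-means and a Gaussian mixture model.

    Hypothetical helper: the parameter values are illustrative defaults,
    not settings taken from the original notebook.
    """
    # PCA cannot keep more components than min(n_samples, n_features).
    n_components = min(n_components, *embeddings.shape)
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)

    kmeans_labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(reduced)
    gmm_labels = GaussianMixture(
        n_components=n_clusters, covariance_type="diag", random_state=seed
    ).fit_predict(reduced)
    return reduced, kmeans_labels, gmm_labels


def safe_silhouette(points, labels):
    """Silhouette score, or NaN when fewer than two clusters were found."""
    if len(set(labels)) < 2:
        return float("nan")
    return silhouette_score(points, labels)


# Stand-in data (assumption): 200 pages with 512-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 512))

reduced, kmeans_labels, gmm_labels = cluster_embeddings(embeddings)
print("k-means silhouette:", safe_silhouette(reduced, kmeans_labels))
print("GMM silhouette:", safe_silhouette(reduced, gmm_labels))
```

A silhouette score near zero means the clusters overlap heavily. The score only measures geometric separation, so checking whether clusters correspond to layouts still requires inspecting sample pages from each cluster.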

I can't find a fine-tuned document image classification model anywhere, so my next step is to try clustering embeddings from LayoutLMv2, which is pre-trained on documents and whose embeddings include positional information for the text, as follows: