aphp / edspdf

EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to classify text blocks as body or metadata, using rule-based or machine-learning-based approaches.
https://aphp.github.io/edspdf/
BSD 3-Clause "New" or "Revised" License

Huggingface multi-modal transformers #15

percevalw closed 1 year ago

percevalw commented 1 year ago

Description

This PR introduces a new HuggingfaceEmbedding component, which wraps Hugging Face models such as LayoutLM or LiLT. Compared with using the raw Hugging Face model, the wrapper provides a simple mechanism for splitting long documents into sliding windows before they are sent to the model. This matters for two reasons: the transformer accepts at most 512 tokens per input, and processing an entire document in one pass can be memory-intensive.
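To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper name `sliding_windows` is hypothetical, and it assumes that `window` and `stride` are both counted in tokens, as the parameter names in the example config suggest.

```python
from typing import List, Tuple


def sliding_windows(
    n_tokens: int, window: int = 128, stride: int = 64
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that cover a sequence of n_tokens.

    Hypothetical helper for illustration only; the actual component may
    split documents differently.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        # A stride larger than the window would leave tokens uncovered.
        raise ValueError("stride must not exceed window")
    if n_tokens <= 0:
        return []

    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


# A 300-token document with window=128 and stride=64 yields four
# overlapping spans, each short enough for a 512-token model.
print(sliding_windows(300, window=128, stride=64))
# [(0, 128), (64, 192), (128, 256), (192, 300)]
```

With a stride smaller than the window, consecutive spans overlap, so most tokens are seen with context on both sides. How the overlapping outputs are merged back into a single embedding per token is left out of this sketch.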

Example

Here is an example of how to define a pipeline with the HuggingfaceEmbedding component:

from edspdf import Pipeline

pipeline = Pipeline()
pipeline.add_pipe("pdfminer-extractor", name="extractor")
pipeline.add_pipe(
    "huggingface-embedding",
    name="embedding",
    config={
        "model": "microsoft/layoutlmv3-base",
        "use_image": False,
        "window": 128,
        "stride": 64,
        "line_pooling": "mean",
    },
)
# Attach the classifier to the same pipeline and reuse the embedding above
pipeline.add_pipe(
    "trainable-classifier",
    name="classifier",
    config={
        "embedding": pipeline.get_pipe("embedding"),
        "labels": [],
    },
)

This model can then be trained following the training recipe.

codecov[bot] commented 1 year ago

Codecov Report

No coverage uploaded for pull request base (main@9ca8fd0). Patch has no changes to coverable lines.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main      #15   +/-   ##
=======================================
  Coverage        ?   94.78%
=======================================
  Files           ?       32
  Lines           ?     1974
  Branches        ?        0
=======================================
  Hits            ?     1871
  Misses          ?      103
  Partials        ?        0
```
