Hi,
Thanks for your interest in my notebook. That's correct: each PDF is turned into a list of images, which then get tokenized. Idefics2 is, at the time of writing, one of the few open-source models that can handle examples with possibly multiple images, as it maps each image to a fixed number of tokens: either 64 or 320, depending on whether or not one uses do_image_splitting.
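As a quick illustration (a minimal sketch, not from the notebook; the blank page is just a stand-in for a rendered PDF page), you can toggle do_image_splitting on the image processor and count how many `<image>` tokens a single page occupies:

```python
from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

# a blank page stands in for a rendered PDF page
page = Image.new("RGB", (1000, 1400), "white")

for split in (False, True):
    # with splitting enabled, each image becomes 4 crops + the original,
    # i.e. 5 x 64 = 320 image tokens instead of 64
    processor.image_processor.do_image_splitting = split
    inputs = processor(text="<image>Extract JSON", images=[page], return_tensors="pt")
    image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
    num_image_tokens = (inputs.input_ids == image_token_id).sum().item()
    print(f"do_image_splitting={split}: {num_image_tokens} image tokens")
```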
The only thing you'd need to change in my notebook for PDF-to-JSON use cases is the collate function. It could look like this:
```python
from transformers import AutoProcessor
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image1 = Image.open(requests.get(url, stream=True).raw)
image2 = Image.open(requests.get(url, stream=True).raw)

# let's say we have 2 training examples (PDFs), each with a possibly varying number of images (pages)
example_1 = ([image1, image2], "this is the ground truth for example 1")
example_2 = ([image1], "this is the ground truth for example 2")

# we can prepare them for the model as follows
texts = []
images = []
for example in [example_1, example_2]:
    images_example, ground_truth = example

    # one text instruction followed by one image placeholder per page
    content = [{"type": "text", "text": "Extract JSON"}]
    content += [{"type": "image"} for _ in range(len(images_example))]

    # create the chat-formatted inputs
    messages = [
        {
            "role": "user",
            "content": content,
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": ground_truth},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=False)
    texts.append(prompt)
    images.append(images_example)

inputs = processor(text=texts, images=images, padding=True, return_tensors="pt")

for key, value in inputs.items():
    print(key, value.shape)
```
cc @zucchini-nlp to confirm this is the right way to do it
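To turn the snippet above into an actual collate function for training, you could wrap it and add labels. A hedged sketch (the -100 masking of padding follows standard causal-LM practice; depending on your setup you may also want to mask the image tokens):

```python
def collate_fn(examples):
    # each example is a (list_of_page_images, ground_truth_json_string) tuple
    texts = []
    images = []
    for images_example, ground_truth in examples:
        content = [{"type": "text", "text": "Extract JSON"}]
        content += [{"type": "image"} for _ in range(len(images_example))]
        messages = [
            {"role": "user", "content": content},
            {"role": "assistant", "content": [{"type": "text", "text": ground_truth}]},
        ]
        texts.append(processor.apply_chat_template(messages, add_generation_prompt=False))
        images.append(images_example)

    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")

    # standard causal-LM labels: ignore padding positions in the loss
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch
```

This can then be passed to a PyTorch DataLoader via collate_fn=collate_fn.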
The notebook has now been uploaded here: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_multi_page_PDF_question_answering_on_DUDE.ipynb. It requires an A100 to run.
Scaling this up (more pages per PDF, a bigger batch size, etc.) will probably require distributed training.
I'd like to express my appreciation for the excellent work!
The notebook title mentions "PDF to JSON", but on examining the content, it seems the notebook deals with image processing.
I have a question about the possibility of processing multiple images to generate a single JSON file.
For instance, in my fine-tuning setup I have a PDF containing textual data (text + tables). Because of its length, I intend to split it into images (one image per page), provide these images as input, and have the output be a single JSON file.
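For the splitting step itself, I was thinking of something like pdf2image (a rough sketch, assuming poppler is installed; the filename and target JSON are hypothetical):

```python
from pdf2image import convert_from_path

# render each page of the PDF to a PIL image (filename is hypothetical)
pages = convert_from_path("my_document.pdf", dpi=200)

# one training example: all page images plus a single target JSON string
example = (pages, '{"field": "value"}')
```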
Could you provide guidance or insights on achieving this PDF-to-JSON conversion within the notebook?