Hi,
Thanks for your interest in my notebook. That's correct: each PDF is turned into a list of images, which then get tokenized. Idefics2 is, at the time of writing, one of the few open-source models that can handle examples with possibly multiple images, as it maps each image to a fixed number of tokens: either 64 or 320, depending on whether or not one uses do_image_splitting.
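As a quick illustration (a minimal sketch, not from the notebook; the blank page is just a stand-in for a rendered PDF page), you can toggle do_image_splitting on the image processor and count how many `<image>` tokens a single page occupies:

```python
from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

# a blank page stands in for a rendered PDF page
page = Image.new("RGB", (1000, 1400), "white")

for split in (False, True):
    # with splitting enabled, each image becomes 4 crops + the original,
    # i.e. 5 x 64 = 320 image tokens instead of 64
    processor.image_processor.do_image_splitting = split
    inputs = processor(text="<image>Extract JSON", images=[page], return_tensors="pt")
    image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
    num_image_tokens = (inputs.input_ids == image_token_id).sum().item()
    print(f"do_image_splitting={split}: {num_image_tokens} image tokens")
```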
The only thing you'd need to change in my notebook for PDF-to-JSON use cases is the collate function. It could look like this:
```python
from transformers import AutoProcessor
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image1 = Image.open(requests.get(url, stream=True).raw)
image2 = Image.open(requests.get(url, stream=True).raw)

# let's say we have 2 training examples (PDFs), each with a possibly varying number of images (pages)
example_1 = ([image1, image2], "this is the ground truth for example 1")
example_2 = ([image1], "this is the ground truth for example 2")

# we can prepare them for the model as follows
texts = []
images = []
for example in [example_1, example_2]:
    images_example, ground_truth = example

    # one text instruction followed by one image placeholder per page
    content = [{"type": "text", "text": "Extract JSON"}]
    content += [{"type": "image"} for _ in range(len(images_example))]

    # create the chat-formatted inputs
    messages = [
        {
            "role": "user",
            "content": content,
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": ground_truth},
            ],
        },
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=False)
    texts.append(prompt)
    images.append(images_example)

inputs = processor(text=texts, images=images, padding=True, return_tensors="pt")

for key, value in inputs.items():
    print(key, value.shape)
```
cc @zucchini-nlp to confirm this is the right way to do it
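To turn the snippet above into an actual collate function for training, you could wrap it and add labels. A hedged sketch (the -100 masking of padding follows standard causal-LM practice; depending on your setup you may also want to mask the image tokens):

```python
def collate_fn(examples):
    # each example is a (list_of_page_images, ground_truth_json_string) tuple
    texts = []
    images = []
    for images_example, ground_truth in examples:
        content = [{"type": "text", "text": "Extract JSON"}]
        content += [{"type": "image"} for _ in range(len(images_example))]
        messages = [
            {"role": "user", "content": content},
            {"role": "assistant", "content": [{"type": "text", "text": ground_truth}]},
        ]
        texts.append(processor.apply_chat_template(messages, add_generation_prompt=False))
        images.append(images_example)

    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")

    # standard causal-LM labels: ignore padding positions in the loss
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch
```

This can then be passed to a PyTorch DataLoader via collate_fn=collate_fn.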
The notebook has now been uploaded here: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_multi_page_PDF_question_answering_on_DUDE.ipynb. It requires an A100 to run.
Scaling this up (more pages per PDF, a bigger batch size, etc.) will probably require distributed training.
I'd like to express my appreciation for the excellent work!
The notebook title mentions "PDF to JSON", but on examining the content, it seems the notebook deals with image processing.
I have a question about the possibility of processing multiple images to generate a single JSON file.
For instance, in my fine-tuning setup I have a PDF containing textual data (text + tables). Because of its length, I intend to split it into images (one image per page), provide these images as input, and have the output be a single JSON file.
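For the splitting step itself, I was thinking of something like pdf2image (a rough sketch, assuming poppler is installed; the filename and target JSON are hypothetical):

```python
from pdf2image import convert_from_path

# render each page of the PDF to a PIL image (filename is hypothetical)
pages = convert_from_path("my_document.pdf", dpi=200)

# one training example: all page images plus a single target JSON string
example = (pages, '{"field": "value"}')
```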
Could you provide guidance or insights on achieving this PDF-to-JSON conversion within the notebook?