LinWeizheDragon / FLMR

The Hugging Face implementation of Fine-grained Late-interaction Multi-modal Retriever.

For a custom document, how can I support the input of multiple images? #14

Closed yuejunpeng closed 4 months ago

yuejunpeng commented 4 months ago

I followed the instructions for custom documents in the README:

```python
import random

# Create document collections
num_items = 100
feature_dim = 1664

# query_memories, query_scene_memory and memory_items are lists of dicts defined elsewhere
document_items = []
document_items.extend(query_memories)
document_items.extend(random.sample(query_scene_memory, 15))
document_items.extend(random.sample(memory_items, 80))
assert len(document_items) == num_items

passage_contents = []
image_paths = []
for i, document_item in enumerate(document_items):
    passage_contents.append(
        f"Instruction: {document_item['subtaskInstruction']}, "
        f"captioning: {document_item['detail']['beginCaption']}"
    )
    image_paths.append(document_item['detail']['beginPath'])
    print(f"{i}: {document_item}")

# Each entry is (passage text, None, path to a single image)
custom_collection = [
    (passage_content, None, image_path)
    for passage_content, image_path in zip(passage_contents, image_paths)
]
```

However, support for multiple images per document would suit my use case better. Each document item consists of text content and multiple images, while each query item consists of text content and one image. Is this possible? If so, how should the code be modified? Thank you sincerely!

LinWeizheDragon commented 4 months ago

There are two approaches that can address this case:

  1. Modify the model file to support multiple images per document. Specifically, you will need to modify the `.doc(..)` function of `FLMRModelForRetrieval` so that `pixel_values` can hold multiple images per document, and modify the corresponding function in `FLMRModelForIndexing` so that it passes multiple images per document to the model.
  2. Alternatively, keep one image per document. If a document has 5 images, split it into 5 documents that share the same text and each have one image. This increases the corpus size but requires no changes to the model code.
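
The second approach can be sketched roughly as follows. This is a minimal illustration, not code from the FLMR repository: the helper names `split_multi_image_documents` and `merge_scores_by_parent` are hypothetical, the input is assumed to be a list of `(text, image_paths)` pairs, and taking the maximum score over a document's images is just one possible way to merge results back to the original documents. Documents with an empty image list produce no entries in this sketch.

```python
def split_multi_image_documents(documents):
    """Expand each (text, image_paths) document into one-image entries.

    Returns the collection in the (passage_content, None, image_path) layout
    used earlier in this thread, plus a list mapping each entry back to the
    index of the document it came from. (Hypothetical helper.)
    """
    collection, parent_ids = [], []
    for doc_id, (text, image_paths) in enumerate(documents):
        for image_path in image_paths:
            collection.append((text, None, image_path))
            parent_ids.append(doc_id)
    return collection, parent_ids


def merge_scores_by_parent(entry_ids, scores, parent_ids):
    """Collapse per-entry scores into per-document scores.

    Assumption: a document's score is the best score among its images.
    Returns (doc_id, score) pairs sorted from best to worst.
    """
    best = {}
    for entry_id, score in zip(entry_ids, scores):
        doc_id = parent_ids[entry_id]
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: the first document has two images, the second has one.
documents = [
    ("Instruction: open the drawer", ["a_0.png", "a_1.png"]),
    ("Instruction: pick up the cup", ["b_0.png"]),
]
collection, parent_ids = split_multi_image_documents(documents)

# Pretend the retriever returned these entry indices and scores for a query.
ranked = merge_scores_by_parent([0, 1, 2], [0.2, 0.9, 0.5], parent_ids)
```

The resulting `collection` can be indexed in the same way as `custom_collection` above, and `parent_ids` is only needed after retrieval to report results per original document rather than per image.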

You can choose based on this trade-off. Note that the model would ideally be fine-tuned on Image+Text -> Image+Text retrieval; using it for this setting without such fine-tuning may be sub-optimal.

We plan to release a finetuning script very soon.