ggerganov / llama.cpp

LLM inference in C/C++
MIT License
67.29k stars 9.67k forks

Request: Nougat OCR Integration #3294

Open OhadRubin opened 1 year ago

OhadRubin commented 1 year ago

Request: Nougat OCR Integration

I suggest adding Nougat OCR into llama.cpp to enable the processing of scientific PDF documents. This can act as a first step towards adding multimodal models to this project!

Implementation: Nougat appears to be built on standard transformer components (an mBART-style decoder and a Swin Transformer encoder), so most of the work would be figuring out how to add the image processing.

Let me know what you think! P.S.: Love this repo! I hope to add my own retrieval-pretrained transformer to it at some point.

goerch commented 1 year ago

As soon as @ggerganov tackles multi-modal (not sure, maybe he already did), I'm interested. For now it's not in project scope, methinks.

ggerganov commented 1 year ago

I recently learned about this model and I am very interested in adding support for it. Not sure if llama.cpp would be the best place to do so.

It's likely to remain low priority for the near future, but if there is a community effort, I'll be happy to support it.

kairan77 commented 1 year ago

Impressive results with English papers and Ebooks. Some preliminary findings on the nougat project.

Question 1: @ggerganov I have pruned out everything apart from the inner loop (mbart + lm_head), so the exposed API takes the tensor output of the Swin model and yields token_ids without post-processing. If this part were rewritten in C++ to run on a 5- or 6-bit quantized model, do you think, based on what we have seen so far, that the inner loop runtime could be halved? What is your best guess on the speed gain, if any?

Question 2: Any pointers or code skeletons you can provide to get this going?
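For anyone following along, here is a rough sketch of what the inner loop described above could look like. It assumes plain greedy decoding, which this thread does not confirm, and every name in it (`greedy_decode`, `decoder_step`, `toy_decoder_step`) is hypothetical. The toy step function only stands in for mbart + lm_head so the loop can be run; it is not Nougat's code.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: given the encoder output and the token ids produced
# so far, return one logit per vocabulary entry for the next token.
DecoderStep = Callable[[Sequence[Sequence[float]], List[int]], Sequence[float]]


def greedy_decode(
    encoder_states: Sequence[Sequence[float]],
    decoder_step: DecoderStep,
    bos_id: int,
    eos_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Turn encoder output into token ids, with no post-processing."""
    token_ids = [bos_id]
    for _ in range(max_new_tokens):
        logits = decoder_step(encoder_states, token_ids)
        # Greedy choice: index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        token_ids.append(next_id)
        if next_id == eos_id:
            break
    return token_ids


def toy_decoder_step(encoder_states, token_ids):
    """Stand-in for mbart + lm_head (assumption): predicts last id + 1."""
    vocab_size = 5
    target = min(token_ids[-1] + 1, vocab_size - 1)
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    # A single dummy row stands in for the Swin encoder output.
    print(greedy_decode([[0.0]], toy_decoder_step, bos_id=0, eos_id=4))
```

In a loop shaped like this, `decoder_step` is the only call that touches the model weights, so it is the piece a quantized C++ port would replace. Whether that halves the runtime is still the open question.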

kairan77 commented 1 year ago

Btw, the encoding layers of the small and base Nougat models use exactly the same Swin model; the two models differ only in the underlying decoding layers (mbart).

Also, imho, everything before and after the inner loop is not worth rewriting in C, since it takes practically no time to run.

OhadRubin commented 10 months ago

Any updates?

jpvelsamy commented 7 months ago

It would be great to have the OCR integrated into the mix. Any updates on this would be awesome.

OriginalGoku commented 4 months ago

So I assume this is still not implemented?