WasmEdge / WasmEdge

WasmEdge is a lightweight, high-performance, and extensible WebAssembly runtime for cloud native, edge, and decentralized applications. It powers serverless apps, embedded functions, microservices, smart contracts, and IoT devices.
https://WasmEdge.org
Apache License 2.0

feat : Support Document AI in WasmEdge #2356

Open sarrah-basta opened 1 year ago

sarrah-basta commented 1 year ago

Motivation

The Hugging Face Hub provides a platform hosting a collection of pre-trained models, datasets, and demos of machine learning projects. A blog post by Hugging Face gives a concise overview of the available state-of-the-art (SOTA) models for Document AI, which covers tasks ranging from Optical Character Recognition (OCR) and Document Image Classification to Document Layout Analysis, Document Parsing, and Document Visual Question Answering. WasmEdge would like to enable easy integration of these Document AI tasks into WasmEdge applications by creating the necessary pre- and post-processing functions in Rust and using the fine-tuned models available on the Hugging Face Model Hub.

Details

Document AI tasks use multimodal models, i.e. models that unify document text (via OCR), layout (via tokens), and visual information (spatial information from the image) in a single end-to-end framework that can learn cross-modal interactions. Each Document AI task has a description page that describes its expected output and the datasets for the task. The corresponding models fine-tuned on these datasets are available in PyTorch format, which is supported by the WASI-NN plugin.

This project aims to

  1. Make a set of generalized pre-processing and post-processing functions that could be used for all Document AI tasks (such as tokenizers and feature extractors) and
  2. Create inference functions for each Document AI task using the said pre- and post-processing functions and fine-tuned models from the hub in the backend. Each library function takes in a media object and returns the inference result. The inference function performs the following tasks.
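The two aims above can be sketched as a pair of small abstractions. This is a hypothetical sketch only: the trait names (`Preprocessor`, `Postprocessor`, `ArgmaxPost`) are illustrative, not the actual API proposed in this issue.

```rust
// Hypothetical sketch of how generalized pre-/post-processing pieces
// could compose into per-task inference functions. All names here are
// illustrative, not the project's actual API.

/// A generalized pre-processing step: turns raw media into model inputs.
trait Preprocessor {
    type Input;
    type Features;
    fn process(&self, input: &Self::Input) -> Self::Features;
}

/// A generalized post-processing step: turns raw model outputs into results.
trait Postprocessor {
    type RawOutput;
    type Result;
    fn process(&self, output: &Self::RawOutput) -> Self::Result;
}

/// Toy example: an argmax post-processor, standing in for the
/// "predicted class with maximum score" step of classification tasks.
struct ArgmaxPost;
impl Postprocessor for ArgmaxPost {
    type RawOutput = Vec<f32>;
    type Result = usize;
    fn process(&self, output: &Self::RawOutput) -> usize {
        output
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap_or(0)
    }
}

fn main() {
    let post = ArgmaxPost;
    let scores = vec![0.1_f32, 0.7, 0.2];
    println!("{}", post.process(&scores)); // prints 1
}
```

Each task-specific inference function would then be a composition of one or more such pre-processing steps, a WASI-NN model call, and a post-processing step.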

Milestones

sarrah-basta commented 1 year ago

Document AI Tasks

General Pre-requisites and Common Pre-processing Functions

  1. OCR Engine - Google's Tesseract The text in the image needs to be extracted and understood (as words plus bounding boxes indicating where those words occur) for all succeeding models, except for OCR-free Document Understanding Transformers such as Donut. While other options exist, importing libtesseract (its C API) into WasmEdge and calling it from Rust will be the most beneficial, as the downstream models are fine-tuned using the same engine.

  2. Tokenization - Hugging Face's Tokenizers Library The words and bounding boxes obtained from the OCR engine need to be passed to a tokenizer for further processing. The aim is to replicate the PreTrainedTokenizerFast class in Hugging Face, which is based on their tokenizers library written in Rust.
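The shape of that step can be illustrated with a deliberately toy sketch: a vocabulary lookup that wraps the OCR words in special tokens. The vocabulary and special-token ids below are made up for illustration; the real implementation would load them from a tokenizer.json via Hugging Face's tokenizers crate.

```rust
use std::collections::HashMap;

// Toy sketch of the word -> token-id mapping a real tokenizer performs.
// Vocabulary and the [CLS]/[SEP]/[UNK] ids here are illustrative only.
fn encode(words: &[&str], vocab: &HashMap<&str, u32>, unk_id: u32) -> Vec<u32> {
    let cls_id = 101; // hypothetical [CLS] id
    let sep_id = 102; // hypothetical [SEP] id
    let mut ids = vec![cls_id];
    for w in words {
        // unknown words fall back to the [UNK] id
        ids.push(*vocab.get(w).unwrap_or(&unk_id));
    }
    ids.push(sep_id);
    ids
}

fn main() {
    let vocab: HashMap<&str, u32> =
        [("invoice", 1000), ("total", 1001)].into_iter().collect();
    let ids = encode(&["invoice", "total", "xyz"], &vocab, 100);
    println!("{:?}", ids); // [101, 1000, 1001, 100, 102]
}
```

The real tokenizer additionally handles subword splitting and must keep each produced token aligned with the bounding box of the word it came from, which is the part that matters for layout-aware models.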

  3. General Pre- and Post-Processing Functions Include all functions required to prepare input features for vision models and post-process their outputs, based on the Image Processor and Feature Extractor abstractions in Hugging Face.
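As one concrete example of the kind of work such an image processor does, here is a minimal nearest-neighbour resize over a row-major grayscale buffer. This is an illustrative sketch, not the Hugging Face implementation (which also handles multi-channel images and interpolation modes):

```rust
// Illustrative sketch of an ImageProcessor-style resize step:
// nearest-neighbour resize of a grayscale image stored row-major.
fn resize_nearest(src: &[u8], sw: usize, sh: usize, dw: usize, dh: usize) -> Vec<u8> {
    let mut dst = vec![0u8; dw * dh];
    for y in 0..dh {
        for x in 0..dw {
            // map each destination pixel back to its nearest source pixel
            let sx = x * sw / dw;
            let sy = y * sh / dh;
            dst[y * dw + x] = src[sy * sw + sx];
        }
    }
    dst
}

fn main() {
    // 2x2 image upscaled to 4x4: each source pixel becomes a 2x2 block.
    let src = vec![10u8, 20, 30, 40];
    let dst = resize_nearest(&src, 2, 2, 4, 4);
    println!("{:?}", dst);
}
```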

Document AI Tasks and Model Selection

Document AI multimodal models pre-train text, layout and image in a multi-modal framework using large-scale unlabeled scanned/digital-born documents. These models are then used in visually-rich downstream document understanding tasks by fine-tuning them on the task-respective labeled benchmark dataset. The following table outlines the tasks, datasets and corresponding models to be supported in this project.

| Document AI Task | Benchmark Dataset | Required Inputs | Model | Expected Outputs | Reference |
|---|---|---|---|---|---|
| Optical Character Recognition | - | Image | Host function to Tesseract OCR C API | Words, bounding boxes, and optional tokenization | - |
| Document Image Classification | RVL-CDIP | pixel_values (obtained from ImageProcessor and FeatureExtractor) | microsoft/dit-base-finetuned-rvlcdip | Predicted class with maximum score | Notebook 1 and Notebook 2 |
| Document Layout Analysis | PubLayNet | To be tested | nielsr/dit-document-layout-analysis | Set of segmentation masks/bounding boxes, along with class names and scores | Spaces and Python Script |
| Document Parsing | FUNSD | input_ids, token_type_ids, attention_mask, bbox, labels, image (obtained from OCR Engine, Tokenizer and ImageProcessor) | nielsr/layoutlmv2-finetuned-funsd | A set of tokenized sequences and corresponding bounding boxes | Notebook |
| Table Detection and Extraction | PubTables-1M | pixel_values (obtained from ImageProcessor) | https://huggingface.co/microsoft/table-transformer-detection | A struct containing bounding box and corresponding confidence for each table detected | Notebook |
| Document Visual Question Answering | DocVQA | input_ids, token_type_ids, attention_mask, bbox, labels, image and the question (obtained from OCR Engine, Tokenizer and ImageProcessor) | layoutlmv3-base-mpdocvqa | A single answer to the asked question | Notebook |

Discussion Topics

  1. I have tried to select the best SOTA models with fine-tuned weights available for each task. LayoutLMv3, based on the earlier LayoutLM architectures but more efficient, is one such model. However, a few places warn about its licensing issues, which I looked up in this thread. Can anyone clarify whether it would be okay to use it for this project?
  2. In some places the Hugging Face processing functions provide batch support. However, since WASI-NN is only to be used for inference and not training, is it okay to skip batch support and just pass one object at a time to the functions?
sarrah-basta commented 1 year ago

Week 1-2 Progress Update

In order to achieve the first main goal of integrating Document AI, I compiled a Rust wrapper for Tesseract to WebAssembly, running on WasmEdge, to perform OCR on images.

Thus, Tesseract OCR can now be compiled to Wasm from the Rust code; the rough test code for this is uploaded at https://github.com/sarrah-basta/wasmedge_ai_testing/blob/main/rusty-tesseract-wasm/README.md#build-instructions-to-build-the-wrapper .

Week 2-3 Plan

juntao commented 1 year ago

Thank you!

sarrah-basta commented 1 year ago

Week 3 Progress Update

In order to create the next main pre-processing block, I worked on creating a tokenizer using the Rust Tokenizers core, compiled to WebAssembly, to tokenize the text produced by OCR.

I was able to create this Rust code, compiled with WasmEdge, to solve the first two parts, i.e. create the correct tokenizer and obtain the encodings.

Note: I tested this using the words obtained via OCR from the Hugging Face ImageProcessor; this will later be replaced by the Wasm implementation of Tesseract created earlier. I compared the tokens obtained from my Rust implementation with those from the Hugging Face Python classes here, and they mostly look correct.

Week 4 Plan

juntao commented 1 year ago

Thank you so much for the update! I just want to clarify that you have created no additional host functions / plugins. You got the entire OCR program working inside WasmEdge (Rust compiled into Wasm). Is that correct? Thanks!

sarrah-basta commented 1 year ago

I just want to clarify that you have created no additional host functions / plugins

Yes @juntao, that's correct. While I originally thought this would require leveraging the C API of Tesseract, Tesseract also offers command-line functionality that can be used by simply installing the pre-built binaries, so I decided to leverage that instead.

entire OCR program working inside WasmEdge (Rust compiled into Wasm)

Hence, yes, the entire program now works inside WasmEdge. I did, however, have to make use of a plugin, wasmedge_process_interface, to be able to use the command-line functionality of the native operating system (the one WasmEdge is running on) while the user's Wasm is being executed on WasmEdge.
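Independent of which mechanism performs the spawn, the command line being invoked has the same shape pytesseract builds natively: `tesseract <input> stdout tsv`. A small sketch of building those arguments (the actual spawn via wasmedge_process_interface is omitted here):

```rust
// Sketch of the tesseract CLI arguments this approach shells out to.
// In the Wasm module the spawn goes through the wasmedge_process_interface
// plugin rather than std::process, but the argument shape is the same.
fn tesseract_args(input: &str) -> Vec<String> {
    vec![
        input.to_string(),
        "stdout".to_string(), // write results to stdout instead of a file
        "tsv".to_string(),    // TSV output: words plus bounding boxes
    ]
}

fn main() {
    println!("{}", tesseract_args("invoice.png").join(" ")); // invoice.png stdout tsv
}
```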

Hope this clears up the need and the functioning, thank you!

P.S. pytesseract, the Python wrapper for Tesseract used in most of the AI applications I am referring to, also uses an identical approach.

sarrah-basta commented 1 year ago

Week 4 & 5 Progress Update

In order to create the modular end-to-end pipeline of LayoutLMv2ModelForTokenClassification (currently using the temporary CLI-based method for Tesseract OCR), I created the following preprocessing functions to get the inputs required by the model in the correct tensor formats.

Next, I obtained the required model with the fine-tuned weights and traced it in Python to convert it to TorchScript in the function infer_layout_lmv2. I communicated with the mentors throughout this, and while I was able to solve most of the issues I was facing, I am still hitting the following errors, so inference in the end-to-end pipeline remains pending.

The code for these preprocessing functions is at https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/layoutlmv2_model .

Errors currently facing

[2023-04-16 20:59:08.292] [error] [WASI-NN] Only F32 inputs and outputs are supported for now.
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NnErrno { code: 1, name: "INVALID_ARGUMENT", message: "" }', src/main.rs:107:9
stack backtrace:
[2023-04-16 20:59:08.293] [error] execution failed: unreachable, Code: 0x89
[2023-04-16 20:59:08.293] [error]     In instruction: unreachable (0x00) , Bytecode offset: 0x001a8568
[2023-04-16 20:59:08.293] [error]     When executing function name: "_start"

This error is caused by this check in the plugins/wasi-nn source code, even though all the tensors I create for the inputs are of the correct types. What I believe is happening is that the model only accepts inputs in the form of PyTorch LongTensors, i.e. integer tensors, and hence F32 FloatTensors won't work here. I tried various things while tracing the model, such as converting inputs to FloatTensors before tracing, and noticed that the arguments image and attention_mask work with any datatype, whereas input_ids, bbox and token_type_ids expect integer values only.

Possible solutions

I am currently a little stuck and would appreciate some guidance on how to approach this further. Is there some reason for supporting only F32 tensors in the WASI-NN plugin for the PyTorch backend, and if so, is there any way to change the expectations of the TorchScript or PyTorch model? Hopefully @juntao can give some insight.
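One possible stopgap (not what the project settled on, just a sketch) is to ship the integer ids as f32 values and cast back, which is lossless as long as every id fits in f32's 24-bit mantissa, i.e. |id| < 2^24, which comfortably covers typical vocabulary sizes:

```rust
// Sketch of a workaround for an F32-only tensor interface: encode i64
// token ids as f32 (exact for |id| < 2^24), serialize to little-endian
// bytes, and decode back on the other side.
fn ids_to_f32_bytes(ids: &[i64]) -> Vec<u8> {
    ids.iter()
        .flat_map(|&id| (id as f32).to_le_bytes())
        .collect()
}

fn f32_bytes_to_ids(bytes: &[u8]) -> Vec<i64> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]) as i64)
        .collect()
}

fn main() {
    let ids: Vec<i64> = vec![101, 2054, 2003, 102];
    let bytes = ids_to_f32_bytes(&ids);
    assert_eq!(f32_bytes_to_ids(&bytes), ids);
    println!("round-trip ok ({} bytes)", bytes.len()); // 16 bytes
}
```

This only helps if the model itself is re-traced to accept float inputs and cast internally; otherwise the plugin-side type check is the right place to fix it, as discussed below.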

Week 6 Plan

Hence, I have been working with the C API to get identical results, and while I wait for some guidance on the above issue, I will go ahead with creating a host function with the Rust plugin SDK to call functions, after registering the Tesseract C API as a WasmEdge plugin (similar to https://github.com/WasmEdge/WasmEdge/blob/master/examples/plugin/get-string/getstring.cpp ).

juntao commented 1 year ago

@apepkuss and @q82419 Can you comment on the issue about WASI-NN only accepting f32-typed tensors? Thanks.

apepkuss commented 1 year ago

@q82419 According to the investigation by @sarrah-basta, the wasi-nn plugin has a type check between lines 623 and 627. Could you please help fix the issue? Thanks a lot!

sarrah-basta commented 1 year ago

Week 6 & 7 Progress Update

In order to create an OCR solution using the Tesseract API, avoiding the CLI dependency of the command-line plugin that breaks the Wasm sandbox in very unpredictable ways, I created:

  1. A host function in the WasmEdge C++ SDK that uses the Tesseract C++ API, registered as a WasmEdge plugin - Wasi-OCR -> https://github.com/sarrah-basta/WasmEdge/tree/wasi_ocr/plugins/wasi_ocr
  2. A Rust library crate to utilize the functions in the plugin, which takes images as input and returns a struct of Data containing all useful information extracted by the OCR engine - wasi-ocr -> https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/wasi-ocr

The basic flow of the created code is as follows:

The Rust Library contains

This length is then passed to another plugin function, wasi_ephemeral_ocr::get_output(output_buf: *mut c_char, output_buf_max_size: u32) -> u32, which stores the output obtained via the char *TessBaseAPI::GetTSVText API function in the output buffer. The pointer to the buffer is decoded as a &CStr object and then converted to the appropriate String format in Rust, taking care to transfer ownership to Rust so as not to run into encoding problems.
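The two-call shape of that guest/host exchange (ask for the length, then pass a buffer of that size to be filled) can be simulated in pure Rust. The host side here is faked with an in-memory string; in the real plugin it is filled from TessBaseAPI::GetTSVText:

```rust
// Pure-Rust simulation of the two-call buffer pattern: the guest first
// asks the host for the output length, then passes a buffer of that size
// for the host to fill. The "host" here is faked with a static string.
fn host_output_len(host_data: &str) -> u32 {
    host_data.len() as u32
}

fn host_get_output(host_data: &str, buf: &mut [u8]) -> u32 {
    // copy at most buf.len() bytes, mirroring output_buf_max_size
    let n = host_data.len().min(buf.len());
    buf[..n].copy_from_slice(&host_data.as_bytes()[..n]);
    n as u32
}

fn main() {
    let host_data = "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\thello";
    let len = host_output_len(host_data) as usize;
    let mut buf = vec![0u8; len];
    let written = host_get_output(host_data, &mut buf) as usize;
    let out = String::from_utf8(buf[..written].to_vec()).unwrap();
    assert_eq!(out, host_data);
    println!("copied {} bytes", written);
}
```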

This String is then parsed, and each detection made is fed into a Data struct containing the following fields:

pub struct Data {
    pub level: i32,     // hierarchy level of the detection (page/block/para/line/word)
    pub page_num: i32,
    pub block_num: i32,
    pub par_num: i32,
    pub line_num: i32,
    pub word_num: i32,
    pub left: i32,      // bounding box: x of the top-left corner
    pub top: i32,       // bounding box: y of the top-left corner
    pub width: i32,
    pub height: i32,
    pub conf: f32,      // confidence score for the detection
    pub text: String,   // the recognized text itself
}

A vector of such structs is returned by the public image_to_data function of the library which can be used in any downstream tasks.
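The parsing step itself can be sketched as a per-line split on tabs, assuming Tesseract's standard 12-column TSV layout. This is an illustrative sketch, not the crate's actual code (the struct is re-declared here only so the snippet is self-contained):

```rust
// Sketch of parsing one line of Tesseract's TSV output into the Data
// struct. Assumes the standard 12-column layout (level..height, conf,
// text); malformed lines yield None.
#[derive(Debug, PartialEq)]
pub struct Data {
    pub level: i32, pub page_num: i32, pub block_num: i32, pub par_num: i32,
    pub line_num: i32, pub word_num: i32, pub left: i32, pub top: i32,
    pub width: i32, pub height: i32, pub conf: f32, pub text: String,
}

fn parse_tsv_line(line: &str) -> Option<Data> {
    let f: Vec<&str> = line.split('\t').collect();
    if f.len() < 12 {
        return None; // not a full detection row (e.g. the header line)
    }
    Some(Data {
        level: f[0].parse().ok()?, page_num: f[1].parse().ok()?,
        block_num: f[2].parse().ok()?, par_num: f[3].parse().ok()?,
        line_num: f[4].parse().ok()?, word_num: f[5].parse().ok()?,
        left: f[6].parse().ok()?, top: f[7].parse().ok()?,
        width: f[8].parse().ok()?, height: f[9].parse().ok()?,
        conf: f[10].parse().ok()?, text: f[11].to_string(),
    })
}

fn main() {
    let line = "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96.5\thello";
    let d = parse_tsv_line(line).unwrap();
    println!("{} ({}x{} at {},{}) conf={}", d.text, d.width, d.height, d.left, d.top, d.conf);
}
```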

I will be using it in the layoutlmv2 model created earlier.

The WasmEdge Plugin Wasi-OCR contains the two plugin functions described above and the necessary functions to register it as a module in the following file structure

The Tesseract API handle is created when the plugin environment is created, and destroyed at the end of each image_to_data call.

Dependencies and Install Instructions for the plugin

The Tesseract API has two dependencies, which can be installed as follows:

sudo apt install tesseract-ocr
sudo apt-get install libleptonica-dev

More detailed instructions can be found at https://tesseract-ocr.github.io/tessdoc/Installation.html but only the above two libraries are necessary.

They are then linked via the CMakeLists.

Building WasmEdge with the plugin

Week 8 Plan

Since the concern about the CLI dependency is now solved, this week I can focus on

These two models and the preprocessing functions already created will be used for 4 different Document AI tasks outlined in the first comment of this issue. Once the inferencing is (hopefully) successfully done, the last 4 weeks should be spent creating the post-processing functions and packaging the code written.

sarrah-basta commented 1 year ago

Week 8 Progress Update

  1. Created the end-to-end pipeline of LayoutLMv2ModelForTokenClassification using the Wasi-OCR plugin to obtain results from Tesseract from within WasmEdge, with preprocessing functions to get the inputs required by the layoutlmv2 model in the correct tensor formats. The code and a detailed description of how it works can be found at https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/layoutlmv2_with_wasi_ocr/README.md

    • Dependencies and install instructions for the plugin: WasmEdge needs to be built with both the Wasi-OCR and Wasi-NN (with PyTorch backend) plugins.
  2. Investigated further the error caused by this check in the plugins/wasi-nn source code.

    • Made some changes to the source code, and raised an issue with a simpler reproducible example, along with a working example directly in the TorchScript API, at https://github.com/WasmEdge/WasmEdge/issues/2483 .
    • Further details about this error and the work I have done on it so far can be found at this link .

Week 9 Plan

These two models and the preprocessing functions already created will be used for 4 different Document AI tasks outlined in the first comment in this issue. Once the inferencing is (hopefully) successfully done, the last 4 weeks should be spent creating the post processing functions and packaging the code written.

sarrah-basta commented 1 year ago

Week 9 Progress Update

  1. Tested more versions and updated the community thread to help resolve the issue encountered in the Wasi-NN plugin.
  2. Traced the DiT model for Document Image Classification.
  3. Added support in the Wasi-NN plugin for tuple-type output, to support the DiT model.
  4. Tested successful inference of the DiT model fine-tuned on RVL-CDIP in Wasi-NN with dummy inputs.
  5. Created pre-processing functions resize_image and normalize_image for the DiT model.
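The normalize_image step can be sketched as scaling u8 pixels to [0, 1] and applying a per-channel (value - mean) / std. The mean/std of 0.5 used below are common for BEiT-style checkpoints but are an assumption here; the real values should come from the model's preprocessor config:

```rust
// Illustrative sketch of a normalize_image step: scale u8 pixels to
// [0, 1] and apply (value - mean) / std. The 0.5/0.5 constants are an
// assumption; use the model's preprocessor config in practice.
fn normalize(pixels: &[u8], mean: f32, std: f32) -> Vec<f32> {
    pixels
        .iter()
        .map(|&p| (p as f32 / 255.0 - mean) / std)
        .collect()
}

fn main() {
    let out = normalize(&[0, 255], 0.5, 0.5);
    println!("{:?}", out); // [-1.0, 1.0]
}
```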

Week 10 Plan