sarrah-basta opened 1 year ago
OCR Engine - Google's Tesseract. The text in the image needs to be extracted and understood (in the form of words and bounding boxes indicating where those words occur) by all succeeding models, except for OCR-free models such as the Donut Document Understanding Transformer. While other options exist, importing libtesseract (its C API) into WasmEdge and calling it from Rust will be the most beneficial, as further models are fine-tuned using the same engine.
Tokenization - Hugging Face's Tokenizer Library
The `words` and `bounding boxes` obtained from an OCR engine need to be passed to a tokenizer in order to process them further. The aim is to replicate the PreTrainedTokenizerFast class in Hugging Face, based on their tokenizer library written in Rust.
General Pre- and Post-Processing functions. Include all functions required to prepare input features for vision models and post-process their outputs, based on the ImageProcessor and FeatureExtractor in Hugging Face.
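To make the post-processing side concrete: for classification-style outputs ("predicted class with maximum score"), the raw logits are typically reduced with a softmax followed by an argmax. A minimal sketch, with function names that are illustrative rather than from the project:

```rust
// Illustrative post-processing helpers: turn raw classification logits
// into (class_index, probability). Names are hypothetical.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the max for numerical stability before exponentiating.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn predicted_class(logits: &[f32]) -> (usize, f32) {
    let probs = softmax(logits);
    probs
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .map(|(i, &p)| (i, p))
        .unwrap()
}

fn main() {
    let (idx, prob) = predicted_class(&[0.1, 2.5, -1.0]);
    println!("class {idx} with probability {prob:.3}");
}
```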
Document AI multimodal models pre-train text, layout and image in a multi-modal framework using large-scale unlabeled scanned/digital-born documents. These models are then used in visually-rich downstream document understanding tasks by fine-tuning them on the task-respective labeled benchmark dataset. The following table outlines the tasks, datasets and corresponding models to be supported in this project.
| Document AI Task | Benchmark Dataset | Required Inputs | Model | Expected Outputs | Reference |
|---|---|---|---|---|---|
| Optical Character Recognition | - | Image | Host function to Tesseract OCR C API | Words, bounding boxes and optional tokenization | - |
| Document Image Classification | RVL-CDIP | `pixel_values` obtained from ImageProcessor and FeatureExtractor | microsoft/dit-base-finetuned-rvlcdip | Predicted class with maximum score | Notebook 1 and Notebook 2 |
| Document Layout Analysis | PubLayNet | To be tested | nielsr/dit-document-layout-analysis | Set of segmentation masks/bounding boxes, along with class names and scores | Spaces and Python Script |
| Document Parsing | FUNSD | `input_ids`, `token_type_ids`, `attention_mask`, `bbox`, `labels`, `image` obtained from OCR Engine, Tokenizer and ImageProcessor | nielsr/layoutlmv2-finetuned-funsd | A set of tokenized sequences and corresponding bounding boxes | Notebook |
| Table Detection and Extraction | PubTables-1M | `pixel_values` obtained from ImageProcessor | https://huggingface.co/microsoft/table-transformer-detection | A struct containing bounding box and corresponding confidence for each table detected | Notebook |
| Document Visual Question Answering | DocVQA | `input_ids`, `token_type_ids`, `attention_mask`, `bbox`, `labels`, `image` and the question, obtained from OCR Engine, Tokenizer and ImageProcessor | layoutlmv3-base-mpdocvqa | A single answer to the asked question | Notebook |
In order to achieve the first main goal of integrating Document AI, I compiled a Rust wrapper for Tesseract to WebAssembly using WasmEdge plugins to perform OCR on images. Tesseract itself is installed with
sudo apt install tesseract-ocr
on Linux. Thus, Tesseract OCR can now be driven from Rust code compiled to Wasm; the rough test code is uploaded at https://github.com/sarrah-basta/wasmedge_ai_testing/blob/main/rusty-tesseract-wasm/README.md#build-instructions-to-build-the-wrapper .
The next step is to use the `words` and `bounding boxes` given by the above OCR in a complete model, and to test a model that depends on Tesseract working, which would allow us to discover most of the potential problems we will have with other models. Thank you!
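For reference, Tesseract's TSV output (one row per detection, with word rows at level 5) can be parsed into word and box vectors along the following lines. This is a hedged sketch based on Tesseract's documented TSV column order, not the project's actual parser:

```rust
// Sketch: parse Tesseract TSV output (e.g. from `tesseract img out tsv`)
// into word strings and (left, top, width, height) boxes.
// Column order: level, page_num, block_num, par_num, line_num, word_num,
// left, top, width, height, conf, text. Level-5 rows are individual words.
fn parse_tsv(tsv: &str) -> (Vec<String>, Vec<[i32; 4]>) {
    let mut words = Vec::new();
    let mut boxes = Vec::new();
    for line in tsv.lines().skip(1) { // skip the header row
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() == 12 && cols[0] == "5" {
            let b: Vec<i32> = cols[6..10].iter().map(|c| c.parse().unwrap()).collect();
            words.push(cols[11].to_string());
            boxes.push([b[0], b[1], b[2], b[3]]);
        }
    }
    (words, boxes)
}

fn main() {
    let tsv = "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext\n5\t1\t1\t1\t1\t1\t10\t20\t40\t12\t96.0\thello";
    let (words, boxes) = parse_tsv(tsv);
    println!("{:?} {:?}", words, boxes);
}
```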
In order to create the next main pre-processing block, I worked on creating a tokenizer using the Rust Tokenizers core library, compiled to WebAssembly, to tokenize the text given by the OCR.
I was able to create Rust code compiled with WasmEdge to solve the first two parts, i.e. create the correct tokenizer and obtain the encodings.
Note: I tested this by using the words obtained via OCR from the Hugging Face ImageProcessor; this will later be replaced by the Wasm implementation of Tesseract created earlier. I compared the tokens obtained from my Rust implementation and the Hugging Face Python classes here, and they mostly look correct.
The next steps will be to create the end-to-end pipeline for the Sequence Labelling/Document Parsing task, which is what I have been referring to so far. This will include: a. integrating the rusty-tesseract-wasm OCR, b. converting the obtained encodings to the correct tensor formats, and c. inferencing the tensors with the PyTorch model using the Wasi-NN plugin.
Once this proof of concept is complete, I will be able to a. clean the code and add input checks, OS checks (for running CLI tesseract commands), etc., and b. divide the code into modular functions.
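To illustrate conceptually what the tokenizer step does, here is a toy greedy longest-match WordPiece with a made-up three-entry vocabulary. The real implementation uses the Hugging Face `tokenizers` Rust crate with the model's actual vocabulary, not this sketch:

```rust
// Toy greedy longest-match WordPiece, only to illustrate the idea behind
// subword tokenization. The vocabulary here is a fabricated example.
fn wordpiece(word: &str, vocab: &[&str]) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < word.len() {
        let mut end = word.len();
        let mut found = None;
        // Try the longest possible piece first, shrinking until a match.
        while end > start {
            let piece = if start == 0 {
                word[start..end].to_string()
            } else {
                format!("##{}", &word[start..end]) // continuation marker
            };
            if vocab.contains(&piece.as_str()) {
                found = Some((piece, end));
                break;
            }
            end -= 1;
        }
        match found {
            Some((piece, e)) => {
                tokens.push(piece);
                start = e;
            }
            None => return vec!["[UNK]".to_string()],
        }
    }
    tokens
}

fn main() {
    let vocab = ["lay", "##out", "##lm"];
    println!("{:?}", wordpiece("layoutlm", &vocab));
}
```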
Thank you so much for the update! I just want to clarify that you have created no additional host functions / plugins, and that you got the entire OCR program working inside WasmEdge (Rust compiled into Wasm). Is that correct? Thanks!
> I just want to clarify that you have created no additional host functions / plugins
Yes @juntao, that's correct. While I originally thought this would need to be done by leveraging the C API of Tesseract, Tesseract also has command-line functionality that can be used by simply installing the pre-built binaries, so I decided to leverage that instead.
> entire OCR program working inside WasmEdge (Rust compiled into Wasm)
Hence, yes, the entire program now works inside WasmEdge. I did, however, have to make use of a plugin, `wasmedge_process_interface`, to be able to use the command-line functionality of the native operating system (which WasmEdge is running on) while the user's Wasm is being executed on WasmEdge.
Hope this clears up the need and functioning, thank you!
P.S. Pytesseract, the Python wrapper for Tesseract used in most AI applications I am referring to, also uses an identical approach.
In order to create the modular end-to-end pipeline of the LayoutLMv2ModelForTokenClassification (currently using the temporary CLI-based method for Tesseract OCR), I created the following preprocessing functions to get the inputs required by the model in the correct tensor formats.

- `(words, boxes) = apply_tesseract(image_name, image_width, image_height)`: applies Tesseract OCR using the wrapper around the CLI functionality of the Tesseract OCR engine and parses the output to return vectors of the words and bounding boxes obtained.
- `base_encodings = layoutlmv2_tokenizer(words)`: creates a "Fast" tokenizer using the Rust core library and encodes the words obtained from OCR, turning the work done in Week 3 into a modular function.
- `bboxes = encoded_boxes(&base_encodings, boxes)`: creates the bboxes in the format needed by the model, using the ids from the encodings created by the tokenizer and the boxes created by the OCR.
- `resize_image` and `to_bgr_image`: basic image processing functions that convert the image to the format required by the model.
- Functions that obtain the `encoded_boxes` vector, `input_ids`, `attention_mask` and `token_type_ids` required by the model.
- `f32_to_tensor_data`: takes in a `Vec` of floats and converts it into tensor data.
- `to_tensor`: takes in the tensor data and converts it into a `wasi_nn::Tensor`.
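One detail worth noting for the box handling: LayoutLM-family models expect each bounding box rescaled to a 0-1000 grid as (x0, y0, x1, y1). A sketch of that normalization, with an illustrative function name:

```rust
// LayoutLM-family models expect bounding boxes on a 0-1000 grid in the
// form (x0, y0, x1, y1). Input is Tesseract-style (left, top, width, height).
// Function name is illustrative, not from the project.
fn normalize_box(b: [i32; 4], img_w: i32, img_h: i32) -> [i64; 4] {
    let (left, top, width, height) = (b[0], b[1], b[2], b[3]);
    [
        (1000 * left / img_w) as i64,
        (1000 * top / img_h) as i64,
        (1000 * (left + width) / img_w) as i64,  // x1 = left + width
        (1000 * (top + height) / img_h) as i64,  // y1 = top + height
    ]
}

fn main() {
    // A 40x12 box at (10, 20) in a 100x200 image.
    println!("{:?}", normalize_box([10, 20, 40, 12], 100, 200));
}
```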
Next, I obtained the required model with the fine-tuned weights and traced it in Python to convert it to TorchScript, used in the function `infer_layout_lmv2`. I communicated with the mentors throughout, and while I was able to solve most of the issues I was facing, I am still facing the following errors, due to which the inference step in the end-to-end pipeline remains incomplete.
The code for these preprocessing functions is at https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/layoutlmv2_model .
```
[2023-04-16 20:59:08.292] [error] [WASI-NN] Only F32 inputs and outputs are supported for now.
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NnErrno { code: 1, name: "INVALID_ARGUMENT", message: "" }', src/main.rs:107:9
stack backtrace:
[2023-04-16 20:59:08.293] [error] execution failed: unreachable, Code: 0x89
[2023-04-16 20:59:08.293] [error] In instruction: unreachable (0x00) , Bytecode offset: 0x001a8568
[2023-04-16 20:59:08.293] [error] When executing function name: "_start"
```
This error is caused by this check in the source code plugins/wasi-nn; however, all the tensors I create for the inputs are of the correct types.
Instead, what I believe is happening is that the model only accepts some inputs in the form of PyTorch `LongTensor`s, which are 64-bit integers, and hence F32 `FloatTensor`s won't work here. I even tried various things while tracing the model, such as converting inputs to `FloatTensor`s before tracing, and noticed that the arguments `image` and `attention_mask` work with any datatype, whereas `input_ids`, `bbox` and `token_type_ids` expect integer values only.
I am currently a little stuck and would appreciate some guidance on how to approach this further: is there some reason for only supporting F32 tensors in the WASI-NN plugin for the PyTorch backend, and if so, is there any way to change the expectations of the TorchScript or PyTorch model? Hopefully @juntao can give some insight.
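For context on the datatype mismatch: a wasi-nn tensor is essentially a raw byte buffer plus a type tag, so integer inputs such as `input_ids` need their bytes packed as 64-bit integers rather than as floats. A pure-std sketch of the two packings (no actual wasi-nn calls, helper names are illustrative):

```rust
// wasi-nn tensors carry raw bytes plus a type tag. PyTorch LongTensor
// inputs (input_ids, bbox, token_type_ids) are 64-bit integers, so their
// bytes must be packed as i64 (8 bytes each), not f32 (4 bytes each).
fn i64_to_bytes(v: &[i64]) -> Vec<u8> {
    // Little-endian byte layout, one 8-byte group per element.
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

fn f32_to_bytes(v: &[f32]) -> Vec<u8> {
    // Little-endian byte layout, one 4-byte group per element.
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

fn main() {
    let ids: Vec<i64> = vec![101, 2054, 102];
    let bytes = i64_to_bytes(&ids);
    println!("{} bytes for {} ids", bytes.len(), ids.len());
}
```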
Hence, I have been exploring the C API to get identical results, and while I wait for guidance on the above issue, I will go ahead with creating a host function with the Rust plugin SDK to call functions, after registering the Tesseract C API as a WasmEdge plugin (similar to https://github.com/WasmEdge/WasmEdge/blob/master/examples/plugin/get-string/getstring.cpp ).
@apepkuss and @q82419 Can you comment on the issue about WASI NN not accepting f32 typed tensors? Thanks.
@q82419 According to the investigation by @sarrah-basta, the `wasi-nn` plugin has a type check between lines 623-627. Could you please help fix the issue? Thanks a lot!
In order to create the OCR solution using the Tesseract API, avoiding the CLI dependency of the command-line plugin that breaks the Wasm sandbox in very unpredictable ways, I created:

- a `Data` struct containing all useful information extracted by the OCR engine, and
- the WasmEdge plugin wasi-ocr -> https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/wasi-ocr

The basic flow of the created code is as follows:
The Rust library contains `image_to_data(image_path: &str) -> Vec<Data>`. This function uses two private functions in the library crate to take in the image path and convert it into a `CString`, so that it can be read by the C++ plugin, and calls the plugin function `wasi_ephemeral_ocr::num_of_extractions(image_path: *const c_char, image_len: u32) -> u32`, which returns the length of the buffer required to store the output TSV text obtained by Tesseract. This length is then passed to another plugin function, `wasi_ephemeral_ocr::get_output(output_buf: *mut c_char, output_buf_max_size: u32) -> u32`, which stores the output obtained via the `char *TessBaseAPI::GetTSVText` API function in the output buffer. The pointer to the buffer is decoded as a `&CStr` object and then converted to the appropriate `String` format in Rust, taking care of transferring ownership to Rust in order to avoid problems with the encoding.
This `String` is then parsed, and each detection made is fed into a `Data` struct containing the following fields:
```rust
pub struct Data {
    pub level: i32,
    pub page_num: i32,
    pub block_num: i32,
    pub par_num: i32,
    pub line_num: i32,
    pub word_num: i32,
    pub left: i32,
    pub top: i32,
    pub width: i32,
    pub height: i32,
    pub conf: f32,
    pub text: String,
}
```
A vector of such structs is returned by the public `image_to_data` function of the library, which can be used in any downstream tasks.
I will be using it in the layoutlmv2 model created earlier.
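A hedged sketch of such downstream use, filtering word-level rows out of the returned vector. The struct is abridged here to the fields used, and `words_and_boxes` is a hypothetical helper, not part of the library:

```rust
// Downstream sketch: pull word-level detections (level == 5) out of the
// Vec<Data> returned by image_to_data, dropping low-confidence rows.
// Struct is abridged to the fields this example uses.
pub struct Data {
    pub level: i32,
    pub left: i32,
    pub top: i32,
    pub width: i32,
    pub height: i32,
    pub conf: f32,
    pub text: String,
}

fn words_and_boxes(data: &[Data], min_conf: f32) -> (Vec<String>, Vec<[i32; 4]>) {
    let mut words = Vec::new();
    let mut boxes = Vec::new();
    for d in data {
        if d.level == 5 && d.conf >= min_conf {
            words.push(d.text.clone());
            boxes.push([d.left, d.top, d.width, d.height]);
        }
    }
    (words, boxes)
}

fn main() {
    let data = vec![
        Data { level: 5, left: 10, top: 20, width: 40, height: 12, conf: 96.0, text: "hello".into() },
        Data { level: 5, left: 55, top: 20, width: 30, height: 12, conf: 10.0, text: "???".into() },
    ];
    let (words, boxes) = words_and_boxes(&data, 50.0);
    println!("{:?} {:?}", words, boxes);
}
```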
The WasmEdge Plugin Wasi-OCR contains the two plugin functions described above and the necessary functions to register it as a module in the following file structure
The Tesseract API is created when the environment is created and destroyed at the end of the call to the `image_to_data` function.
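The two host functions described above follow a common length-then-fill buffer protocol. The sketch below mocks the host side with plain Rust functions so the guest-side pattern can be shown end to end; it is an illustration of the protocol, not the plugin code:

```rust
// Mock of the two-call host-function protocol used by the wasi-ocr plugin:
// the first call returns the required buffer length, the second fills the
// buffer. The "host" here is simulated with a constant string.
const FAKE_TSV: &str = "5\t1\t1\t1\t1\t1\t10\t20\t40\t12\t96.0\thello";

fn num_of_extractions(_image_path: &str) -> u32 {
    // Real version runs Tesseract and returns the TSV output length.
    FAKE_TSV.len() as u32
}

fn get_output(buf: &mut [u8]) -> u32 {
    // Real version copies the TSV text into the guest-provided buffer.
    let n = FAKE_TSV.len().min(buf.len());
    buf[..n].copy_from_slice(&FAKE_TSV.as_bytes()[..n]);
    n as u32
}

fn main() {
    // Guest side: ask for the length, allocate, then fetch the output.
    let len = num_of_extractions("sample.png") as usize;
    let mut buf = vec![0u8; len];
    let written = get_output(&mut buf) as usize;
    let tsv = String::from_utf8(buf[..written].to_vec()).unwrap();
    println!("{}", tsv);
}
```

This two-call shape avoids the host guessing a buffer size and lets the guest own the allocation, which is why the plugin splits the work across `num_of_extractions` and `get_output`.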
The Tesseract API has two dependencies which can be installed as follows :
sudo apt install tesseract-ocr
sudo apt-get install libleptonica-dev
More detailed instructions can be found at https://tesseract-ocr.github.io/tessdoc/Installation.html but only the above two libraries are necessary.
They are then linked via the CMakeLists, and the plugin is enabled with the `-DWASMEDGE_PLUGIN_WASI_OCR=On` flag, building as usual following a process similar to https://wasmedge.org/book/en/contribute/build_from_src/plugin_wasi_nn.html . Since the concern about the CLI dependency is now solved, this week I can focus on:
These two models and the preprocessing functions already created will be used for 4 different Document AI tasks outlined in the first comment of this issue. Once the inferencing is (hopefully) successfully done, the last 4 weeks should be spent creating the post-processing functions and packaging the code written.
Created the end-to-end pipeline of the LayoutLMv2ModelForTokenClassification using the Wasi-OCR plugin created earlier to obtain results from Tesseract from within WasmEdge; it contains preprocessing functions to get the inputs required by the layoutlmv2 model in the correct tensor formats, using the `Wasi-OCR` and `Wasi-NN with PyTorch Backend` plugins. The code and a detailed description of the working can be found at https://github.com/sarrah-basta/wasmedge_ai_testing/tree/main/layoutlmv2_with_wasi_ocr/README.md
Investigated further regarding the error being caused by this check in the source code plugins/wasi-nn .
Added `Tuple` type output to support the DiT model, along with `resize_image` and `normalize_image` functions for the DiT model.
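As a sketch of what a `normalize_image` step typically does: scale `u8` pixels to [0, 1], then apply a per-channel (x - mean) / std. The mean/std values below are placeholders; the real ones come from the model's preprocessor configuration:

```rust
// Sketch of per-channel image normalization over interleaved RGB pixels:
// scale u8 to [0, 1], then (x - mean) / std per channel.
// The mean/std passed in main are placeholders, not the DiT model's values.
fn normalize_image(pixels: &[u8], mean: [f32; 3], std: [f32; 3]) -> Vec<f32> {
    pixels
        .chunks(3) // interleaved RGB triples
        .flat_map(|px| {
            px.iter()
                .enumerate()
                .map(|(c, &v)| (v as f32 / 255.0 - mean[c]) / std[c])
                .collect::<Vec<f32>>()
        })
        .collect()
}

fn main() {
    // One RGB pixel with placeholder mean/std of 0.5 each.
    let out = normalize_image(&[255, 0, 128], [0.5; 3], [0.5; 3]);
    println!("{:?}", out);
}
```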
Motivation
The Hugging Face Hub provides a platform hosting a collection of pre-trained models, datasets, and demos of machine learning projects. Their blog gives a concise overview of the state-of-the-art models available for Document AI, which spans many data-science tasks: Optical Character Recognition (OCR), Document Image Classification, Document Layout Analysis, Document Parsing and Document Visual Question Answering. WasmEdge would like to enable easy integration of these Document AI tasks in WasmEdge applications by creating the necessary pre- and post-processing functions in Rust and using the fine-tuned models available on the Hugging Face Model Hub.
Details
Document AI tasks use multimodal models, i.e. models that can unify document text (using OCR), layout (using tokens), and visual information (using spatial information from the image) in a single end-to-end framework that can learn cross-modal interactions. Each Document AI task has a description page that describes its expected output and the datasets for the task. The corresponding models fine-tuned on these datasets are available in PyTorch format, which is supported by the Wasi-NN plugin.
This project aims to
Milestones