Closed. Vidminas closed this 12 months ago.
To back this up, I ran a full test loading the same PDF from https://www.workerinfoexchange.org/uber-guidance-document
This is the result using `torch.bfloat16`:
100%███████████████████████████████████████████████████████████████████████| 57/57 [4:38:22<00:00, 293.03s/it]
And without moving weights or inputs (default `torch.float32`):
100%███████████████████████████████████████████████████████████████████████| 57/57 [22:39<00:00, 23.85s/it]
So more than a 12x speed-up.
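The figure follows from the total elapsed times in the two progress bars above. A quick check, where `elapsed_to_seconds` is a helper name introduced here for illustration:

```python
def elapsed_to_seconds(elapsed: str) -> int:
    """Convert a tqdm-style elapsed time ("H:MM:SS" or "MM:SS") to seconds."""
    seconds = 0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Elapsed times copied from the two progress bars above.
bf16_total = elapsed_to_seconds("4:38:22")
fp32_total = elapsed_to_seconds("22:39")

speedup = bf16_total / fp32_total
print(f"bfloat16: {bf16_total} s, float32: {fp32_total} s, speed-up: {speedup:.2f}x")
```

The per-iteration numbers give the same ratio (293.03 / 23.85).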
System information:
OS Name: Microsoft Windows 10 Pro
Version: 10.0.19045 Build 19045
Processor: Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz, 2904 Mhz, 2 Core(s), 4 Logical Processor(s)
Total Physical Memory: 15.9 GB
Python: Python 3.11.4 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 13:38:37) [MSC v.1916 64 bit (AMD64)] on win32
I don't know if this is a general rule for CPU inference or just an edge case. But being able to set the precision would allow handling this, whether it is an edge case or not.
Oh wow. Thanks! Let me have a look
That didn't work for me. It actually got slower by about 16 s/it (from 43.58 to 59.26) using an 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz.
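Since the two CPUs in this thread disagree, it may be worth timing both dtypes on your own machine before choosing one. Below is a minimal sketch, not a Nougat benchmark: `seconds_per_call` and `compare_dtypes` are names made up for this example, and a plain matrix multiplication only loosely approximates full model inference.

```python
import time
from typing import Callable, Dict


def seconds_per_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Average wall-clock seconds per call of `fn` over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def compare_dtypes(size: int = 512, repeats: int = 3) -> Dict[str, float]:
    """Time a square matmul on the CPU in float32 and in bfloat16."""
    import torch  # imported lazily so the timing helper works without torch

    results = {}
    for dtype in (torch.float32, torch.bfloat16):
        a = torch.randn(size, size).to(dtype)
        b = torch.randn(size, size).to(dtype)
        results[str(dtype)] = seconds_per_call(lambda: a @ b, repeats)
    return results


if __name__ == "__main__":
    for name, secs in compare_dtypes().items():
        print(f"{name}: {secs * 1000:.2f} ms per matmul")
```

If the two timings differ a lot, that hints at which precision the CPU handles better, but only timing the real model on a real PDF page settles it.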
I've added the option to use full precision
I borrowed the code from `predict.py` to write a LangChain PDF loader using Nougat.

Full nougat_loader.py
```python3
from functools import partial
from typing import Optional, Iterator
import re

import torch
from tqdm import tqdm
from torch.utils.data import DataLoader
from langchain.docstore.document import Document
from langchain.document_loaders.pdf import BasePDFLoader

from chatdocs.logger import logger


class NougatPDFLoader(BasePDFLoader):
    """Load `PDF` files using Nougat (https://facebookresearch.github.io/nougat/)."""

    def __init__(
        self,
        file_path: str,
        *,
        num_workers: Optional[int] = 0,
        headers: Optional[dict] = None,
    ) -> None:
        """Initialize with file path."""
        super().__init__(file_path, headers=headers)
        self.num_workers = num_workers

        try:
            from nougat import NougatModel
            from nougat.utils.checkpoint import get_checkpoint
            from nougat.utils.device import move_to_device
        except ImportError:
            raise ImportError(
                "`nougat` package not found, please install it with "
                "`pip install nougat-ocr`"
            )

        checkpoint = get_checkpoint("nougat", download=True)
        self.model = NougatModel.from_pretrained(checkpoint)

        # Pick a batch size based on the available hardware.
        if torch.cuda.is_available():
            self.batch_size = int(
                torch.cuda.get_device_properties(0).total_memory
                / 1024
                / 1024
                / 1000
                * 0.3
            )
            if self.batch_size == 0:
                self.batch_size = 1
                logger.warning("GPU VRAM is too small. Computing on CPU.")
        elif torch.backends.mps.is_available():
            self.batch_size = 4
        else:
            self.batch_size = 1
            logger.warning("No GPU found. Conversion on CPU is very slow.")

        self.model = move_to_device(self.model)
        self.model.eval()

    def load(self) -> list[Document]:
        """Eagerly load the content."""
        return list(self.lazy_load())

    def lazy_load(
        self,
    ) -> Iterator[Document]:
        """Lazily load documents."""
        import pypdf
        from nougat.utils.dataset import LazyDataset
        from nougat.postprocessing import markdown_compatible

        try:
            dataset = LazyDataset(
                pdf=self.file_path,
                prepare=partial(self.model.encoder.prepare_input, random_padding=False),
            )
        except pypdf.errors.PdfStreamError:
            logger.info(f"Could not load file {str(self.file_path)}.")
            return

        dataloader = DataLoader(
            dataset,
            num_workers=self.num_workers,
            batch_size=self.batch_size,
            shuffle=False,
            collate_fn=LazyDataset.ignore_none_collate,
        )

        for page_num, (sample, is_last_page) in enumerate(
            tqdm(
                dataloader,
                desc=f"Processing file {dataset.name} with {dataset.size} pages",
                ncols=80,
                position=0,
                leave=True,
            )
        ):
            model_output = self.model.inference(
                image_tensors=sample, early_stopping=True
            )
            # check if model output is faulty
            for i, output in enumerate(model_output["predictions"]):
                if output.strip() == "[MISSING_PAGE_POST]":
                    # uncaught repetitions -- most likely empty page
                    logger.warning(f"[MISSING_PAGE_EMPTY:{page_num}]")
                elif model_output["repeats"][i] is not None:
                    if model_output["repeats"][i] > 0:
                        # If we end up here, it means the output is most likely not complete and was truncated.
                        logger.warning(f"Skipping page {page_num} due to repetitions.")
                    else:
                        # If we end up here, it means the document page is too different from the training domain.
                        # This can happen e.g. for cover pages.
                        logger.warning(f"[MISSING_PAGE_EMPTY:{i+1}]")
                else:
                    output = markdown_compatible(output)
                    output = re.sub(r"\n{3,}", "\n\n", output).strip()
                    metadata = {"source": self.file_path, "page": page_num}
                    yield Document(page_content=output, metadata=metadata)
```

I'm running this on a CPU-only machine on Windows 10. By default, the model parameters and inputs are converted to
`torch.bfloat16` tensors. Experimentally, it takes about 8 minutes to process 1 PDF page (the message "No GPU found. Conversion on CPU is very slow." is not kidding!)

However, I found that if I run everything without converting to `torch.bfloat16` (leaving it at the default `torch.float32` precision), it runs much faster -- ~30 seconds per PDF page. To do this, I commented out the line `self.model = move_to_device(self.model)` in my PDF loader and the line `image_tensors = image_tensors.to(torch.bfloat16)` (plus its if condition) in `nougat\model.py`, class `NougatModel`, `inference` method (which gets called by `model_output = self.model.inference(image_tensors=sample, early_stopping=True)`).

It might be useful to add a parameter to `model.inference` to allow selecting the desired precision, and perhaps a CLI flag for it too. What do you think?
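A rough sketch of what such an option could look like. The `--precision` flag, the `PRECISIONS` table and `cast_for_inference` are hypothetical names chosen for illustration, not Nougat's actual API. The only PyTorch calls assumed are `Module.to(dtype)` and `Tensor.to(dtype)`.

```python
import argparse

# Hypothetical mapping from a user-facing precision name to a torch dtype name.
PRECISIONS = {"bf16": "bfloat16", "fp32": "float32"}


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with an illustrative --precision flag."""
    parser = argparse.ArgumentParser(description="Precision flag sketch")
    parser.add_argument(
        "--precision",
        choices=sorted(PRECISIONS),
        default="bf16",  # mirrors the bfloat16 default described above
        help="Numeric precision for model weights and inputs.",
    )
    return parser


def cast_for_inference(model, image_tensors, precision: str):
    """Cast a torch module and its input batch to the selected dtype."""
    import torch  # imported lazily so the parser works without torch installed

    dtype = getattr(torch, PRECISIONS[precision])
    return model.to(dtype), image_tensors.to(dtype)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Selected precision: {args.precision} -> torch.{PRECISIONS[args.precision]}")
```

Passing the choice as one value keeps the model and its inputs in the same dtype, which matters because mixing `bfloat16` weights with `float32` inputs would raise a dtype mismatch in most layers.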