facebookresearch / nougat

Implementation of Nougat Neural Optical Understanding for Academic Documents
https://facebookresearch.github.io/nougat/
MIT License

IndexError: list index out of range #132

Open huycke opened 11 months ago

huycke commented 11 months ago

I'm running into an issue where processing either an individual .pdf or a directory of .pdf files fails with "IndexError: list index out of range". I've tried a few things to figure this out, but can't seem to make headway. Below are the command I ran and the full output.

(C:\Projects\Matt.conda) PS C:\Projects\libraries> nougat C:\Projects\libraries\gaming -o C:\Projects\libraries\gaming\cleaned --no-skipping -m 0.1.0-base

INFO:root:Found 42 files.
C:\Projects\Matt.conda\lib\site-packages\torch\functional.py:505: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:3492.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
  0%| | 0/84 [00:00<?, ?it/s]ERROR:root:Invalid input type 'WindowsPath'
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
WARNING:root:Image not found
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:Invalid input type 'WindowsPath'
ERROR:root:list index out of range
ERROR:root:list index out of range
ERROR:root:list index out of range
  2%|██ | 2/84 [00:00<00:00, 1001.74it/s]

Traceback (most recent call last):

C:\Projects\Matt.conda\lib\runpy.py:196 in _run_module_as_main

    193     main_globals = sys.modules["__main__"].__dict__
    194     if alter_argv:
    195         sys.argv[0] = mod_spec.origin
  ❱ 196     return _run_code(code, main_globals, None,
    197                         "__main__", mod_spec)
    198
    199 def run_module(mod_name, init_globals=None,

C:\Projects\Matt.conda\lib\runpy.py:86 in _run_code

     83                     loader = loader,
     84                     package = pkg_name,
     85                     spec = mod_spec)
  ❱  86     exec(code, run_globals)
     87     return run_globals
     88
     89 def _run_module_code(code, init_globals=None,

in :7

      4 from predict import main
      5 if __name__ == '__main__':
      6     sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
  ❱   7     sys.exit(main())
      8

C:\Projects\Matt.conda\lib\site-packages\predict.py:156 in main

    153     predictions = []
    154     file_index = 0
    155     page_num = 0
  ❱ 156     for i, (sample, is_last_page) in enumerate(tqdm(dataloader)):
    157         model_output = model.inference(
    158             image_tensors=sample, early_stopping=args.skipping
    159         )

C:\Projects\Matt.conda\lib\site-packages\tqdm\std.py:1178 in __iter__

    1175         time = self._time
    1176
    1177         try:
  ❱ 1178             for obj in iterable:
    1179                 yield obj
    1180                 # Update and possibly print the progressbar.
    1181                 # Note: does not call self.update(1) for speed optimisation.

C:\Projects\Matt.conda\lib\site-packages\torch\utils\data\dataloader.py:633 in __next__

    630             if self._sampler_iter is None:
    631                 # TODO(https://github.com/pytorch/pytorch/issues/76750)
    632                 self._reset() # type: ignore[call-arg]
  ❱ 633             data = self._next_data()
    634             self._num_yielded += 1
    635             if self._dataset_kind == _DatasetKind.Iterable and \
    636                     self._IterableDataset_len_called is not None and \

C:\Projects\Matt.conda\lib\site-packages\torch\utils\data\dataloader.py:677 in _next_data

    674
    675     def _next_data(self):
    676         index = self._next_index() # may raise StopIteration
  ❱ 677         data = self._dataset_fetcher.fetch(index) # may raise StopIteration
    678         if self._pin_memory:
    679             data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
    680         return data

C:\Projects\Matt.conda\lib\site-packages\torch\utils\data\_utils\fetch.py:54 in fetch

     51                 data = [self.dataset[idx] for idx in possibly_batched_index]
     52         else:
     53             data = self.dataset[possibly_batched_index]
  ❱  54         return self.collate_fn(data)
     55

C:\Projects\Matt.conda\lib\site-packages\nougat\utils\dataset.py:114 in ignore_none_collate

    111                     _batch.append(x)
    112                 elif name:
    113                     if i > 0:
  ❱ 114                         _batch[-1] = (_batch[-1][0], name)
    115                     elif len(batch) > 1:
    116                         _batch.append((batch[1][0] * 0, name))
    117             if len(_batch) == 0:

IndexError: list index out of range
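The final frame suggests one way this can happen: `_batch[-1]` is read while `_batch` is still empty, which would be the case if every earlier page in the batch failed to load. The following is a minimal sketch of that pattern, not the actual nougat code. It assumes each batch item is an `(image, name)` pair where `image` is `None` for a page that could not be rasterized and `name` is non-empty only on the last page of a file; the function names `collate_unguarded` and `collate_guarded` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed item shape: (image, name). image is None when the page failed to
# load; name is non-empty only on the last page of a file.
Item = Tuple[Optional[list], str]


def collate_unguarded(batch: List[Item]) -> List[Item]:
    """Simplified stand-in for the branch shown at dataset.py:114."""
    kept: List[Item] = []
    for i, (image, name) in enumerate(batch):
        if image is not None:
            kept.append((image, name))
        elif name and i > 0:
            # Assumes at least one earlier page in this batch was kept.
            kept[-1] = (kept[-1][0], name)
    return kept


def collate_guarded(batch: List[Item]) -> List[Item]:
    """Same logic, but only touches kept[-1] when kept is non-empty."""
    kept: List[Item] = []
    for image, name in batch:
        if image is not None:
            kept.append((image, name))
        elif name and kept:
            kept[-1] = (kept[-1][0], name)
    return kept


# A batch in which every page failed and the last one carries the file name.
all_failed: List[Item] = [(None, ""), (None, "paper.pdf")]

try:
    collate_unguarded(all_failed)
except IndexError as exc:
    print("unguarded raised:", exc)

print("guarded returned:", collate_guarded(all_failed))
```

If this reading is right, the IndexError in the collate function is a symptom, and the earlier `Invalid input type 'WindowsPath'` and `list index out of range` log lines point at where the pages were lost.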

lukas-blecher commented 11 months ago

What python version do you have installed?

huycke commented 11 months ago

I was using 3.11

Correction: it's a conda env using Python 3.10.11.

Calvinnncy97 commented 11 months ago

I also face the same issue.

lukas-blecher commented 11 months ago

Does that happen for all PDFs? Can you share the pypdf and pypdfium2 versions you have installed?
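For anyone else hitting this, the Python version and the relevant package versions can be collected in one go. This is a small sketch using only the standard library; the helper name `report_versions` is hypothetical, and the default package list is an assumption based on the packages mentioned in this thread.

```python
import platform
from importlib import metadata


def report_versions(packages=("nougat-ocr", "pypdf", "pypdfium2", "torch")):
    """Return a mapping of Python and installed package versions."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            versions[name] = "not installed"
    return versions


for name, version in report_versions().items():
    print(f"{name}: {version}")
```

Run it inside the same environment that runs `nougat`, so the versions reported are the ones actually in use.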

huycke commented 11 months ago

Here's what I got from pip show:

Name: pypdf
Version: 3.16.1
Summary: A pure-python PDF library capable of splitting, merging, cropping, and transforming PDF files
Home-page:
Author:
Author-email: Mathieu Fenniak biziqe@mathieu.fenniak.net
License:
Location: c:\projects\jeff.conda\lib\site-packages
Requires:
Required-by: nougat-ocr

Name: pypdfium2
Version: 3.21.1
Summary: Python bindings to PDFium
Home-page: https://github.com/pypdfium2-team/pypdfium2
Author: pypdfium2-team
Author-email: geisserml@gmail.com
License: Apache-2.0 or BSD-3-Clause
Location: c:\projects\jeff.conda\lib\site-packages
Requires:
Required-by: nougat-ocr, python-doctr

DogNick commented 10 months ago

I also came across this issue. Can anyone shed some light on it?

DogNick commented 10 months ago

> Does that happen for all pdfs? can you share the pypdf and pypdfium2 versions you have installed?

Not always, but when it does happen it can take down the whole processing run.
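Until the root cause is found, one way to keep a single bad file from aborting a whole directory is to process files one at a time and record failures. The sketch below is a generic workaround, not part of nougat: `process_all` and `process_one` are hypothetical names, and `process_one` stands for whatever per-file call is used, for example a wrapper that invokes the `nougat` command on one PDF.

```python
import logging
from pathlib import Path
from typing import Callable, Dict, Iterable


def process_all(
    paths: Iterable[Path], process_one: Callable[[Path], None]
) -> Dict[Path, str]:
    """Run process_one on each path, recording failures instead of aborting."""
    failures: Dict[Path, str] = {}
    for path in paths:
        try:
            process_one(path)
        except Exception as exc:  # broad on purpose: any per-file error is recorded
            logging.error("Skipping %s: %s", path, exc)
            failures[path] = repr(exc)
    return failures
```

The returned mapping lists which files failed and why, which also helps answer whether the error occurs for all PDFs or only some. Note that calling the command line tool once per file reloads the model each time, so this trades speed for isolation.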

huycke commented 10 months ago

For me it appears to happen with any pdf. I'm using OCR'd scientific journal articles.