-
### Requested feature
Handling images with OCR the same way the PDF pipeline does.
What would it take to implement something like this? Is it possible, or is there a reason it cannot be done? I can help wi…
-
-
Hey, thanks for the awesome doc toolkit.
I tried to run with `pdf_path = "tests/test_files/direct_extract/single_column.pdf"`
and got the following error:
```
2024-11-02 17:47:58,569 - rapid_layout - INF…
-
### Bug
For tables where most columns are empty and one column is completely filled, the table that docling extracts truncates the values in the filled column.
### Steps to reproduce
I ha…
-
### Description of the bug
After installing in Docker on Windows 11, running `magic-pdf -p /home/data/12_Malovichko.pdf -o /home/data/output -m auto` fails with a CUDA error at runtime. CUDA appears to be installed, but `nvcc -v` fails.
PS C:\Users\AQUANAUT> docke…
-
Develop a formatter to parse PDF and DOCX files, extract text and tables while handling complex layouts.
- [ ] Research methods of text extraction from PDF and DOCX.
- [ ] Implement Basic Parsing …
-
### Description of the bug
An error is raised when parsing a PDF:
app-1 | 2024-11-06 10:42:24.790 | INFO | magic_pdf.model.pdf_extract_kit:__call__:490 - table time: 0.0
app-1 | │ │ │ │ …
-
I have been working with PDFs for some time, and recently came across tagged PDFs. I read that they have a data structure, **StructTreeNode**, and I want to know if you can add support for it, i.e. low l…
-
```
[](https://localhost:8080/#) in extract_data_from_pdf(pdf_path)
57 # Function to extract text using the unstructured library
58 def extract_data_from_pdf(pdf_path):
---> 59 eleme…
-
I originally opened this as a discussion, but after getting into the code, it appears to be an issue that affects the extraction of not only tables but also images with text on them.
The problem is …