Open shadow-of-Darkness opened 1 week ago
Cool. In my opinion, MinerU is a brilliant project, but it has one shortcoming: it needs about 8 GB of GPU memory, and some users may only have a CPU. As you suggest, offering it as a selectable option is a good idea. If you are interested in MinerU, feel free to try it and send me a pull request.
This project uses RapidOCR for image OCR and Fitz (from the PyMuPDF package) for PDF OCR. To be honest, it is extremely difficult to recognize tables in some PDFs, especially in scholarly papers. Therefore, I have a suggestion: consider using MinerU as the PDF OCR tool. MinerU (https://github.com/opendatalab/MinerU) is an open-source project that converts PDFs into a data format that greatly aids content extraction. You could offer it as a selectable option.
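One way the "selectable option" could look is a small backend registry. This is only a rough sketch: the names `register_backend`, `select_backend`, `extract_with_pymupdf`, `extract_with_mineru`, and the `gpu_available` flag are all hypothetical and not taken from this project or from MinerU, and the extractor bodies are placeholders rather than real library calls. It assumes the default stays on the existing PyMuPDF path and that MinerU is skipped when no GPU is available.

```python
from typing import Callable, Dict

# Hypothetical extractor signature: takes a PDF path, returns extracted text.
Extractor = Callable[[str], str]

# Registry mapping a backend name to its extractor function.
_BACKENDS: Dict[str, Extractor] = {}


def register_backend(name: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers an extractor under the given name."""
    def decorator(func: Extractor) -> Extractor:
        _BACKENDS[name] = func
        return func
    return decorator


@register_backend("pymupdf")
def extract_with_pymupdf(pdf_path: str) -> str:
    # Placeholder: the project's existing PyMuPDF-based extraction would go here.
    return f"[pymupdf] {pdf_path}"


@register_backend("mineru")
def extract_with_mineru(pdf_path: str) -> str:
    # Placeholder: a MinerU-based extraction would go here (assumption, not MinerU's API).
    return f"[mineru] {pdf_path}"


def select_backend(name: str = "pymupdf", gpu_available: bool = False) -> Extractor:
    """Return the requested backend, falling back to PyMuPDF when MinerU has no GPU."""
    if name not in _BACKENDS:
        raise ValueError(f"Unknown PDF backend: {name!r}")
    if name == "mineru" and not gpu_available:
        # Assumed policy: CPU-only users silently keep the lightweight default.
        return _BACKENDS["pymupdf"]
    return _BACKENDS[name]


if __name__ == "__main__":
    extractor = select_backend("mineru", gpu_available=True)
    print(extractor("paper.pdf"))
```

With a layout like this, CPU-only users would keep the current behavior, and MinerU would only be used when it is explicitly requested and a GPU is present.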