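My PDF contains a table, and after chunking the original table structure is scrambled: there are no delimiters between rows or columns. How can I solve this? Thanks in advance for any help.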

zhanghaotian123 commented 7 months ago

This is the table in my original PDF:

This is the result after splitting in the current chatchat project:

This is the splitting result I want:

dosubot[bot] commented 7 months ago

🤖

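Hello, @zhanghaotian123! We meet again! I hope we can work through this new challenge together. Life is like programming, always full of surprises and challenges, isn't it? 😉

Based on the context you provided, Langchain-Chatchat uses the fitz module from PyMuPDF when processing PDF files. It opens the PDF, iterates over each page, and extracts the text. If a page contains images, it checks whether each image's size relative to the page exceeds a threshold. If it does, it runs OCR on the image with the get_ocr function and appends the result to the text. If the page is rotated, it rotates the image before running OCR.

Here is the relevant code: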

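def pdf2text(filepath):
    # get_ocr, rotate_img and PDF_OCR_THRESHOLD are defined elsewhere in the project
    import fitz  # the fitz module that ships with PyMuPDF; not the same as `pip install fitz`
    import numpy as np
    import cv2
    import tqdm
    from PIL import Image
    ocr = get_ocr()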
    doc = fitz.open(filepath)
    resp = ""

    b_unit = tqdm.tqdm(total=doc.page_count, desc="RapidOCRPDFLoader context page index: 0")
    for i, page in enumerate(doc):
        b_unit.set_description("RapidOCRPDFLoader context page index: {}".format(i))
        b_unit.refresh()
        text = page.get_text("")
        resp += text + "\n"
        img_list = page.get_image_info(xrefs=True)
        for img in img_list:
            if xref := img.get("xref"):
                bbox = img["bbox"]
                # Skip images whose size relative to the page is below the configured threshold
                if ((bbox[2] - bbox[0]) / (page.rect.width) < PDF_OCR_THRESHOLD[0]
                    or (bbox[3] - bbox[1]) / (page.rect.height) < PDF_OCR_THRESHOLD[1]):
                    continue
                pix = fitz.Pixmap(doc, xref)
                if int(page.rotation) != 0:  # if the page is rotated, rotate the image to match
                    img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, -1)
                    tmp_img = Image.fromarray(img_array)
                    ori_img = cv2.cvtColor(np.array(tmp_img), cv2.COLOR_RGB2BGR)
                    rot_img = rotate_img(img=ori_img, angle=360 - page.rotation)
                    # rot_img is in BGR order, so convert it back to RGB
                    img_array = cv2.cvtColor(rot_img, cv2.COLOR_BGR2RGB)
                else:
                    img_array = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, -1)

                result, _ = ocr(img_array)
                if result:
                    ocr_result = [line[1] for line in result]
                    resp += "\n".join(ocr_result)

        # Update the progress bar
        b_unit.update(1)
    return resp

Source:

document_loaders/mypdfloader.py

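However, the code above contains no logic that specifically handles table structure, which may be the cause of the problem you are seeing. Handling tables may require a dedicated library or algorithm that detects and extracts them. For example, libraries such as tabula-py and camelot are designed specifically for extracting tables from PDFs.

If you can share more information, such as the concrete implementation of the file2text method or more details about how tables should be handled, that may help in finding a better solution.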
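
As a rough illustration of what such table handling could look like, here is a minimal sketch, not the project's actual implementation. It assumes the table rows are available as lists of cell strings and renders them as a Markdown table, so every row and column keeps an explicit delimiter after chunking. The names rows_to_markdown and pdf_tables_to_markdown are hypothetical. The second function assumes PyMuPDF 1.23 or later, which provides Page.find_tables(); rows returned by tabula-py or camelot could be fed into the first function in the same way.

```python
from typing import List, Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render table rows as a Markdown table; the first row is used as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells may be None; newlines and pipes would break a Markdown row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def render(row: Sequence[Optional[str]]) -> str:
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)


def pdf_tables_to_markdown(filepath: str) -> List[str]:
    """Hypothetical helper. Assumption: PyMuPDF >= 1.23 (Page.find_tables)."""
    import fitz  # PyMuPDF, imported lazily so rows_to_markdown works without it

    doc = fitz.open(filepath)
    try:
        results = []
        for page in doc:
            for table in page.find_tables().tables:
                # table.extract() returns the table as a list of rows of cell text
                results.append(rows_to_markdown(table.extract()))
        return results
    finally:
        doc.close()
```

For example, rows_to_markdown([["Name", "Qty"], ["Apple", "3"]]) yields a three-line Markdown table with `|` between columns and one line per row. This only applies to tables stored as text in the PDF; a table that exists only as a scanned image would still go through the OCR path.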

zRzRzRzRzRzRzR commented 7 months ago

It is normal for the table to come out scrambled, because the text comes from OCR. There is no fix for this at the moment.

chatchat-space / Langchain-Chatchat

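My PDF contains a table, and after chunking the original table structure is scrambled: there are no delimiters between rows or columns. How can I solve this? Thanks in advance for any help. #2866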
