qianma819 commented 4 months ago

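The match results include files, and clicking one downloads it automatically. Having to open it after downloading is a hassle. How can I make clicking a PDF file in the match results open it immediately in the browser for preview?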

dosubot[bot] commented 4 months ago

To configure the Langchain-Chatchat application to preview PDF files in the browser directly from the match results instead of downloading them, you need to integrate a PDF viewer component into your Streamlit application. The current code snippet you provided shows functionality for downloading selected documents but does not include a PDF viewer integration.

Here's a way to integrate PDF.js, a popular PDF viewer, into your Streamlit application for inline PDF previews:

1. Download and Host PDF.js: First, you need to download PDF.js from its GitHub repository and host it on a server accessible by your Streamlit application. This is necessary because you'll need to reference the PDF.js viewer's HTML file in an iframe.
2. Modify the knowledge_base_page Function: In the section where you handle the selection of documents and provide a download button, you can add an additional step to display the selected PDF file in an iframe using PDF.js. You'll use Streamlit's components.v1.html function to achieve this.

Here's an example modification you can make to the knowledge_base_page function to include an inline PDF viewer:

from urllib.parse import quote

import streamlit as st
from streamlit.components.v1 import html

# Assuming you've hosted PDF.js at 'http://yourserver.com/pdfjs/web/viewer.html'
PDFJS_VIEWER_URL = 'http://yourserver.com/pdfjs/web/viewer.html'

def knowledge_base_page(api: ApiRequest, is_lite: bool = None):
    # Existing code...

    cols = st.columns(4)
    file_name, file_path = file_exists(kb, selected_rows)
    if file_path:
        # Existing download button code...

        # New: Embed PDF in an iframe using PDF.js.
        # NOTE: file_path is a local path here; PDF.js needs a URL it can
        # fetch, so the file must be served over HTTP (see "File Path" below).
        pdf_viewer_url = f"{PDFJS_VIEWER_URL}?file={quote(file_path, safe='')}"
        iframe_code = f'<iframe src="{pdf_viewer_url}" width="100%" height="600"></iframe>'
        # Render through components.v1.html, leaving a little room for the frame border
        html(iframe_code, height=620)

    # Rest of the existing code...

Important Considerations:

- Hosting PDF.js: You must host PDF.js on a server accessible by your Streamlit application. The example URL http://yourserver.com/pdfjs/web/viewer.html is a placeholder and should be replaced with the actual URL where you've hosted PDF.js.
- File Path: The file_path variable contains the path to the PDF file on the server. For security and functionality reasons, you might need to adjust how you serve these PDF files to make them accessible via URL to the PDF.js viewer. This often involves serving the files through a web server or integrating with cloud storage solutions that can provide direct URLs to the files.
- Cross-Origin Requests: If your Streamlit application and the PDF.js viewer are hosted on different domains, you might encounter cross-origin request issues (CORS). You'll need to ensure proper CORS configuration on your server to allow these requests.

This solution provides a basic integration for inline PDF previews. Depending on your specific requirements and setup, you may need to adjust the implementation details.
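
As a rough illustration of the "File Path" point, the viewer URL can be built from an HTTP URL for the file rather than from a local path. This is only a sketch: `FILE_BASE_URL` and `build_viewer_url` are hypothetical names, and it assumes the PDFs are already served from some static route.

```python
from urllib.parse import quote, urlencode

# Hypothetical locations; replace with wherever PDF.js and the files are hosted.
PDFJS_VIEWER_URL = "http://yourserver.com/pdfjs/web/viewer.html"
FILE_BASE_URL = "http://yourserver.com/files/"  # assumed static file route


def build_viewer_url(file_name: str) -> str:
    """Return a PDF.js viewer URL whose `file` parameter is an HTTP URL."""
    # Percent-encode the file name so spaces and non-ASCII characters survive.
    file_url = FILE_BASE_URL + quote(file_name, safe="")
    # Encode the whole file URL once more as a query-string value.
    return f"{PDFJS_VIEWER_URL}?{urlencode({'file': file_url})}"
```

The result can then be placed in the iframe's `src` attribute in place of the raw `file_path`.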


HuntZhaozq commented 4 months ago

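Besides this method, is there any other setting that lets the source file cited in an answer be previewed in the browser when it is clicked? @dosu-bot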

qianma819 commented 4 months ago

    if os.path.exists(kb_file.filepath):
        print("ppppppppppppppppp")  # debug output
        print(kb_file.filename)
        response = FileResponse(
            path=kb_file.filepath,
            media_type="application/pdf",
            filename=kb_file.filename,
            # media_type="multipart/form-data",
            #content_disposition_type=content_disposition_type,
            #headers={"Content-Type":"application/pdf"}
        )
        response.headers["Content-Disposition"]=f"inline;filename={kb_file.filename}"
        return response

After this change, PDF files can be previewed online.

HuntZhaozq commented 4 months ago

@qianma819 I get an error: latin-1 codec can't encode characters in position 16-24: ordinal not in range(256). Also, can this setting preview doc files?

qianma819 commented 4 months ago

I haven't tried doc, because we converted all of our material to PDF. Importing doc files sometimes fails anyway. You can give it a try.

HuntZhaozq commented 4 months ago

After the change, some PDFs can be previewed directly, but others are still downloaded. Have you run into this? How should it be fixed? @qianma819

qianma819 commented 4 months ago

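I haven't run into that. What I hit is an error like this: { "code": 500, "msg": "【ポスコン】KnowHow共有内容一覧11.pdf 读取文件失败，错误信息是：'latin-1' codec can't encode characters in position 16-21: ordinal not in range(256)", "data": null } (the Chinese part of msg means "failed to read the file, the error message is"). Accessing the original file directly works fine, though. What a pain.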

HuntZhaozq commented 4 months ago

This problem is caused by the header. You can remove this line: response.headers["Content-Disposition"]=f"inline;filename={kb_file.filename}"
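
A note on the error itself: it looks like the usual failure when a header value containing non-ASCII characters is encoded as latin-1, which would match the Japanese file name in the report. Instead of dropping the header, one option is the percent-encoded `filename*` form described in RFC 5987 and RFC 6266. A minimal sketch follows; `content_disposition` is a hypothetical helper, not code from the project.

```python
from urllib.parse import quote


def content_disposition(filename: str, disposition: str = "inline") -> str:
    """Build a Content-Disposition value that is safe for non-ASCII file names."""
    try:
        filename.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII name: use the extended form with percent-encoded UTF-8.
        return f"{disposition}; filename*=UTF-8''{quote(filename, safe='')}"
    # Plain ASCII name: a quoted string is enough (escape backslash and quote).
    escaped = filename.replace("\\", "\\\\").replace('"', '\\"')
    return f'{disposition}; filename="{escaped}"'
```

Under these assumptions the header line would become `response.headers["Content-Disposition"] = content_disposition(kb_file.filename)`, and the value stays encodable as latin-1 because it contains only ASCII characters.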

qianma819 commented 4 months ago

What? If we remove it, we can only download and can no longer preview online.

HuntZhaozq commented 4 months ago

You can try this:

    preview = True
    # Defaults: let the server guess the media type and force a download
    media_type = None
    content_disposition_type = 'attachment'
    if preview:
        # Determine media type and content disposition based on file extension
        file_extension = os.path.splitext(file_name)[1].lower()
        if file_extension == '.pdf':
            media_type = 'application/pdf'
            content_disposition_type = 'inline'
        elif file_extension in ['.doc', '.docx']:
            # For DOC files, you can choose either of the following media types:
            # media_type = 'application/msword'
            media_type = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
            content_disposition_type = 'inline'
        # Other file types keep the defaults above (forced download)

    try:
        kb_file = KnowledgeFile(filename=file_name,
                                knowledge_base_name=knowledge_base_name)

        if os.path.exists(kb_file.filepath):
            headers = {
                "Accept-Ranges": "none",  # Disable range requests
            }

            return FileResponse(
                path=kb_file.filepath,
                filename=kb_file.filename,
                media_type=media_type,  # also sets the Content-Type header
                content_disposition_type=content_disposition_type,
                headers=headers,
            )

qianma819 commented 4 months ago

@HuntZhaozq With this change, that problem is indeed gone.

HuntZhaozq commented 4 months ago

But some PDFs still can't be previewed and get downloaded instead. Most of them can be previewed.

qianma819 commented 4 months ago

What kind of PDFs can't be previewed? I haven't hit this in my testing.

HuntZhaozq commented 4 months ago

A small fraction of the ones downloaded from CNKI. Most of them preview fine.

qianma819 commented 4 months ago

Oh, then never mind, I'll make do. They can still be read after downloading anyway.

HuntZhaozq commented 4 months ago

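One more question: files can be downloaded on the local machine, but other machines on the same network segment can't download them. Have you run into this?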

qianma819 commented 4 months ago

I deployed it on a server and access it from there. Everything previews fine.

HuntZhaozq commented 4 months ago

I also deployed on a server, but the file URL is prefixed with 127.0.0.1, which can't be reached. Did you change any configuration?

qianma819 commented 4 months ago

Change DEFAULT_BIND_HOST to the server's actual IP, and the links will no longer use 127.0.0.1.

qianma819 commented 4 months ago

@HuntZhaozq

    if file_extension == '.txt':
        media_type = 'text/plain'
        content_disposition_type = 'inline'

With this, Chinese text in the .txt preview comes out garbled. Do you know how to fix it?
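
One likely cause of garbled text is that the response does not declare a charset, so the browser guesses the encoding. A possible approach is to declare it in the media type. The sketch below assumes the files are either UTF-8 or a GBK-family encoding; `text_media_type` is a hypothetical helper, not part of the project.

```python
def text_media_type(data: bytes, candidates=("utf-8", "gb18030")) -> str:
    """Return a text/plain media type with the first charset that decodes `data`."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # Not this encoding, try the next candidate
        return f"text/plain; charset={encoding}"
    # Nothing matched: fall back and let the browser guess.
    return "text/plain"
```

It could be called with the file's bytes, for example `media_type = text_media_type(open(kb_file.filepath, "rb").read())`. Passing the whole file avoids cutting a multi-byte character in half, at the cost of reading it twice.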

HuntZhaozq commented 4 months ago

You could ask ChatGPT. Can your Word documents be previewed? Following this approach, Word files still get downloaded for me.

qianma819 commented 4 months ago

I don't have any Word documents. Originally I only had PDFs, and later I added txt.

qianma819 commented 4 months ago

        elif file_extension in ['.doc', '.docx']:
            if file_extension == '.doc':
                media_type = 'application/msword'
            else:
                media_type = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

.doc files and .docx files are not the same.
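
The extension checks scattered through this thread could be gathered into one lookup. This is only a sketch with hypothetical names (`MEDIA_TYPES`, `INLINE_EXTENSIONS`, `preview_params`). It also assumes that mainstream browsers have no built-in Word renderer, which would explain why .doc/.docx files are still downloaded even with `inline`, so those are left as attachments here.

```python
import os

# Media types for the extensions discussed in this thread.
MEDIA_TYPES = {
    ".pdf": "application/pdf",
    ".txt": "text/plain; charset=utf-8",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

# Assumption: only these types are rendered natively by common browsers.
INLINE_EXTENSIONS = {".pdf", ".txt"}


def preview_params(file_name: str):
    """Return (media_type, content_disposition_type) for a file name."""
    ext = os.path.splitext(file_name)[1].lower()
    media_type = MEDIA_TYPES.get(ext)  # None means "let the server guess"
    disposition = "inline" if ext in INLINE_EXTENSIONS else "attachment"
    return media_type, disposition
```

The two returned values map onto the `media_type` and `content_disposition_type` arguments used in the `FileResponse` call earlier in the thread.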

Lj4040 commented 2 weeks ago

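In my case the files in the match results can't be downloaded at all. Clicking one just shows "127.0.0.1 refused to connect". How do I fix this, folks?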

chatchat-space / Langchain-Chatchat

Preview files in the match results directly online #3220
