長文本摘要 (long-text summary): Chinese PDF cannot be read and the generated content is empty

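Thanks to Teacher Yin for open-sourcing the code 🙏 After spending some time testing on my Mac laptop, I ran into a few problems. The technical details below will hopefully help resolve them.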

OS Env

M1 Mac, macOS 13.5.1 (22G90)

Python Env

# Mambaforge: https://github.com/conda-forge/miniforge#unix-like-platforms-mac-os--linux
# conda --version: 4.12.0
# create a virtual env of python 3.9 with conda
conda create -n prompt4all python=3.9 -y
conda activate prompt4all

# setup steps from project README.md
git clone https://github.com/AllanYiin/Prompt_Is_All_You_Need.git
cd Prompt_Is_All_You_Need
pip install -r requirements.txt

Runtime steps and the error of `長文本摘要`

`#1 Issue: Could not read Chinese pdf correctly`

Set OPENAI_API_KEY
Start the app: python -m prompt4all.app
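Navigate to the 長文本摘要 (long-text summary) tab
Upload the PDF file 03_1.pdf (thanks to Teacher Yin for sharing it 🙏)
The Chinese content of the PDF is displayed as garbled text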
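
To narrow down where the garbling happens, it may help to check the extracted text before it reaches the UI. The sketch below is only a heuristic and makes no assumption about which PDF library the project uses; `cjk_ratio`, `looks_garbled`, and the 0.3 threshold are hypothetical names and values chosen for illustration.

```python
def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK ideographs."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # U+4E00..U+9FFF is the CJK Unified Ideographs block.
    cjk = sum(1 for ch in chars if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(chars)


def looks_garbled(text: str, min_cjk_ratio: float = 0.3) -> bool:
    """Heuristic: text expected to be Chinese is suspicious if it contains
    U+FFFD replacement characters or very few CJK ideographs."""
    return "\ufffd" in text or cjk_ratio(text) < min_cjk_ratio


if __name__ == "__main__":
    print(looks_garbled("第03章新商業智慧平台"))  # False
    print(looks_garbled("(cid:123)(cid:456)"))    # True
```

If the text is already garbled right after extraction, the problem is in the PDF reading step (for example, font or encoding handling) rather than in the display.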

`#2 Issue: NameError: name 'aggregate_summary' is not defined`

Click 送出 (Submit); an error then occurs (see the traceback below)

Traceback (most recent call last):
  File "/Users/martinku/mambaforge/envs/prompt4all/lib/python3.9/site-packages/gradio/routes.py", line 523, in run_predict
    output = await app.get_blocks().process_api(
  File "/Users/martinku/mambaforge/envs/prompt4all/lib/python3.9/site-packages/gradio/blocks.py", line 1437, in process_api
    result = await self.call_function(
  File "/Users/martinku/mambaforge/envs/prompt4all/lib/python3.9/site-packages/gradio/blocks.py", line 1123, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/Users/martinku/mambaforge/envs/prompt4all/lib/python3.9/site-packages/gradio/utils.py", line 508, in async_iteration
    return await iterator.__anext__()
  File "/Users/martinku/mambaforge/envs/prompt4all/lib/python3.9/site-packages/gradio/utils.py", line 827, in asyncgen_wrapper
    async for response in f(*args, **kwargs):
  File "/Users/martinku/Documents/Projects/Prompt_Is_All_You_Need/prompt4all/app.py", line 222, in rolling_summary
    yield aggregate_summary(return_values), full_history
NameError: name 'aggregate_summary' is not defined

After importing the missing function, the error is gone. However, the resulting summary is empty: [] (reported as 輸出文本長度為2,預計耗用tokens數為:5, i.e. an output text length of 2 and an estimated 5 tokens consumed)

Suggested solution

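Looking at the first summary chunk (the PDF above is split into seven chunks), the main cause is that the summary returned by ChatGPT is not a numbered list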

{
   "role":"assistant",
   "content":"摘要清單：\n- 第03章新商業智慧平台\n  - 安裝與設定\n    - 安裝SSRS 2012的前置需求\n      - 版本限制\n        - 標準版（Standard Edition）\n        - 商業智慧版（Business Intelligence Edition）\n        - 企業版（Enterprise Edition）\n        - Express版（Express Edition）\n        - 開發版（Develop Edition）\n        - 雲端服務版本（Windows Azure SQL Database）\n    - 硬體需求\n    - 作業系統與軟體需求\n    - 安裝商業智慧解決方案",
   "total_tokens":182
}

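As a result, the current code at summary_utils.py#L89 produces no output, because every line of the response is dropped by the is_numbered_list_member filter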

def aggregate_summary(results):
    aggs=[]
    for result in results:
        if isinstance(result,dict):
            items=[line for line in result['content'].split('\n') if is_numbered_list_member(line)]
            if all([item[:4]=="    " for item in items]):
                items=[item[4:]for item in items]
            aggs.extend(items)
        elif isinstance(result,str):
            if len(aggs) == 0:
                aggs.append(result.split('\n')[0])
            aggs.extend([c for c in result.split('\n') if c.startswith('-')])
    return aggs
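
The empty result can be reproduced in isolation. The sketch below uses a stand-in for `is_numbered_list_member`, since the project's real implementation is not shown in this issue; the regular expression is an assumption about what "numbered list member" means.

```python
import re


# ASSUMPTION: stand-in for the project's is_numbered_list_member helper.
# It accepts lines such as "1. foo", "2) bar" or "3、baz".
def is_numbered_list_member(line: str) -> bool:
    return re.match(r"^\s*\d+[.)、]\s*\S", line) is not None


# Shortened version of the ChatGPT response quoted above (a bulleted list).
content = (
    "摘要清單：\n"
    "- 第03章新商業智慧平台\n"
    "  - 安裝與設定\n"
    "    - 安裝SSRS 2012的前置需求"
)

# Same filter as in the dict branch of aggregate_summary.
items = [line for line in content.split("\n") if is_numbered_list_member(line)]
print(items)  # []
```

None of the lines start with a number, so the filter keeps nothing and `aggs` stays empty.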

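A temporary workaround (if the filtered result is empty, fall back to taking every line) is shown below, but there is probably a better solution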

diff --git a/prompt4all/utils/summary_utils.py b/prompt4all/utils/summary_utils.py
index 998e084..64620b1 100644
--- a/prompt4all/utils/summary_utils.py
+++ b/prompt4all/utils/summary_utils.py
@@ -87,6 +87,7 @@ def aggregate_summary(results):
     for result in results:
         if isinstance(result,dict):
             items=[line for line in result['content'].split('\n') if is_numbered_list_member(line)]
+            items = [line for line in result['content'].split('\n')] if len(items) == 0 else items
             if all([item[:4]=="    " for item in items]):
                 items=[item[4:]for item in items]
             aggs.extend(items)
@@ -94,6 +95,8 @@ def aggregate_summary(results):
             if len(aggs) == 0:
                 aggs.append(result.split('\n')[0])
             aggs.extend([c for c in result.split('\n') if c.startswith('-')])
+    # join all the items into a single string
+    aggs = '\n'.join(aggs)
     return aggs

⬇️ After applying the workaround

- 第03章新商業智慧平台安裝與設定
  - 安裝SSRS 2012的前置需求
    - 版本限制
      - 標準版（Standard Edition）
        - 提供報表設計、管理和部署功能
        - 不支援進階功能如Power View、資料驅動訂閱和Web Farm架構
        - 不支援PowerPivot整合模式和BISM表格式模型
        - 不支援半加成維度和資料分割
        - 處理器支援4個插槽或16個核心，記憶體最高64GB
      - 商業智慧版（Business Intelligence Edition）
        - 提供Power View、資料驅動訂閱和支援Web Farm的架構功能
        - 支援除叢集架構以外的所有功能
        - 處理器支援4個插槽或16個核心，記憶體最高64GB
      - 企業版（Enterprise Edition）
        - 支援叢集架構，處理器和記憶體無上限限制
      - Express版（Express Edition）
        - 入門級免費伺服器版本，適合建置小型應用程式
        - 處理器支援1個插槽或4個核心，記憶體最高1GB
        - 提供Express Edition with Advanced Services版本，基礎的報表設計和轉譯功能
      - 開發版（Develop Edition）
        - 支援所有企業版功能，只供開發或測試用途使用
      - 雲端服務版本（Windows Azure SQL Database）
        - 提供租賃模式的選項，不支援Analysis Services功能
        - 提供報表設計師和報表管理員功能，資料源只接受Windows Azure SQL Database
        - 延伸性和安全性受限於雲端服務平台
    - 硬體需求
      - 處理器支援4個插槽或16個核心，記憶體最高64GB
    - 作業系統與軟體需求
      - 以SQL Server 2012 SP1+ SharePoint 2013為安裝示範基準
      - 比對說明SharePoint 2010與SharePoint 2013安裝設定的異同之處
  - 安裝商業智慧解決方案
    - 介紹如何安裝Power Pivot for SharePoint整合模式和Reporting Services SharePoint整合模式
    - 設定啟用Power View和PerformancePoint Services

- Windows Azure SQL Database帳號密碼
- 表03-1：Windows Azure SQL Database Reporting比較表
- SharePoint的版本別：標準版、企業版
- 商業智慧功能需要企業版使用者端授權
- 安裝SQL Server 2012的硬體需求：處理器、記憶體、硬碟空間
- 安裝SQL Server 2012的作業系統與軟體需求：作業系統、Microsoft Windows Installer、.NET Framework、Microsoft Internet Explorer、Silverlight 5 Developer Runtime
- 安裝商業智慧解決方案的常見情境：單一機器安裝、多台機器安裝
- 安裝商業智慧解決方案的步驟：安裝SharePoint、升級SharePoint、安裝SQL Server 2012一般模式、安裝SQL Server 2012 PowerPivot for SharePoint模式、啟用SharePoint產品設定精靈、執行PowerPivot組態工具、執行SharePoint啟動精靈、啟動Reporting Services SharePoint整合模式、設定文件庫、設定PerformancePoint和Visio Services
- 安裝SharePoint的先決條件：安裝應用程式伺服器角色、網頁伺服器（IIS）角色、.NET Framework 3.5 SP1、其他相關軟體
- Windows Identity Foundation（WIF，KB974405）
- 安裝SharePoint 2013的先決條件：
  - 伺服器需要安裝應用程式伺服器角色以及網頁伺服器（IIS）角色。
  - Microsoft .NET Framework 4.5
  - Microsoft Sync Framework Runtime v1.0 SP1（x64）
  - Windows Management Framework 3.0
  - Microsoft SQL Server 2008 R2 SP1 Native Client
  - Windows Server AppFabric
  - Microsoft Identity Extensions
  - Microsoft 資訊保護與控管用戶端
  - Microsoft WCF Data Services 5.0
  - 適用於 Windows Server 的 Microsoft AppFabric 1.1 累計更新套件 1（KB2671763）
- 安裝SharePoint主程式，不勾選「立即執行SharePoint產品定精靈」
- 常發生的上傳錯誤與解決辦法：
  - 安裝程式錯誤，重新啟動電腦後繼續安裝
  - 修改註冊碼以解決錯誤
- 升級至SharePoint 2010 SP1（若安裝的是SharePoint 2013，則跳過此步驟）
- 安裝SQL Server 2012一般模式
- 選擇要安裝的SQL Server功能項目，取消勾選「Distributed Relay Controller」及「Distributed Relay Client」
- 指定執行個體的名稱，建議將一般功能安裝設定為預設執行個體
- 設定服務帳號和資料庫驗證模式
- 指定安裝的Analysis Services模式，建議將「商業智慧語意模型-表格式」設為預設的執行個體
- 在Reporting Services安裝設定中，可以同時設定原生模式和SharePoint整合模式
- 建議將原生模式設定為安裝且設定，以便同時使用兩種模式
- 安裝SQL Server 2012 PowerPivot for SharePoint模式
- 在特徵選取畫面中，選擇Analysis Services和資料庫引擎
- 安裝規則錯誤：如果之前安裝的是SharePoint 2013，需使用完整的SQL Server 2012 SP1 FullSlipstream安裝程式
- 安裝時必須使用網域帳戶
- 若使用多維度分析或資料採礦，需單獨安裝Analysis Services的多維度與資料採礦模式
- 安裝完所有SQL Server執行個體後，再安裝SQL Server 2012 SP1
- 啟用SharePoint產品設定精靈，選擇建立新的伺服器陣列
- 設定伺服器陣列需要的資料庫引擎，建議指派為之前安裝的PowerPivot整合模式的資料庫引擎執行個體
- 使用PowerPivot組態工具
- 在設定完PowerPivot for SharePoint整合模式後，出現PowerPivot Gallery網站範本
- 步驟繁瑣，需考慮安裝先後順序的問題
- SQL Server 2012中的PowerPivot組態工具能處理這些繁瑣的步驟
- PowerPivot組態工具是基於PowerShell的組態程式，內建所有設定步驟
- PowerPivot for SharePoint 2010和PowerPivot for SharePoint 2013的組態工具不同，不能混用
- 安裝SQL Server PowerPivot整合模式的執行個體時，需使用SQL Server 2012 SP1 FullSlipstream的完整安裝程式
- 若作業系統是Windows Server 2008 R2，從程式集啟動PowerPivot組態工具
- 若作業系統是Windows Server 2012，點選PowerPivot組態工具動態磚或透過搜尋開啟組態工具
- 組態工具會進行檢核，通過後點選執行即可完成所有設定步驟
- PowerPivot與Security Store Services的帳戶使用者必須一致，且必須為SharePoint的網站管理員
- 執行SharePoint啟動精靈，進入SharePoint管理中心
- 若作業系統是Windows Server 2008 R2，從程式集啟動SharePoint管理中心
- 若作業系統是Windows Server 2012，點選SharePoint管理中心動態磚或透過搜尋開啟管理中心
- 首次進入管理中心，點選啟動精靈，並選擇啟動伺服器陣列設定精靈
- 建立Reporting Services服務應用程式，開啟SharePoint管理命令介面
- 安裝Reporting Services SharePoint Service、安裝Reporting Services應用程式Proxy、啟用Reporting Services應用程式執行個體
- 開啟SharePoint管理中心，點選應用程式管理、服務應用程式、管理服務應用程式，新增SQL Server Reporting Services服務應用程式
- 設定應用程式名稱和應用程式集區名稱，指定要產生關聯的Web應用程式
- 啟動商業智慧相關網站集合功能
- 啟用Power View功能
  - SharePoint 2010: 於SharePoint的網站動作選單中點選「網站設定」
  - SharePoint 2013: 點選畫面左邊的「網站內容」，再點選畫面右方的「設定」
  - 於網站設定功能選單中，於「網站集合管理」區段中點選「網站集合功能」
- 確認商業智慧相關功能是否已正確啟動
  - Power View集合功能
  - PerformancePoint Services 網站集合功能
  - 網站集合的 PowerPivot 功能整合
- 解決Power View無法正常啟動的問題
  - 確認Power View的Web服務應用程式與Reporting Services服務應用程式產生關聯
  - 從SharePoint管理中心點選「應用程式管理」、「服務應用程式」、「管理服務應用程式」，選取Reporting Services應用程式，點選上方Ribbon的「內容」選項，開啟設定對話框
- 設定文件庫以支援Reporting Services內容類型
- 建立要用來放置報表的文件庫
  - 點入要設定新增報表項目的文件庫，按下文件庫上方選單列的「設定」、「文件庫設定值」，進入文件庫設定頁面
  - 點入「一般設定」功能區塊的「進階設定」，將「是否允許內容類型的管理?」改為「是」
  - 在「內容類型」功能區塊中，點入「從現有的網站內容類型新增」項目，選擇「報表伺服器內容類型」，將報表相關項目新增到右方方格內
- 視需求設定PerformancePoint及Visio Services
  - 啟用以下功能：BICenter資料連線功能、PerformancePoint Services網站功能、SharePoint Server企業版網站功能
  - 新增PerformancePoint相關文件庫與清單：PerformancePoint內容清單、PerformancePoint資料連線庫、儀表板庫
  - 設定PerformancePoint應用程式的執行帳戶
- 點選PerformancePoint Services應用程式
- 設定Secure Store Services的自動執行帳戶
- 完成所有商業智慧功能設定
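
The workaround above falls back to every line whenever no numbered items are found. One possible alternative, sketched here under the assumption that bulleted items should be treated like numbered ones, is to accept both list styles and only fall back to all non-empty lines as a last resort. `LIST_ITEM` and `extract_list_items` are hypothetical names, not part of the project.

```python
import re

# ASSUMPTION: illustrative pattern, not the project's own helper.
# Matches "- item", "* item", "• item", "1. item", "2) item", "3、 item".
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)、])\s+\S")


def extract_list_items(content: str) -> list[str]:
    """Keep numbered or bulleted lines; if there are none,
    fall back to all non-empty lines so the summary is never empty."""
    lines = [line for line in content.split("\n") if line.strip()]
    items = [line for line in lines if LIST_ITEM.match(line)]
    return items if items else lines


if __name__ == "__main__":
    sample = "摘要清單：\n- 第03章新商業智慧平台\n  - 安裝與設定"
    for item in extract_list_items(sample):
        print(item)
```

This keeps the nesting of bulleted output intact and drops header lines such as 摘要清單：. The function returns a list, so a caller that expects a single string would still need `'\n'.join(...)` as in the workaround.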
