Open Batapha opened 1 week ago
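Adding another case: recognition of an English document should proceed line by line, but in the result, lines from different positions got mixed up.
As the screenshot below shows, in the generated markdown file line 4 is merged with the last line of paragraph 1, while lines 2 and 3 start a new paragraph of their own.
I set --max_pages to 9999 to force every page to be recognized, but that only works when running through marker_single, so I wrote an automation script (.sh) that calls it for each PDF.
The script also takes a batch_multiplier value, so you can experiment with GPU utilization and avoid running out of memory. The code is as follows: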
```bash
#!/bin/bash

# Input and output directories
INPUT_DIR="/Users/User/Documents/Github/maker/Input"
OUTPUT_DIR="/Users/user/Documents/Github/maker/Output"

# batch_multiplier comes from the first argument (default: 1)
BATCH_MULTIPLIER=${1:-1}

# Validate that batch_multiplier is an integer between 1 and 2
if ! [[ "$BATCH_MULTIPLIER" =~ ^[0-9]+$ ]]; then
    echo "Error: batch_multiplier must be a number"
    exit 1
fi

if [ "$BATCH_MULTIPLIER" -lt 1 ] || [ "$BATCH_MULTIPLIER" -gt 2 ]; then
    echo "Error: batch_multiplier must be between 1 and 2"
    exit 1
fi

if [ ! -d "$INPUT_DIR" ]; then
    echo "Error: Input directory '$INPUT_DIR' does not exist"
    exit 1
fi

mkdir -p "$OUTPUT_DIR"

# Count the PDFs up front for the progress display
total_files=$(ls -1 "$INPUT_DIR"/*.pdf 2>/dev/null | wc -l)
current=0

echo "Starting processing with batch_multiplier = $BATCH_MULTIPLIER"

# Run marker_single on every PDF in the input directory
for file in "$INPUT_DIR"/*.pdf; do
    if [ -f "$file" ]; then
        current=$((current + 1))
        filename=$(basename "$file")
        echo "Processing ($current/$total_files): $filename"

        marker_single "$file" "$OUTPUT_DIR" --max_pages 9999 --batch_multiplier "$BATCH_MULTIPLIER"
    fi
done

echo "All PDF files have been processed!"
```
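One possible tweak to the file count, offered only as a sketch: `ls ... | wc -l` can pad its result with spaces on some systems (BSD/macOS `wc` does this), which would show up inside the "Processing (n/total)" message. Counting with the same glob the loop uses avoids that. The helper name `count_pdfs` is hypothetical and not part of marker.

```shell
#!/bin/bash

# count_pdfs DIR
# Prints how many regular files matching DIR/*.pdf exist.
# Hypothetical helper for illustration; not part of marker.
count_pdfs() {
    dir="$1"
    count=0
    for f in "$dir"/*.pdf; do
        # When nothing matches, the glob stays literal and -f fails,
        # so an empty directory correctly yields 0.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

In the script this would replace the `ls | wc -l` line with `total_files=$(count_pdfs "$INPUT_DIR")`; everything else stays the same.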
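How to run the script
Before first use, add execute permission: chmod +x process_pdfs.sh
Two ways to run the script:
Option 1 (requires execute permission): ./process_pdfs.sh 2
Option 2 (no execute permission needed): bash process_pdfs.sh 2
Usage examples:
Valid: ./process_pdfs.sh 2 (higher GPU utilization)
Valid: ./process_pdfs.sh 1 (lower GPU utilization)
Invalid: ./process_pdfs.sh 3 (prints an error message)
Invalid: ./process_pdfs.sh 0 (prints an error message)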
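The accepted and rejected values listed above can be checked without invoking marker_single at all. A minimal sketch, assuming the same rule as the script (an integer between 1 and 2); the function name `validate_batch_multiplier` is hypothetical, and it uses a `case` pattern instead of `[[ =~ ]]` so it also runs under plain sh:

```shell
#!/bin/bash

# validate_batch_multiplier VALUE
# Returns 0 if VALUE is an integer between 1 and 2, otherwise prints
# an error to stderr and returns 1.
# Hypothetical helper mirroring the checks in the script above.
validate_batch_multiplier() {
    value="$1"
    case "$value" in
        # Empty string, or anything containing a non-digit character
        '' | *[!0-9]*)
            echo "Error: batch_multiplier must be a number" >&2
            return 1
            ;;
    esac
    if [ "$value" -lt 1 ] || [ "$value" -gt 2 ]; then
        echo "Error: batch_multiplier must be between 1 and 2" >&2
        return 1
    fi
    return 0
}

# Possible use at the top of the script:
#   validate_batch_multiplier "${1:-1}" || exit 1
```

Keeping the check in a function makes it easy to change the upper bound later if a larger batch_multiplier turns out to fit in memory.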
Note:
In the attached file, see pages 48 and 49 of the PDF: the heading 1.10 is missing from the markdown file, part of the content of these two pages is out of order, and part of it was not recognized at all.
Theories of Truth - Richard Kirkham (... (Z-Library).zip
Theories of Truth - Richard Kirkham (... (Z-Library).pdf
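To spot dropped headings like the missing 1.10 without reading the whole output, a rough check along these lines might help. It is only a sketch under assumptions: it assumes section headings appear in the markdown as `#`-style lines starting with the section number (for example `## 1.10 Title`), which may not match what marker actually emits for a given book, and the function name `find_missing_sections` is hypothetical.

```shell
#!/bin/bash

# find_missing_sections FILE CHAPTER LAST
# Prints every section number CHAPTER.1 .. CHAPTER.LAST that has no
# matching heading line in FILE.
# Assumption: headings look like "## 1.10 Title" (one or more '#',
# whitespace, then the section number).
find_missing_sections() {
    file="$1"
    chapter="$2"
    last="$3"
    n=1
    while [ "$n" -le "$last" ]; do
        # The trailing group keeps "1.1" from matching "1.10".
        if ! grep -Eq "^#+[[:space:]]+${chapter}\.${n}([^0-9]|\$)" "$file"; then
            echo "${chapter}.${n}"
        fi
        n=$((n + 1))
    done
}

# Possible use (the output path is a placeholder):
#   find_missing_sections "path/to/output.md" 1 11
```

A number reported here only means no heading line matched; the section text may still be present in the file, just without its heading.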