Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

[BUG]: PDF Title Property Reflects Original TXT File Path Instead of Descriptive Title #837

Closed: kijukusanagi closed this issue 7 months ago

kijukusanagi commented 7 months ago

How are you running AnythingLLM?

Docker (local)

What happened?

Expected Behavior: I expected the document to keep the same name I gave the file on my computer. The Title property of the PDF should be a descriptive title related to the document's content, or the name of the original file without the path.

Actual Behavior: The Title property is set to the full path of the original .txt file, including drive and folders (e.g., D:\dist\text\1016_rer.txt).

Happy to explain further if needed but wanted to keep this as concise as possible.

Are there known steps to reproduce?

I'm pulling PDFs from a government website. The problem seems to occur only because of how the file properties are set on their PDFs.

Here are the file properties:

File name: 2024a_1016_rer.pdf
File size: 144 KB
Title: D:\dist\text\1016_rer.txt (I renamed this to 2024-SB 151, but it reverts to the original title.)
Author: Domino_Admin
Subject: -
Keywords: -
Created: 2/23/24, 9:47:49 AM
Modified: 2/23/24, 9:47:49 AM
Application: PScript5.dll Version 5.2.2
PDF producer: Acrobat Distiller 23.0 (Windows)
PDF version: 1.7
Page count: 3
Page size: 8.50 × 11.00 in (portrait)
Fast web view: Yes

timothycarambat commented 7 months ago

Related: https://github.com/Mintplex-Labs/anything-llm/blob/60fc5f715ad85b0b5f13ddf4eb221eb89eb7d9da/collector/processSingleFile/convert/asPDF.js#L43

Also related: https://github.com/Mintplex-Labs/anything-llm/issues/816
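
For illustration only, here is a minimal sketch of one way to get the behavior requested above: fall back to the uploaded filename when the PDF's embedded Title is empty or looks like a file path. It assumes the collector has both the metadata title and the filename available at the point where the title is chosen. The names `looksLikeFilePath` and `resolveTitle` are hypothetical and are not taken from the AnythingLLM codebase, and the path check is a simple heuristic rather than a complete rule.

```javascript
// Hypothetical helpers, not part of AnythingLLM.

// Heuristic: treat a title as a file path if it starts with a Windows drive
// prefix (D:\ or D:/), a UNC prefix (\\server), or a POSIX root (/).
function looksLikeFilePath(title) {
  return /^(?:[A-Za-z]:[\\\/]|\\\\|\/)/.test(title);
}

// Choose the title to store for a document.
// metadataTitle: the Title value read from the PDF's properties (may be missing).
// filename: the name of the uploaded file, e.g. "2024a_1016_rer.pdf".
function resolveTitle(metadataTitle, filename) {
  const title = typeof metadataTitle === "string" ? metadataTitle.trim() : "";
  // Fall back to the filename when there is no usable embedded title.
  if (title.length === 0 || looksLikeFilePath(title)) return filename;
  return title;
}

// Usage with the values reported in this issue:
console.log(resolveTitle("D:\\dist\\text\\1016_rer.txt", "2024a_1016_rer.pdf"));
// prints "2024a_1016_rer.pdf"
```

A genuine descriptive title such as "2024-SB 151" does not match the path check, so it would be kept as-is.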