Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.
https://useanything.com
MIT License

[FEAT]: Add Bookmarks to PDF Metadata in PDFLoader #1784

Closed: justinledwards closed this 7 hours ago

justinledwards commented 5 days ago

What would you like to see?

In collector/processSingleFile/convert/asPDF.js, you could add chapter or bookmark metadata by extending the pdfjs usage to also grab the document outline.

This would make it easier for the user to find the relevant passage and to judge the relevance of the information the LLM used.

const pdfjsLib = require('pdfjs-dist');

class PDFLoader {
  constructor(filePath, options = {}) {
    this.filePath = filePath;
    this.options = options;
  }

  async load() {
    const loadingTask = pdfjsLib.getDocument(this.filePath);
    const pdf = await loadingTask.promise;

    const numPages = pdf.numPages;
    const pages = [];
    // Document-level info plus the bookmark tree.
    // getOutline() resolves to null when the PDF has no bookmarks.
    const metadata = await pdf.getMetadata();
    const outline = await pdf.getOutline();

    // pdfjs page numbers start at 1.
    for (let i = 1; i <= numPages; i++) {
      const page = await pdf.getPage(i);
      const content = await page.getTextContent();
      const text = content.items.map(item => item.str).join(' ');

      pages.push({
        pageNumber: i,
        text,
        outline // the full document outline, repeated on every page
      });
    }

    return {
      pages,
      metadata,
      outline
    };
  }
}

module.exports = PDFLoader;
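
The loader above attaches the whole outline to every page. To tag each page with only the bookmark it falls under, the outline entries would first have to be resolved to page numbers. Below is a minimal sketch of one way that might work, not a tested implementation. The helper names (flattenOutline, resolveOutlinePages, bookmarkForPages) are hypothetical and do not exist in anything-llm or pdfjs; the only pdfjs calls assumed are getDestination() and getPageIndex() on the loaded document.

```javascript
// Sketch only: these helper names are hypothetical, not part of anything-llm.

// Flatten the nested outline returned by pdf.getOutline() into a flat list,
// keeping each entry's title, nesting depth, and raw destination.
function flattenOutline(outline, depth = 0, out = []) {
  for (const node of outline || []) {
    out.push({ title: node.title, depth, dest: node.dest });
    flattenOutline(node.items, depth + 1, out);
  }
  return out;
}

// Resolve each entry's destination to a 1-based page number.
// `pdf` is assumed to be the document object from getDocument(...).promise.
async function resolveOutlinePages(pdf, outline) {
  const resolved = [];
  for (const entry of flattenOutline(outline)) {
    try {
      // Named destinations are strings; explicit ones are already arrays.
      const dest = typeof entry.dest === 'string'
        ? await pdf.getDestination(entry.dest)
        : entry.dest;
      if (!Array.isArray(dest)) continue; // no usable destination
      const ref = dest[0];
      // dest[0] is usually a page reference object, occasionally a page index.
      const pageIndex = Number.isInteger(ref) ? ref : await pdf.getPageIndex(ref);
      resolved.push({ title: entry.title, depth: entry.depth, pageNumber: pageIndex + 1 });
    } catch (err) {
      // Skip entries whose destination cannot be resolved.
    }
  }
  return resolved;
}

// For each page, pick the last bookmark that starts on or before that page.
// Pages before the first bookmark get `null`.
function bookmarkForPages(numPages, resolved) {
  const sorted = [...resolved].sort((a, b) => a.pageNumber - b.pageNumber);
  const result = [];
  let current = null;
  let next = 0;
  for (let page = 1; page <= numPages; page++) {
    while (next < sorted.length && sorted[next].pageNumber <= page) {
      current = sorted[next].title;
      next++;
    }
    result.push({ pageNumber: page, bookmark: current });
  }
  return result;
}
```

Inside load(), this could be used as bookmarkForPages(numPages, await resolveOutlinePages(pdf, outline)), with the resulting bookmark stored on each page in place of the full outline.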

timothycarambat commented 5 days ago

Very cool! We will probably interface with pdfjs directly and rip out the simple LC loader we use right now so we can get this metadata and display it properly.

Related: https://github.com/Mintplex-Labs/anything-llm/issues/392