Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with full RAG and AI Agent capabilities.
https://anythingllm.com
MIT License

Large file size? #649

Closed xiaoxiao261258 closed 5 months ago

xiaoxiao261258 commented 6 months ago

What would you like to see?

I tried to upload a large PDF file (about 330 MB) three times, and every attempt failed.

timothycarambat commented 6 months ago

Can you please explain your bug?

xiaoxiao261258 commented 5 months ago

Can you please explain your bug?

  • How are you running AnythingLLM?
  • What embedder are you using?
  • Do you see any logs available in Docker, if applicable - the full error will be in there.

I deployed with Docker as the guide describes, and I am using a local model (vicuna-13b-v1.5). When I tried to upload an .mp4 file, the log showed:

[Conversion Required] .mp4 file detected - converting to .wav
[Conversion Processing]: 4KB converted
[Conversion Processing]: 37284KB converted
[Conversion Complete]: File converted to .wav!
Failed to load the native whisper model: TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11730:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at runNextTicks (node:internal/process/task_queues:64:3)
    at process.processImmediate (node:internal/timers:447:9)
    at async getModelFile (file:///app/collector/node_modules/@xenova/transformers/src/utils/hub.js:471:24)
    at async getModelJSON (file:///app/collector/node_modules/@xenova/transformers/src/utils/hub.js:575:18)
    at async Promise.all (index 0)
    at async loadTokenizer (file:///app/collector/node_modules/@xenova/transformers/src/tokenizers.js:52:16)
    at async AutoTokenizer.from_pretrained (file:///app/collector/node_modules/@xenova/transformers/src/tokenizers.js:3940:48)
    at async Promise.all (index 0) {
  cause: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (node:internal/deps/undici/undici:6869:28)
      at node:internal/deps/undici/undici:6825:50
      at Immediate._onImmediate (node:internal/deps/undici/undici:6857:13)
      at process.processImmediate (node:internal/timers:476:21) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
  }
}
node:internal/process/promises:288
            triggerUncaughtException(err, true /* fromPromise */);
            ^

TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11730:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at runNextTicks (node:internal/process/task_queues:64:3)
    at process.processImmediate (node:internal/timers:447:9)
    at async getModelFile (file:///app/collector/node_modules/@xenova/transformers/src/utils/hub.js:471:24)
    at async getModelJSON (file:///app/collector/node_modules/@xenova/transformers/src/utils/hub.js:575:18)
    at async Promise.all (index 0)
    at async loadTokenizer (file:///app/collector/node_modules/@xenova/transformers/src/tokenizers.js:52:16)
    at async AutoTokenizer.from_pretrained (file:///app/collector/node_modules/@xenova/transformers/src/tokenizers.js:3940:48)
    at async Promise.all (index 0) {
  cause: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (node:internal/deps/undici/undici:6869:28)
      at node:internal/deps/undici/undici:6825:50
      at Immediate._onImmediate (node:internal/deps/undici/undici:6857:13)
      at process.processImmediate (node:internal/timers:476:21) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
  }
}

When I tried to upload a PDF file, the UI reported an error while processing the file. The log showed:

Warning: Indexing all PDF objects
Error
    at InvalidPDFExceptionClosure (/app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:452:35)
    at Object.<anonymous> (/app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:455:2)
    at __w_pdfjs_require__ (/app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:45:30)
    at Object.<anonymous> (/app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:7939:23)
    at __w_pdfjs_require__ (/app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:45:30)
    at /app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:88:18
    at /app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:91:10
    at webpackUniversalModuleDefinition (/app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:18:20)
    at Object.<anonymous> (/app/collector/node_modules/pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js:25:3)
    at Module._compile (node:internal/modules/cjs/loader:1356:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1414:10)
    at Module.load (node:internal/modules/cjs/loader:1197:32)
    at Module._load (node:internal/modules/cjs/loader:1013:12)
    at ModuleWrap.<anonymous> (node:internal/modules/esm/translators:202:29)
    at ModuleJob.run (node:internal/modules/esm/module_job:195:25)
    at async ModuleLoader.import (node:internal/modules/esm/loader:336:24) {
  message: 'Invalid PDF structure'
}

A .docx file also failed:

-- Working SPG-Maoxin Shen-User manual.docx --
Error: Can't find end of central directory : is this a zip file ? If it is, see https://stuk.github.io/jszip/documentation/howto/read_zip.html
    at ZipEntries.readEndOfCentral (/app/collector/node_modules/jszip/lib/zipEntries.js:167:23)
    at ZipEntries.load (/app/collector/node_modules/jszip/lib/zipEntries.js:255:14)
    at /app/collector/node_modules/jszip/lib/load.js:48:24
-- Working SPG-Maoxin Shen-User manual.docx --
Error: Can't find end of central directory : is this a zip file ? If it is, see https://stuk.github.io/jszip/documentation/howto/read_zip.html
    at ZipEntries.readEndOfCentral (/app/collector/node_modules/jszip/lib/zipEntries.js:167:23)
    at ZipEntries.load (/app/collector/node_modules/jszip/lib/zipEntries.js:255:14)
    at /app/collector/node_modules/jszip/lib/load.js:48:24

A .pptx file failed to upload as well.

-- Working 2024 SRCX R&D E3 Management Guide.pptx --
Could not parse office or office-like file [OfficeParser]: Your file officeParserTemp/tempfiles/170625926973600000.pptx seems to be corrupted. If you are sure it is fine, please create a ticket in Issues on github with the file to reproduce error.
Resulting text content was empty for 2024 SRCX R&D E3 Management Guide.pptx.

Finally, I tested a large .md file (about 89 MB). It uploaded successfully but failed during embedding.

Document cc.md uploaded processed and successfully. It is now available in documents.
[TELEMETRY SENT] {
  event: 'document_uploaded',
  distinctId: '5a9be5db-2681-43cb-bfbb-8722eaa85ec4',
  properties: { runtime: 'docker' }
}
Adding new vectorized document into namespace chao1.chen
Chunks created from document: 100334
LocalAI:listModels Request failed with status code 500
addDocumentToNamespace LocalAI Failed to embed: [500]: Request failed with status code 500
Failed to vectorize cc.md

In fact, only small .txt and .md files uploaded successfully. I'm not sure whether the problem is caused by the files being too large or by unsupported file formats.
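One quick way to tell "too large" apart from "wrong format" is to check what the files actually are before uploading. A sketch (file names below are placeholders for the files in question):

```shell
# Report the real file type as detected from the file contents, not the extension.
file big.pdf manual.docx guide.pptx

# A valid PDF starts with the "%PDF" magic bytes.
head -c 4 big.pdf

# A valid .docx/.pptx is a ZIP archive and starts with "PK" (ZIP magic bytes).
head -c 2 manual.docx
```

If `head` shows something other than these magic bytes, the file is corrupt or mislabeled rather than merely large.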

timothycarambat commented 5 months ago

Almost all of this seems to be from the collector.

  1. The mp4 file failed because your container could not download the Hugging Face model used to transcribe audio from mp4s. We have seen this happen for users behind VPNs, on IPs where Hugging Face access is restricted, or when running the Docker container in emulation mode.

  2. The other failures look like corrupt or invalid documents. Are you sure these files are valid? Can you open the docx in Word or Google Docs? There is no file-size limitation; the issue appears to be with the files themselves.
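For the first point, a connectivity check from inside the container can confirm whether Hugging Face is reachable. A sketch (the container name is a placeholder; adjust it to your deployment):

```shell
# A hang or timeout here reproduces the UND_ERR_CONNECT_TIMEOUT in the logs above.
docker exec anythingllm curl -sI --max-time 10 -o /dev/null \
  -w "%{http_code}\n" https://huggingface.co || echo "cannot reach huggingface.co"
```

Any 2xx/3xx status code means the container can reach Hugging Face and the model download should succeed on retry.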

There is also a chance the documents were corrupted during upload. If you are running Docker as outlined in our docs, could you open the storage/documents folder and check that your files are there and openable?

The last error is an internal server error from LocalAI, not from AnythingLLM. If LocalAI is returning a 500, something is wrong with your LocalAI instance; that is outside the scope of this repo, and the logs in LocalAI will tell you what failed on that end.
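Since LocalAI exposes an OpenAI-compatible API, you can query its model list directly to isolate the fault. A sketch (the base URL and container name are placeholders; use whatever you configured in AnythingLLM):

```shell
# 200 means LocalAI answers the same listModels call AnythingLLM makes;
# 500 confirms the fault is inside LocalAI itself.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/v1/models

# If it is 500, inspect LocalAI's own logs for the root cause.
docker logs local-ai --tail 50
```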