dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Allow using an external processor to process data #1816

Open dadoonet opened 7 months ago

dadoonet commented 7 months ago

For example, we could imagine generating embeddings from a given document, let's say a directory full of images. Not sure how flexible this can be...

Using https://github.com/langchain4j/langchain4j might help here.
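For instance, here is a minimal sketch of what such a processor could look like with langchain4j, assuming the bundled MiniLM embedding model from the langchain4j-embeddings-all-minilm-l6-v2 artifact and plain extracted text as input. The EmbeddingProcessor class and the idea of storing the result in a "content_vector" field are hypothetical, not existing FSCrawler code:

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
// The package of this model has moved between langchain4j versions;
// adjust the import to match the version you depend on.
import dev.langchain4j.model.embedding.onnx.allminilml6v2.AllMiniLmL6V2EmbeddingModel;

/**
 * Hypothetical hook: given the text extracted by Tika for a document,
 * return an embedding vector that could be added to the JSON document
 * FSCrawler sends to Elasticsearch (e.g. as a "content_vector" field).
 */
public class EmbeddingProcessor {

    private final EmbeddingModel model = new AllMiniLmL6V2EmbeddingModel();

    public float[] embed(String extractedContent) {
        Embedding embedding = model.embed(extractedContent).content();
        return embedding.vector();
    }
}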

Morphus1 commented 1 month ago

I just added another crawler that takes the output of the Tika process (Doc.content) and processes it using llama.cpp. It creates embeddings at the sentence level, then aggregates/averages them up to the paragraph and document level, classifies them against an embedded list of descriptions using cosine similarity, and adds the data to the doc class, which FSCrawler then indexes as normal.
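In rough outline, the aggregation and classification steps come down to something like this dependency-free sketch (all names are illustrative, and the llama.cpp call that produces the sentence vectors is left out):

import java.util.List;

public class VectorMath {

    /** Average equally sized sentence vectors into a paragraph/document vector. */
    public static float[] average(List<float[]> vectors) {
        int dims = vectors.get(0).length;
        float[] avg = new float[dims];
        for (float[] v : vectors) {
            for (int i = 0; i < dims; i++) {
                avg[i] += v[i];
            }
        }
        for (int i = 0; i < dims; i++) {
            avg[i] /= vectors.size();
        }
        return avg;
    }

    /** Cosine similarity used to score a vector against each label description. */
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}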

I'm running into issues with the bulk processor sometimes for large docs (10,000+ arrays of 4096 floats each), and Kibana doesn't like searching through all that data either.

dadoonet commented 1 month ago

@Morphus1 I'd love to hear more about what you did exactly. I think it could be a good documentation addition as well.

I'm running into issues with the bulk processor sometimes for large docs (10,000+ arrays of 4096 floats each), and Kibana doesn't like searching through all that data either.

One recommendation: exclude the vector field from the _source. The vector stays indexed and searchable, but it is no longer returned in _source, which should solve the Kibana issue. Something like:

{
  "mappings": {
    "_source": {
      "excludes": [
        "content_vector"
      ]
    }
  }
}

For the bulk part, indeed, I guess it could fail on the FSCrawler side depending on the heap you allocated to FSCrawler, or it could be rejected by Elasticsearch if the content size is too big for the HTTP request.

You might want to tune the bulk settings a bit:

name: "test"
elasticsearch:
  bulk_size: 1000
  byte_size: "10mb"
  flush_interval: "10s"