[webcrawler] Support emitting non HTML documents (like PDFs...)

LangStream / langstream

LangStream. Event-Driven Developer Platform for Building and Running LLM AI Apps. Powered by Kubernetes and Kafka.

https://langstream.ai

Apache License 2.0


[webcrawler] Support emitting non HTML documents (like PDFs...) #739

Closed: eolivelli closed this issue 10 months ago

eolivelli commented 10 months ago

Summary:

Add a new option, "allow-non-html-contents", to the webcrawler-source.

Please note that in LangStream the source is expected only to emit the records; it is up to the next agent in the pipeline to extract the text or manipulate the contents.

The "text-extractor" agent already handles PDF documents quite well, thanks to Apache Tika.
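
For illustration, a pipeline using the option might look roughly like the fragment below. This is only a sketch, not the final syntax: the option name comes from this issue, but the boolean value, the agent names, and the example URLs are assumptions, and the other webcrawler-source settings (storage, crawl limits, and so on) are left out.

```yaml
pipeline:
  # Source: only emits the raw records, including non-HTML ones such as PDFs.
  - name: "Crawl the website"          # hypothetical agent name
    type: "webcrawler-source"
    configuration:
      seed-urls: ["https://example.com/docs/"]      # placeholder URL
      allowed-domains: ["https://example.com"]      # placeholder domain
      # Option proposed in this issue; a boolean value is assumed here.
      allow-non-html-contents: true
  # Next agent: extracts the text from whatever the source emitted (via Apache Tika).
  - name: "Extract text"               # hypothetical agent name
    type: "text-extractor"
```

The point of this layout is that the crawler does not try to interpret a PDF itself; it passes the content along, and the "text-extractor" step that follows turns it into text.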