Add a new option "allow-non-html-contents" to the webcrawler-source
Please note that in LangStream the source is expected only to emit records; it is up to the next agent in the pipeline to extract the text or manipulate the contents.
The "text-extractor" agent already handles PDF documents well, thanks to Apache Tika.
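
For illustration, a pipeline using the new option might look roughly like the sketch below. This is a minimal example under assumptions: the seed URL and domain are placeholders, the option is assumed to take a boolean value, and the step names are made up. Only the "webcrawler-source" and "text-extractor" agent types and the "allow-non-html-contents" option name come from this change.

```yaml
pipeline:
  # Source: only emits the raw records (HTML pages and, with the new option, other content types)
  - name: "Crawl the website"            # hypothetical step name
    type: "webcrawler-source"
    configuration:
      seed-urls: ["https://example.com/docs/"]      # placeholder URL
      allowed-domains: ["https://example.com"]      # placeholder domain
      allow-non-html-contents: true                 # assumed to be a boolean flag
  # Next agent: extracts text from the emitted records (PDF handled via Apache Tika)
  - name: "Extract text"                 # hypothetical step name
    type: "text-extractor"
```

With this layout the crawler stays a plain emitter of records, and any format-specific handling lives in the downstream agent.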