Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystems and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Crawling JSON #601

Open coprisanu opened 5 years ago

coprisanu commented 5 years ago

Hi,

We need to crawl a JSON file and split its content into smaller documents to be indexed in Elasticsearch. We noticed there are already splitter implementations such as CsvSplitter, DOMSplitter, and PDFPageSplitter; is there one for JSON?

Thank you

essiembre commented 5 years ago

No, there currently isn't one. Good idea, though. I will mark this as a feature request. In the meantime, if you know Java, you can implement your own solution by extending AbstractDocumentSplitter (feel free to share).
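
As a stopgap, here is a minimal, standalone sketch of the core splitting step only, assuming Jackson (jackson-databind) is on the classpath and a hypothetical helper class named JsonArraySplitExample: it reads a JSON array and turns each element into its own small JSON document. Wiring this logic into a subclass of AbstractDocumentSplitter is intentionally left out, since the exact splitter method signatures depend on the Importer version you are running.

```java
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonArraySplitExample {

    /**
     * Splits a JSON array into one JSON string per element.
     * A single JSON object is returned as-is (one document).
     */
    public static List<String> split(String json) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json);
        List<String> docs = new ArrayList<>();
        if (root.isArray()) {
            for (JsonNode element : root) {
                docs.add(mapper.writeValueAsString(element));
            }
        } else {
            docs.add(mapper.writeValueAsString(root));
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        String json = "[{\"id\":1,\"title\":\"first\"},{\"id\":2,\"title\":\"second\"}]";
        for (String doc : split(json)) {
            // Each element would become its own child document,
            // indexed separately in Elasticsearch.
            System.out.println(doc);
        }
    }
}
```

Inside a custom splitter, each returned string would be emitted as a child document (with its own reference and metadata) so that downstream committers index the records individually rather than the whole file.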