garganshulgarg / hybrisEnhancedSearch

This project aims to extend the Solr search delivered out of the box (OOTB) with SAP Hybris. The main additions will be content search, PDF search, etc.

PDF Search #5

Open garganshulgarg opened 6 years ago

garganshulgarg commented 6 years ago

All PDFs present on the e-commerce site should be searchable from the search box.

garganshulgarg commented 6 years ago

High-Level Plan for extracting PDFs in Hybris.

  1. Create a separate itemType to hold all the documents. That itemType should extend Media.
  2. All PDFs will be stored as DocMedia items in Hybris. DocMedia will carry some additional attributes that let us restrict which documents get indexed.
  3. We will use Apache Tika. These links cover some high-level concepts we may need to follow to enhance the search:
     - https://dzone.com/articles/solr-and-tika-integration-part
     - https://gist.github.com/johnmiedema/11224886
     - https://dzone.com/articles/apache-tika-and-apache-opennlp-for-easy-pdf-parsin (looks like a good read)
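
To make step 2 more concrete, here is a rough sketch of how the indexing restriction might look in plain Java. It does not use any Hybris API: the `DocMedia` class below is only a stand-in for the proposed item type, and the `searchable` flag and the MIME-type check are assumptions about which "additional attributes" we might end up with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocMediaIndexFilter {

    // Hypothetical stand-in for the proposed DocMedia item type (not a Hybris class).
    public static final class DocMedia {
        public final String code;
        public final String mime;
        public final boolean searchable; // assumed restriction attribute

        public DocMedia(String code, String mime, boolean searchable) {
            this.code = code;
            this.mime = mime;
            this.searchable = searchable;
        }
    }

    // Keeps only documents that are PDFs and are flagged as searchable.
    public static List<DocMedia> selectForIndexing(List<DocMedia> docs) {
        List<DocMedia> result = new ArrayList<>();
        for (DocMedia doc : docs) {
            if (doc.searchable && "application/pdf".equalsIgnoreCase(doc.mime)) {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<DocMedia> docs = Arrays.asList(
                new DocMedia("user-manual", "application/pdf", true),
                new DocMedia("internal-price-list", "application/pdf", false),
                new DocMedia("banner", "image/png", true));

        // Print the codes of the documents that would be sent to the indexer.
        for (DocMedia doc : selectForIndexing(docs)) {
            System.out.println(doc.code);
        }
    }
}
```

In a real extension the filter would more likely live in the query that feeds the indexer, but the idea is the same: only flagged PDFs reach Tika and Solr.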

garganshulgarg commented 6 years ago

Apache Solr supports Apache Tika. A nice read: https://lucidworks.com/2009/09/02/content-extraction-with-tika/

garganshulgarg commented 6 years ago

https://lucene.apache.org/solr/guide/7_2/uploading-data-with-solr-cell-using-apache-tika.html
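
As a starting point, here is a minimal sketch of how a PDF could be sent to the Solr Cell extracting handler described on that page. It only builds the request and does not send it. The `/update/extract` path and the `literal.id` and `commit` parameters come from the Solr Cell documentation; the host, the core name `docs`, and the document id are placeholders, and posting the raw PDF bytes with an `application/pdf` content type is one of several upload styles, so treat this as an assumption to verify against our Solr setup.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class SolrCellRequest {

    // Builds the extracting-handler URL for one document.
    // literal.id sets the id field of the indexed document; commit=true makes it searchable right away.
    public static URI extractUri(String solrBaseUrl, String core, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(solrBaseUrl + "/" + core + "/update/extract?literal.id=" + id + "&commit=true");
    }

    // Wraps the PDF bytes in a POST request; Tika on the Solr side extracts the text.
    public static HttpRequest buildRequest(URI uri, byte[] pdfBytes) {
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values: adjust host, core, and id for a real installation.
        URI uri = extractUri("http://localhost:8983/solr", "docs", "user-manual");
        HttpRequest request = buildRequest(uri, new byte[0]);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would then be a matter of passing it to `java.net.http.HttpClient`, with the bytes read from the DocMedia file.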