TechnionTDK / jbs-ir

Information Retrieval System For The Jewish Bookshelf Project

Introduction

This project is a part of the Jewish Bookshelf ecosystem in the Technion Data and Knowledge Lab.

It integrates several COTS projects to create a search engine for JBS data. The data is pulled from the jbs-text repository, which contains JSON objects that describe the JBS data.

The COTSs which are integrated as part of the search engine

We also deliver two tools for administrators

In this README you will learn

Repository content

Installation

In order to bring up the search engine on a new machine, we need to have the following components integrated

Solr

Installing Solr

  1. Follow this guide to install Solr on your machine: Solr installation guide
  2. From the Solr directory, start Solr with the bin/solr start command. This starts the Solr server in the background, listening for requests on port 8983
  3. Check that Solr started correctly with the bin/solr status command
  4. Create a new core with the bin/solr create -c <core-name> command (for example: bin/solr create -c jbs-ir)
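
Steps 2 to 4 above can be chained in a small shell function. This is only a sketch: the function name `setup_solr_core` and its two arguments (the Solr installation directory and the core name) are assumptions made for illustration and are not part of jbs-ir.

```shell
# Sketch only: setup_solr_core is a hypothetical helper, not part of jbs-ir.
# Usage: setup_solr_core /path/to/solr jbs-ir
setup_solr_core() {
  solr_dir="$1"
  core_name="$2"

  if [ -z "$core_name" ]; then
    echo "usage: setup_solr_core <solr-dir> <core-name>" >&2
    return 2
  fi
  if [ ! -x "$solr_dir/bin/solr" ]; then
    echo "bin/solr not found under: $solr_dir" >&2
    return 1
  fi

  # Run in a subshell so the caller's working directory is left unchanged.
  # The chain stops at the first command that fails.
  (
    cd "$solr_dir" || exit 1
    # Step 2: start the server in the background (port 8983 by default).
    bin/solr start &&
    # Step 3: confirm that the server is running.
    bin/solr status &&
    # Step 4: create the core.
    bin/solr create -c "$core_name"
  )
}
```

For example, `setup_solr_core "$HOME/solr" jbs-ir` would start Solr from that directory and create the jbs-ir core, assuming Solr was unpacked there.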

Solr supports multiple cores under one Solr instance. Each core can be addressed by appending the core name to the Solr URL: http://<machine-name>:<Solr-port>/solr/<core-name>. For example: http://tdk2.cs.technion.ac.il:8983/solr/jbs-ir.
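
Since the same core address is reused later in this README, it can help to build it in one place. The helper below is a minimal sketch; the name `solr_core_url` is hypothetical, and the port falls back to 8983 (the port Solr listens on after bin/solr start) when none is given.

```shell
# Hypothetical helper: print the URL of a Solr core.
# Usage: solr_core_url <machine-name> <core-name> [<Solr-port>]
solr_core_url() {
  # ${3:-8983} uses 8983 when no port argument is given.
  printf 'http://%s:%s/solr/%s\n' "$1" "${3:-8983}" "$2"
}

solr_core_url tdk2.cs.technion.ac.il jbs-ir
# prints http://tdk2.cs.technion.ac.il:8983/solr/jbs-ir
```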

Configuring Solr

Before indexing documents with Solr, we need to tell Solr what kind of data we would like to search, store and analyze in our documents and how.

Before modifying Solr core files, we also advise you to read Documents, Fields, and Schema Design.

In this section you will replace two default core files with ones that we modified. To understand what changes have been applied to the files, or to learn how to index additional fields in your documents, visit this Wiki page.

Instructions

HebMorph

The text in the index will be in Hebrew, so we need to use an appropriate Hebrew analyzer.

We encourage you to read this wiki to better understand the changes you are going to make: Understanding Analyzers, Tokenizers, and Filters.

We chose to use the HebMorph Hebrew analyzer: HebMorph GitHub repository.

To integrate HebMorph into your Solr core, follow SOLR-README.md in the HebMorph GitHub repository. We provide here practical integration guidelines that should suffice:

Indexing documents from jbs-text using jbs-ir

Indexing is done according to the managed-schema file discussed above. To index the relevant documents with Solr, follow these steps.

Admin UI

You can use the Solr Admin UI for running queries, analysis and viewing core details. Please visit Overview of the Solr Admin UI for more information.

You can access the Admin UI at: http://<machine-name>:<Solr-port>/solr/.

To access a specific core: http://<machine-name>:<Solr-port>/solr/#/<core-name>.

You can read about the most useful features we found in the Admin UI in Useful Admin UI Features.

Velocity UI for searching

There is a basic UI for searching which you can access at: http://<machine-name>:<Solr-port>/solr/<core-name>/browse.
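
A quick way to confirm that the search UI is being served is to request the /browse page and look at the HTTP status code. The sketch below assumes `curl` is installed; the function names `browse_url` and `check_browse` are hypothetical and not part of jbs-ir.

```shell
# Hypothetical helper: print the /browse URL of a core.
# Usage: browse_url <machine-name> <Solr-port> <core-name>
browse_url() {
  printf 'http://%s:%s/solr/%s/browse\n' "$1" "$2" "$3"
}

# Print only the HTTP status code returned for the /browse page.
# -s silences progress output, -o /dev/null discards the page body,
# -w prints the status code after the request completes.
check_browse() {
  curl -s -o /dev/null -w '%{http_code}\n' "$(browse_url "$1" "$2" "$3")"
}

browse_url localhost 8983 jbs-ir
# prints http://localhost:8983/solr/jbs-ir/browse
```

Running `check_browse localhost 8983 jbs-ir` against a running server should print the status code of the page; a code other than 200 suggests the core or the Velocity UI is not set up yet.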

We wanted to make some adjustments to that UI so that it presents the data in a friendlier way.

Solr uses Velocity for its web UIs, so we worked on top of the example files under <solr-home-dir>/example/files/conf/velocity.

For more information about the files we changed to configure the UI, read the following wiki page: [Useful Velocity files](https://github.com/TechnionTDK/jbs-ir/wiki/Useful-Velocity-files)

To read more about Velocity, go to The Apache Velocity Project.

Get the Velocity web UI we configured

Evaluation tool

We included an evaluation tool for the Solr search engine. The tool allows the user to automate the evaluation of the engine and extract any required data by:

To use the tool (after cloning this repository), take a look at the execute() method in the JbsIrTestTool.java class. This method demonstrates how the tool can be used.

When running from IntelliJ IDEA (or other IDE)

Before running the Main method in the IDE, you have to configure Program Arguments in Run/Debug configurations to contain your Solr core address in this format: http://<machine-name>:<Solr-port>/solr/<core-name>.

When running from command line

If you changed the code and want to build a new .jar file, run the mvn package command and then do the following:

  1. The .jar can be found in jbs-ir/evaluation/target/evaluation-1.0-jar-with-dependencies.jar
  2. Run: java -jar evaluation-1.0-jar-with-dependencies.jar http://<machine-name>:<Solr-port>/solr/<core-name>
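
The run step above can be wrapped in a small script that checks the core address before launching the tool. This is a sketch under stated assumptions: the function names `is_core_url` and `run_evaluation` are hypothetical, the jar path assumes the command is run from the directory that contains the jbs-ir checkout, and the address check is a loose shape test rather than a full URL validation.

```shell
# Assumption: the current directory contains the jbs-ir checkout and
# mvn package has already produced the jar.
JAR="jbs-ir/evaluation/target/evaluation-1.0-jar-with-dependencies.jar"

# Loose check that the argument looks like
# http://<machine-name>:<Solr-port>/solr/<core-name>.
is_core_url() {
  case "$1" in
    http://*:*/solr/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: validate the core address, then run the tool.
run_evaluation() {
  core_url="$1"
  if ! is_core_url "$core_url"; then
    echo "expected http://<machine-name>:<Solr-port>/solr/<core-name>, got: $core_url" >&2
    return 2
  fi
  if [ ! -f "$JAR" ]; then
    echo "jar not found: $JAR (run mvn package first)" >&2
    return 1
  fi
  java -jar "$JAR" "$core_url"
}
```

For example: `run_evaluation http://tdk2.cs.technion.ac.il:8983/solr/jbs-ir`.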