To install the backend dependencies, execute:
cd backend
pip install -r requirements.txt
To use the doc2query implementation, download the models by running:
pip install --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git
pip install --upgrade git+https://github.com/terrierteam/pyterrier_dr.git
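For orientation, here is a minimal sketch of document expansion with pyterrier_doc2query, based on that package's upstream README; the exact arguments and how this project wires it into indexing may differ:

import pandas as pd
from pyterrier_doc2query import Doc2Query

# Generate queries for each document and append them to its text.
doc2query = Doc2Query(append=True, num_samples=5)
docs = pd.DataFrame([{"docno": "d1", "text": "The crawler stores pages as pickle files."}])
expanded = doc2query.transform(docs)  # 'text' now also contains the generated queries
print(expanded.iloc[0]["text"])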
To run the crawler, execute:
cd backend/core
python Crawler.py
The crawled documents are stored as pickle files inside serialization/documents. The state of the crawler is also stored in this folder as crawl_state.pickle. When a crawl_state file already exists, the crawler loads it and resumes crawling from the saved frontier.
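As a rough illustration of that resume behaviour (a sketch only; the state keys are assumptions, not Crawler.py's actual structure):

import os
import pickle

STATE_PATH = "serialization/documents/crawl_state.pickle"

def load_frontier(seed_urls):
    # Resume a previous crawl if a state file exists; otherwise start fresh.
    if os.path.exists(STATE_PATH):
        with open(STATE_PATH, "rb") as f:
            state = pickle.load(f)
        return state["frontier"], state["visited"]  # hypothetical keys
    return list(seed_urls), set()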
To index all documents, execute:
cd backend/core
python DocumentIndex.py
The generated index file is stored at serialization/documents/index.pickle.
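If you want to inspect the index outside the server, it can be loaded like any pickle file (a sketch; the structure of the pickled object depends on DocumentIndex.py):

import pickle

with open("serialization/documents/index.pickle", "rb") as f:
    index = pickle.load(f)

print(type(index))  # explore the object from here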
The topic model is needed to rerank documents for diversity. A model trained on already-crawled documents is stored at serialization/ldamodel.pickle. To train a new model on your crawled documents, execute:
cd backend/core
python LDAmodel.py
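LDAmodel.py presumably trains a latent Dirichlet allocation model over the crawled text; which library backs it is not stated here. A minimal sketch with gensim (an assumption) of how such a model can be trained and pickled:

import pickle
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenized corpus standing in for the crawled documents (hypothetical data).
docs = [["search", "engine", "index"], ["crawler", "frontier", "state"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=5)

with open("serialization/ldamodel.pickle", "wb") as f:
    pickle.dump(lda, f)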
To start the Django REST server, execute:
cd backend/SearchEngineServer
python manage.py runserver
This will automatically load the index inside the serialization folder. The server will listen on port 8000 to answer query requests.
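Once the server is up, a query request might look like this (the endpoint path and parameter name are assumptions; check SearchEngineServer's URL configuration for the real route):

import requests

# Hypothetical route and parameter name.
response = requests.get("http://localhost:8000/search", params={"query": "example query"})
print(response.json())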
To start the frontend on port 3000, execute:
cd frontend
npm install
npm start
To run batch retrieval over a set of queries, execute:
cd backend/core
python batch_retrieve.py path/to/queryFile.txt path/to/resultList.txt
The query file should contain one query number and one query per line, separated by a tab.
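For example, a query file with two queries might look like this, with a tab between the number and the query text (the contents are hypothetical):

1	best restaurants
2	hiking trails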
To start the DataCreator Streamlit app, execute:
cd backend/core
streamlit run DataCreator.py