📃 Paper (KDD'23) • 🌐 Chinese README • 🤗 HF Repo [WebGLM-10B] [WebGLM-2B] • 📚 Dataset [WebGLM-QA]
This is the official implementation of WebGLM. If you find our open-sourced efforts useful, please ⭐ the repo to encourage our future development!
[Please click to watch the demo!]
_Read this in Chinese._
[2023/06/25] Release ChatGLM2-6B, an updated version of ChatGLM-6B that introduces several new features.
For more details, please refer to ChatGLM2-6B.
WebGLM aspires to provide an efficient and cost-effective web-enhanced question-answering system using the 10-billion-parameter General Language Model (GLM). It aims to improve real-world application deployment by integrating web search and retrieval capabilities into the pre-trained language model.
Clone this repo, and install python requirements.
pip install -r requirements.txt
Install Nodejs.
apt install nodejs # If you use Ubuntu
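Before moving on, you can confirm that nodejs is actually on your PATH. This is a generic shell check, not part of the repository:

```shell
# Confirm nodejs is installed and on PATH before proceeding.
if command -v node >/dev/null 2>&1; then
  node --version
else
  echo "nodejs not found; install it first"
fi
```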
Install playwright dependencies.
playwright install
If browser environments are not installed on your host, you need to install them. Do not worry: Playwright will print installation instructions the first time you run it.
During the search process, we use SerpAPI to get search results. You need to get a SerpAPI key from here.
Then, set the environment variable SERPAPI_KEY to your key.
export SERPAPI_KEY="YOUR KEY"
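To avoid a confusing failure later, you can verify that the variable is actually set in your current shell. This is a generic check, not part of the repository:

```shell
# Fail fast if the key is missing; the demo's SerpAPI calls require it.
status=$([ -n "$SERPAPI_KEY" ] && echo set || echo missing)
echo "SERPAPI_KEY is $status"
```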
Alternatively, you can use Bing search with a local browser environment (playwright). You can add --searcher bing to the start command to use Bing search. (See Run as Command Line Interface and Run as Web Service.)
Download the checkpoint on Tsinghua Cloud by running the command line below.
You can manually specify the path to save the checkpoint with --save SAVE_PATH.
python download.py retriever-pretrained-checkpoint
Before you run the code, make sure your device has enough disk space.
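One way to check with standard tools (the checkpoint is on the order of gigabytes; the exact size depends on what you download):

```shell
# Print the available space on the filesystem holding the current directory.
df -h . | tail -n 1 | awk '{print "Available space: " $4}'
```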
Export the environment variable WEBGLM_RETRIEVER_CKPT to the path of the retriever checkpoint. If you have downloaded the retriever checkpoint to the default path, you can simply run the command line below.
export WEBGLM_RETRIEVER_CKPT=./download/retriever-pretrained-checkpoint
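A small sanity check that the exported path actually exists (this assumes the default save location from download.py; adjust it if you used --save):

```shell
# Warn early if the checkpoint directory is absent.
ckpt="${WEBGLM_RETRIEVER_CKPT:-./download/retriever-pretrained-checkpoint}"
if [ -d "$ckpt" ]; then
  echo "retriever checkpoint found: $ckpt"
else
  echo "retriever checkpoint not found: $ckpt"
fi
```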
You can try the WebGLM-2B model by:
python cli_demo.py -w THUDM/WebGLM-2B
Or directly for the WebGLM-10B model:
python cli_demo.py
If you want to use Bing search instead of SerpAPI, you can add --searcher bing
to the command line, for example:
python cli_demo.py -w THUDM/WebGLM-2B --searcher bing
Run web_demo.py with the same arguments as cli_demo.py to start a web service.
For example, you can try the WebGLM-2B model with Bing search by:
python web_demo.py -w THUDM/WebGLM-2B --searcher bing
Download the training data (WebGLM-QA) on Tsinghua Cloud by running the command line below.
python download.py generator-training-data
It will automatically download all the data and preprocess it into the seq2seq form that can be used immediately, under ./download.
Please refer to GLM repo for seq2seq training.
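For orientation, seq2seq training data is typically a parallel source/target pair: the source holds the question plus retrieved references, and the target holds the cited answer. The file names and prompt layout below are illustrative assumptions, not the exact schema that download.py produces:

```shell
# Write a toy source/target pair in a parallel-file seq2seq layout
# (illustrative only; see the preprocessed files in ./download for the real format).
mkdir -p /tmp/webglm-toy
printf 'Question: What is GLM? References: [1] GLM is a pretrained general language model.\n' > /tmp/webglm-toy/train.source
printf 'GLM is a pretrained general language model [1].\n' > /tmp/webglm-toy/train.target
paste /tmp/webglm-toy/train.source /tmp/webglm-toy/train.target
```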
Download the training data on Tsinghua Cloud by running the command line below.
python download.py retriever-training-data
Run the following command line to train the retriever. If you have downloaded the retriever training data in the default path, you can simply run the command line below.
python train_retriever.py --train_data_dir ./download/retriever-training-data
You can reproduce our results on TriviaQA, WebQuestions, and NQ Open. Taking TriviaQA as an example, you can simply run the command line below to start the experiment:
bash scripts/triviaqa.sh
Here you can see some examples of WebGLM real application scenarios.
This repository is licensed under the Apache-2.0 License. The use of model weights is subject to the Model_License. All open-sourced data is for research purposes only.
If you use this code for your research, please cite our paper.
@misc{liu2023webglm,
title={WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences},
author={Xiao Liu and Hanyu Lai and Hao Yu and Yifan Xu and Aohan Zeng and Zhengxiao Du and Peng Zhang and Yuxiao Dong and Jie Tang},
year={2023},
eprint={2306.07906},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This repo is simplified for easier deployment.