FudanDISC / DISC-LawLLM

[中文法律大模型] DISC-LawLLM: an intelligent legal system powered by large language models (LLMs) to provide a wide range of legal services.
Apache License 2.0
552 stars 64 forks

Embedding model in the retrieval module #7

Closed: Shulin-Zhang closed this issue 12 months ago

Shulin-Zhang commented 1 year ago

How exactly is the retrieval implemented, and which embedding model do you use?

yueshengbin commented 1 year ago

Text2vector was used in the initial version

SUSTech-TP commented 1 year ago

About the retrieval-augmentation part. Question 1: what is the top-k used in retrieval (the value of k)? Question 2: the retrieval-augmentation figure (Fig. 3) shows the knowledge base containing three kinds of content (statutes, cases, and literature); will this data be released, and how are cases retrieved? Question 3: are the SFT training and the retrieval augmentation of DISC-LawLLM trained as two separate parts? Is DISC-Law-SFT-Triplet used during SFT training, or only DISC-Law-SFT-Pair? Thanks!

Charlie-XIAO commented 1 year ago

@SUSTechIR Here are the answers to your questions:

  1. @yueshengbin knows better about this.
  2. The knowledge base will not be explicitly released. It currently consists of various Chinese laws and judicial examinations, and we will continuously expand it in the future. The retrieval of cases is no different from that of other types: we keep a vector database and use Text2vector for retrieval.
  3. DISC-Law-SFT-Pair and DISC-Law-SFT-Triplet are both used for training. DISC-Law-SFT-Triplet helps enhance the model's ability to use external knowledge, thus working better with the retrieval module.
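The vector-database lookup described in point 2 can be sketched as follows. This is only an illustrative toy: the real system uses Text2vector embeddings and a proper vector store, whereas the `embed` function here is a bag-of-characters stand-in so the example stays self-contained.

```python
import math

def embed(text):
    # Stand-in for a real embedding model such as Text2vector;
    # a toy bag-of-characters vector, purely for illustration.
    vec = {}
    for ch in text:
        vec[ch] = vec.get(ch, 0) + 1
    return vec

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, knowledge_base, k=3):
    # Rank knowledge-base entries (statutes, cases, literature)
    # by embedding similarity to the query and return the top-k.
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

docs = ["contract law statute", "criminal case record", "legal literature review"]
print(retrieve("contract statute", docs, k=1))  # → ['contract law statute']
```

As the answer notes, cases are retrieved exactly the same way as statutes or literature: everything lives in one embedded collection and is ranked by similarity.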

SUSTech-TP commented 1 year ago

@Charlie-XIAO Thanks for the very quick reply. To state my points clearly I am asking in Chinese; I hope you'll understand.

On my second question, I still don't fully understand the retrieval-augmentation part. Your technical report mentions using the LangChain framework? Do you mean that text2vector is specified as the encoder inside that framework, and that its parameters are updated along with the whole training process?

My third question is similar. As I understand it, DISC-Law-SFT-Pair contains no references, right? So DISC-Law-SFT-Pair cannot take part in retrieval augmentation? Did you first run SFT with DISC-Law-SFT-Pair and DISC-Law-SFT-Triplet, and then train retrieval augmentation separately? And does that retrieval-augmentation part still use the DISC-Law-SFT-Triplet data?

One more question, number four: the Subjective Perspective part of your technical report (Figure 4) says ChatGPT-4, while other parts of the paper say GPT-3.5. Is that a typo? Thanks for your reply.

Charlie-XIAO commented 1 year ago

@SUSTechIR It's okay to ask in Chinese; I'm replying in English for consistency with other issues.

  1. Please refer to langchain-chatchat. The retrieval module is currently unrelated to the training process, though we may investigate further into how the retrieval module and model training can be integrated to achieve better performance.
  2. Yes, DISC-Law-SFT-Pair does not include references while DISC-Law-SFT-Triplet does. But again, the retrieval module is unrelated to the training process of DISC-LawLLM in our current version. DISC-Law-SFT-Triplet is only intended to familiarize the model with the "format". For instance, some instructions include a reference, with the output using that reference in some form. Then at inference time, we take the user input, use the retrieval module to find some references, and combine them with the original user input before giving it to the model; we then expect the model to use those references. In other words, in the current version the retriever participates only in inference, not in training: we retrieve relevant content from the database, concatenate it with the user's input, and feed the result to the model. We want the model to correctly recognize which part is the reference and use it appropriately, and the purpose of DISC-Law-SFT-Triplet is precisely to familiarize the model with this "question with reference" style of prompting.
  3. This does not matter much. GPT-4 would be a better referee during evaluation, but GPT-3.5 Turbo already performs well with a properly designed prompt. Due to pricing, we used GPT-3.5 Turbo to obtain the current data.
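The inference-time flow described in point 2 can be sketched like this. The template below is an assumption chosen for illustration, not the actual DISC-LawLLM prompt format: the point is only that retrieved references are concatenated with the user input before the model sees it.

```python
def build_prompt(user_input, retrieved_references):
    # Hypothetical template: prepend retrieved references to the user
    # input so the model (familiarized with this layout via
    # DISC-Law-SFT-Triplet) can tell which part is reference material.
    ref_block = "\n".join(
        f"[Reference {i}] {ref}" for i, ref in enumerate(retrieved_references, 1)
    )
    return f"{ref_block}\n\n[Question] {user_input}"

prompt = build_prompt(
    "What is the limitation period for civil claims?",
    ["Civil Code, Article 188: the general limitation period is three years."],
)
print(prompt)
```

No model weights are touched here, which matches the answer above: the retriever only shapes the input at inference time, and training merely teaches the model to make use of whatever lands in the reference slot.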

SUSTech-TP commented 1 year ago

@Charlie-XIAO OK, understood. Thank you very much for the explanation.

Geaming2002 commented 1 year ago

Hello, why was the model tested in the experiments without retrieval augmentation? Was any benchmark test conducted with retrieval augmentation?

Charlie-XIAO commented 1 year ago

The retrieval module is only an experimental feature for now. We will continuously improve it and expand its database, so no formal evaluation has been done with it yet.

@yueshengbin may know this better.

SUSTech-TP commented 1 year ago

By the way, do you have experimental results for DISC-LawLLM tested w/o retrieval augmentation? Can they be published? And could you share the test data? Thanks!

Charlie-XIAO commented 1 year ago

@SUSTechIR It is only an experimental feature for now, so we will not release an evaluation for it. However, there is indeed improvement that can be directly observed.

Charlie-XIAO commented 12 months ago

Closing as completed due to a long period of inactivity.