RUCAIBox / CRSLab

CRSLab is an open-source toolkit for building Conversational Recommender Systems (CRS).
https://github.com/RUCAIBox/CRSLab
MIT License
503 stars, 113 forks

Inspired data Processing #37

Closed ahtsham58 closed 2 years ago

ahtsham58 commented 2 years ago

Dear authors,

I want to run the models implemented in CRSLab on a slightly modified version of the INSPIRED dataset (in the same format as the original). I see that CRSLab ships preprocessed data files derived from the original INSPIRED dataset, so I assume I need to create similar files for the modified data.

Could you please guide me to produce similar files or provide a script that is used to convert the original dataset files?

Thanks in advance!

ahtsham58 commented 2 years ago

Hi @ToheartZhang @batmanfly @turboLJY @Lancelot39

I have modified files such as entity2id.json and movie_ids.json, as well as all the dialog files (the train, valid, and test sets), while keeping exactly the same format as the originals used in the CRSLab toolkit. Specifically, I added new entities, such as people and movies, that appear in the dialog dataset but were missing from the original files. Unfortunately, the model (ReDial) does not train on the modified dataset.

On debugging, I found that my data has been successfully preprocessed and integrated with the data loader, which means my format is correct.

Do you only add entities into entity2id.json that are present in subkg.json?

Is there anything that I am missing to make this modified dataset compatible with the toolkit?

Your quick response will be highly appreciated. Thanks!

Zilize commented 2 years ago

entity2id.json is the mapping file for the knowledge graph subkg.json, so every entity index that appears in subkg.json must also appear in entity2id.json. Formally, the set of entity indices in subkg.json is a subset of the set of entity indices in entity2id.json.
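The subset condition can be checked before training. The following is only a minimal sketch: the function name `missing_kg_entities` is hypothetical and not part of CRSLab, and it assumes that entity2id.json maps entity names to integer indices and that subkg.json maps each head index (as a string) to a list of `[relation, tail]` pairs. If your files are laid out differently, adjust the loop accordingly.

```python
def missing_kg_entities(entity2id, subkg):
    """Return the entity indices used in subkg that are absent from entity2id.

    Assumed layouts (not confirmed by the toolkit's documentation):
      entity2id: {entity_name: index}
      subkg:     {str(head_index): [[relation, tail_index], ...]}
    """
    known = set(entity2id.values())
    used = set()
    for head, edges in subkg.items():
        used.add(int(head))
        for _relation, tail in edges:
            used.add(int(tail))
    # An empty result means the subset condition holds.
    return sorted(used - known)


# Toy data: index 3 appears in the graph but has no entry in the mapping.
toy_entity2id = {"A": 0, "B": 1, "C": 2}
toy_subkg = {"0": [[0, 1]], "1": [[1, 2], [0, 3]]}
print(missing_kg_entities(toy_entity2id, toy_subkg))  # [3]
```

To run it on real files, load both with `json.load` and pass the resulting dictionaries to the function. Note that extra entries in entity2id.json that never appear in subkg.json do not violate the condition.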

One unfortunate thing I should point out is that, due to a limitation of the toolkit, the data pipeline ignores the sociable strategy annotations, which are a key part of the INSPIRED dataset. The pretraining modules of the ReDial model, which play an important role in its performance, are also omitted.

ahtsham58 commented 2 years ago

@Zilize Thanks for your detailed response. Since the entity indices in subkg.json are a subset of those in entity2id.json, I don't think adding more entities to entity2id.json should be a problem. So my question stands: what could prevent CRSLab from working with my data, given that it is preprocessed by the toolkit but never reaches the training or testing stage?

Secondly, I cannot find the sentiment analysis module for the ReDial model, which in the ReDial paper was used as an input to the recommender module. Could you share your insights on this? I can share the INSPIRED data files with you on request.

Thanks in advance!

ahtsham58 commented 2 years ago

I finally figured out the problem, which occurred during the batching process. You may close the issue.