NUSTM / VLP-MABSA


Questions about how to process the original dataset #1

Open jc-ryan opened 2 years ago

jc-ryan commented 2 years ago

Nice work and a nice repository! I still have a few questions about it, though:

  1. Could you please provide more detailed instructions on how to process the original MVSA dataset with the tools you mentioned? For example, which steps you ran with twitter_nlp to perform NER, how you used SentiWordNet to match the opinion words, and what results the processing produces (see the sketch after this list). The same applies to Faster-RCNN and the ANPs extractor.
  2. Could you please provide some sample data from the processed MVSA? It would be great if you could upload a few examples to BaiduNetdisk; with only MVSA_descriptions.txt provided, I have no idea of the exact data format and therefore cannot reproduce the pre-training part of your code.
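
Since the SentiWordNet step is the least documented part, here is a minimal sketch (under my own assumptions, not the authors' actual pipeline) of how opinion words might be matched with NLTK's SentiWordNet interface; the 0.5 score threshold and the token list are illustrative choices.

```python
import nltk
from nltk.corpus import sentiwordnet as swn

# One-time corpus downloads (no-ops if already installed).
nltk.download("sentiwordnet", quiet=True)
nltk.download("wordnet", quiet=True)

def is_opinion_word(word, threshold=0.5):
    # Flag a token as an opinion word if any of its SentiWordNet
    # synsets carries a strong positive or negative score.
    for syn in swn.senti_synsets(word):
        if syn.pos_score() >= threshold or syn.neg_score() >= threshold:
            return True
    return False

tokens = ["the", "amazing", "pizza", "was", "terrible"]
print([t for t in tokens if is_opinion_word(t)])  # e.g. ['amazing', 'terrible']
```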

Thanks a lot!

lyhuohuo commented 2 years ago

Thank you for your questions. I have added some details on processing the pre-training dataset to the README.md; I hope this helps you understand the pre-processing.

PKUCSS commented 1 year ago

> Thank you for your questions. I have added some details on processing the pre-training dataset to the README.md; I hope this helps you understand the pre-processing.

Thanks for your excellent work and patient feedback. Could you please release the processed pre-training data for better reproducibility?

SilyRab commented 1 year ago

For samples where the twitter_nlp tool extracted no entities (aspect terms), did you delete them or handle them in some other way?

SilyRab commented 1 year ago

Roughly how large is the final pre-training dataset?

lyhuohuo commented 1 year ago

1. For samples where twitter_nlp extracted no entities, we keep them in pre-training and treat the aspect annotation as empty, since samples without entities also occur in the downstream tasks.
2. The pre-training set contains roughly 17,000+ samples.
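
For concreteness, here is a minimal sketch of the empty-entity handling described in point 1 above. It is not the repository's actual code; the sample schema (the `text` and `aspects` fields) is a hypothetical assumption.

```python
# Sketch: keep pre-training samples whose NER step found no entities,
# normalizing their aspect annotation to an empty list instead of
# dropping them (hypothetical schema, not the repo's real format).
def normalize_sample(sample):
    sample["aspects"] = sample.get("aspects") or []
    return sample

samples = [
    {"text": "great food at #Joes", "aspects": ["food"]},
    {"text": "what a day...", "aspects": None},  # twitter_nlp found nothing
]
processed = [normalize_sample(s) for s in samples]
print(processed[1])  # {'text': 'what a day...', 'aspects': []} -> kept
```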