Tribleave / SCAPT-ABSA

Code for EMNLP 2021 paper: "Learning Implicit Sentiment in Aspect-based Sentiment Analysis with Supervised Contrastive Pre-Training"
MIT License

Reproducibility issue: performance gap #1

Open CreaterLL opened 2 years ago

CreaterLL commented 2 years ago

I ran BERTAsp+SCAPT on the Restaurant dataset and my results are several points below the paper, but when I use the officially released weights directly, the results match the paper exactly. I'm not sure which step of my own pre-training + fine-tuning went wrong.
Running the code as-is, both pre-training and fine-tuning produce many checkpoints, and I always picked the last one. Could that be the problem?

Tribleave commented 2 years ago

Hi, regarding possible issues in the pre-training stage, I'd like to know your hardware setup and batch size. If you changed anything, please check that the loss decreases and converges. As for checkpoints, we fine-tune every checkpoint from the last 4 epochs and select the model from those runs; I suspect this is one reason for the difference. In our experiments we also found that, because the fine-tuning datasets are quite small, performance fluctuates noticeably between checkpoints.
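
For illustration, the selection loop described above might look roughly like the following sketch. The checkpoint naming and the `finetune` / `evaluate` helpers are hypothetical placeholders, not the actual scripts in this repository.

```python
# Hypothetical sketch of the checkpoint-selection strategy described above:
# fine-tune every pre-training checkpoint from the last few epochs and keep
# the one whose fine-tuned dev accuracy is highest.
from pathlib import Path

def select_best_checkpoint(ckpt_dir, last_n_epochs=4):
    # assume checkpoints are named like "epoch{E}_step{S}.pt"
    ckpts = sorted(Path(ckpt_dir).glob("epoch*_step*.pt"))
    epochs = sorted({int(p.stem.split("_")[0][len("epoch"):]) for p in ckpts})
    kept_epochs = set(epochs[-last_n_epochs:])
    candidates = [p for p in ckpts
                  if int(p.stem.split("_")[0][len("epoch"):]) in kept_epochs]

    best_ckpt, best_acc = None, 0.0
    for ckpt in candidates:
        model = finetune(ckpt)          # hypothetical: run aspect-level fine-tuning
        acc = evaluate(model, "dev")    # hypothetical: accuracy on the held-out split
        if acc > best_acc:
            best_ckpt, best_acc = ckpt, acc
    return best_ckpt, best_acc
```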

CreaterLL commented 2 years ago

Hi, my setup: a GeForce RTX 3090 (24 GB); the virtual environment follows the GitHub instructions, and the batch size and all other parameters come straight from the yml files without modification. Regarding "we fine-tune every checkpoint from the last 4 epochs and select the model from those runs": does this mean you pick the checkpoint whose fine-tuned result is best among the last 4 epochs, or is there something more to it? Also, while running the experiments I noticed that the following two lines in train.py seem slightly off; the checkpoint path is never written into config:
    if 'checkpoint' in config:
        config['checkpoint'] = args.checkpoint
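
For reference, one plausible shape of the fix is to key the override on the command-line argument rather than on the existing config contents; the actual fix the maintainer later pushed may differ.

```python
# Hypothetical fix: apply the command-line override whenever --checkpoint is
# given, instead of requiring the 'checkpoint' key to already exist in config.
if args.checkpoint is not None:
    config['checkpoint'] = args.checkpoint
```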

Tribleave commented 2 years ago

Hi, yes, that is exactly our model-selection strategy. In our experiments we also found that the last checkpoint (or the last few) is not necessarily the best one. I've updated the code for the issue you reported, thanks for the feedback. That was an oversight of mine when cleaning up the code!

CreaterLL commented 2 years ago

OK, thanks!

StudentWorker commented 2 years ago

So this configuration does run for you? I keep getting out-of-memory errors.

StudentWorker commented 2 years ago

I'd like to know how much GPU memory this needs to get through; I'm currently on a 1060 with 3 GB.

CreaterLL commented 2 years ago

I'd like to know how much GPU memory this needs to get through; I'm currently on a 1060 with 3 GB.

I still can't reproduce the numbers in the paper. I wouldn't recommend trying with that setup; 3 GB of GPU memory is almost certainly not enough.

Tribleave commented 2 years ago

I'd like to know how much GPU memory this needs to get through; I'm currently on a 1060 with 3 GB.

Our experiments were run on a 3090; pre-training with 3 GB of GPU memory is not really feasible.

ljzsky commented 2 years ago

I didn't change any configuration in the code. How many pre-training steps on Yelp are needed so that fine-tuning on the Res14 dataset reaches the results in the paper? I tried the checkpoint at every 20,000 steps; the best one, at 140,000 steps, reached 86.7%, which is only about 1% higher than without pre-training.

Tribleave commented 2 years ago

I didn't change any configuration in the code. How many pre-training steps on Yelp are needed so that fine-tuning on the Res14 dataset reaches the results in the paper? I tried the checkpoint at every 20,000 steps; the best one, at 140,000 steps, reached 86.7%, which is only about 1% higher than without pre-training.

In our experiments, BERT pre-trains fairly quickly under SCAPT; the Yelp model we released is the one obtained at the 5th epoch. A gain of only 1% over no pre-training suggests something is off.

nitkannen commented 2 years ago

Hi @Tribleave, hope you are doing well! I have a query about how the sentiment labels of the MAMS dataset were decided during contrastive pre-training. I ask because a MAMS sentence is not necessarily only positive or negative; it can contain multiple sentiments for different aspects, right? So how was the contrastive loss minimized when a single sentence label has to be decided in such cases? Would love to know your view. Thanks!

Tribleave commented 2 years ago

Hi @Tribleave, hope you are doing well! I have a query about how the sentiment labels of the MAMS dataset were decided during contrastive pre-training. I ask because a MAMS sentence is not necessarily only positive or negative; it can contain multiple sentiments for different aspects, right? So how was the contrastive loss minimized when a single sentence label has to be decided in such cases? Would love to know your view. Thanks!

Hi @nitkannen It's an interesting problem, because our pre-training method is applied to a sentence-level corpus, yet our models still work well on MAMS. We guess there are several sources of aspect-level knowledge (a sketch of the sentence-level contrastive objective itself follows the list):

  1. The masked aspect prediction objective. A model pre-trained with this task becomes sensitive to the aspects mentioned in reviews.
  2. Aspect-aware fine-tuning. The fine-tuning method brings the aspect information into the classification, and we think this is the main reason the model can handle the multi-aspect scenario.
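
To make the sentence-level objective concrete, below is a minimal sketch of a SupCon-style supervised contrastive loss over sentence embeddings and sentence-level sentiment labels. This is a generic formulation for illustration, not necessarily the exact loss, temperature, or implementation used in SCAPT.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic SupCon-style loss over sentence representations.

    features: (N, d) sentence embeddings, one per review sentence
    labels:   (N,) sentence-level sentiment labels (e.g., 0/1 for neg/pos)
    """
    z = F.normalize(features, dim=1)                      # unit-norm embeddings
    sim = z @ z.T / temperature                           # (N, N) similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))       # exclude self-pairs

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    # average log-probability over positives (same sentiment label), per anchor
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss.mean()
```

Under such a loss every sentence carries a single (weak) sentiment label, which is why the pre-training corpus has to be reduced to sentence-level examples in the first place.
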
nitkannen commented 2 years ago

Hi @Tribleave Thanks for your reply! It makes sense how the model has learnt aspect-level information. Could you specify how you collected the sentence-level sentiment corpus for pre-training? The paper mentions that SemEval-14 and MAMS data were used, but these are not sentence-level corpora, right? Were only the sentences with a single sentiment picked to construct the corpus? Thanks.

Tribleave commented 2 years ago

Hi @nitkannen, Sorry for the late reply.

Our pre-training corpus for SemEval Restaurant and MAMS is collected from Yelp, and we use a corpus collected from Amazon for SemEval Laptop (see https://github.com/Tribleave/SCAPT-ABSA#for-pre-training). The ABSA datasets themselves are not used in pre-training.

As you mentioned, the original Yelp and Amazon reviews are document-level, and we preprocessed them into sentence-level data. You can find the details in Sec. 4 of the paper, in the "Retrieved External Corpora" part (I think it is clear enough).
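
As a rough illustration only of turning document-level reviews into sentence-level examples: the JSON field names and filters below are assumptions, and the actual retrieval and filtering described in Sec. 4 of the paper are more involved.

```python
import json
import nltk

def documents_to_sentences(yelp_review_file):
    """Illustrative sketch: split document-level reviews into sentence-level
    examples carrying a weak sentiment label derived from the star rating."""
    nltk.download('punkt', quiet=True)
    examples = []
    with open(yelp_review_file) as f:
        for line in f:
            review = json.loads(line)              # assumed fields: 'text', 'stars'
            if review['stars'] == 3:               # skip ambiguous/neutral reviews
                continue
            label = 'positive' if review['stars'] >= 4 else 'negative'
            for sent in nltk.sent_tokenize(review['text']):
                if 5 <= len(sent.split()) <= 100:  # crude length filter
                    examples.append({'sentence': sent, 'label': label})
    return examples
```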

Thanks for your question, and feel free to ask if anything is still unclear.

nitkannen commented 2 years ago

Hi, @Tribleave Thanks for your reply. What is your recommendation for the optimal number of epochs to train BERT or T5? It looks like performance peaks after a certain number of epochs but decreases if the contrastive loss is allowed to drop further. Any idea why this happens? Thanks in advance!