Open YouNotWalkAlone opened 7 months ago
How did you solve problem #19? I am stuck on it.
I gave up on solving that problem and have been generating the data in minibatches instead. The new problem I'm having now is that when the model learns from other datasets, the accuracy on its validation and test sets can't reach around 0.7.
I don't have any idea how to solve this problem.
Hello, I saw your reply in #19, and I'm guessing you are also Chinese, so I'll write to you directly in Chinese here. I saw that in his paper the accuracy on the FFmpeg and QEMU datasets is also only around 0.7, so I think you must have reproduced it successfully. My question is: with the datasets he provided (FFmpeg and QEMU), when I lift the limit in the select function and use the whole dataset for the reproduction, it freezes on the fourth slice. I don't know whether you have run into this problem, and whether you have solved it?
Hello? Did you solve this problem?
Sorry, I didn't see this reply. I can't see your image, but for the Linux freeze problem my approach is to generate the pkl files in small batches, because I also couldn't solve the freezes caused by batches that are too large.
My overall approach is about the same as the other person's: limit it to around 200-400 to avoid the freeze. That is what the Linux VM can handle; going beyond that, it freezes. I don't know whether this will help you.
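For reference, a minimal sketch of what generating the pkl files in small batches could look like, assuming the preprocessed samples are already in memory as a list. The function name, the chunk size of 300 (inside the 200-400 range mentioned above), and the file naming are placeholders, not the repository's actual script:

```python
import pickle

def dump_in_chunks(samples, out_prefix="cpg_slice", chunk_size=300):
    # Split the in-memory dataset into small slices and write each slice to
    # its own .pkl file, so no single dump is large enough to freeze the VM.
    for i in range(0, len(samples), chunk_size):
        chunk = samples[i:i + chunk_size]
        path = f"{out_prefix}_{i // chunk_size}.pkl"
        with open(path, "wb") as f:
            pickle.dump(chunk, f)
        print(f"wrote {len(chunk)} samples to {path}")
```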
I saw that the other person solved it by modifying the select function. This is my reply: https://github.com/epicosy/devign/issues/19#issuecomment-2106168341. Please take a look and see whether my understanding is correct.
His model can achieve normal binary classification performance if the dataset is replaced. First, on his dataset, the loss gets stuck at 0.69 because of the sigmoid function; adding a BN layer solves that. Second, once you solve the loss-stuck-at-0.69 problem, you will find that the model performs well on the training set but very poorly on the test and validation sets. I solved that by replacing the dataset. So I think the authors may have deliberately provided an erroneous dataset that prevents us from reproducing the results.
Hello! Am I correct in assuming that by a BN layer you mean applying 1-d BatchNorm before the final linear head? I encountered the same problem (loss stuck at 0.68-0.69), and applying 1-d BatchNorm seems to solve it.
Could you please explain why the sigmoid function causes this? Is it because it disturbs gradient flow?
Because the output of the sigmoid function is too close to 0.5, the loss gets stuck near 0.69 (binary cross-entropy with predictions at 0.5 is ln 2 ≈ 0.693). The problem can be solved either with a BN layer or by expanding the variance in some other way, but the BN layer is simpler.
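For illustration, a minimal sketch of the kind of change discussed above: inserting a 1-d BatchNorm before the final linear head so the pre-sigmoid activations keep enough variance. The class name, layer sizes, and input shape here are assumptions for the example, not the repository's actual model code:

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    # Hypothetical classifier head: BatchNorm1d before the final linear layer
    # keeps the pre-sigmoid activations from collapsing toward zero variance,
    # so the sigmoid output is not pinned near 0.5.
    def __init__(self, in_dim=200):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_dim)   # the "BN layer" being discussed
        self.fc = nn.Linear(in_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, in_dim) graph readout vector
        return self.sigmoid(self.fc(self.bn(x))).squeeze(-1)
```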
Thank you for the answer, it helped me a lot. I'm going to create a pull request regarding this issue, since it seems like this BatchNorm layer should be included by default; if you can look at it later, I would be grateful.