alibaba / FederatedScope

An easy-to-use federated learning platform
https://www.federatedscope.io
Apache License 2.0
1.26k stars 206 forks

total_flops is negative after training #757

Open summer-f opened 6 months ago

summer-f commented 6 months ago

Here is part of the output printed after training finished: (monitor:173) INFO: In worker #1, the system-related metrics are: {'id': 1, 'fl_end_time_minutes': 125.160295, 'total_model_size': 124440576, 'total_flops': -3000, 'total_upload_bytes': 0, 'total_download_bytes': 16470960, 'global_convergence_round': 0, 'local_convergence_round': 0, 'global_convergence_time_minutes': 0, 'local_convergence_time_minutes': 0}. Here total_flops is -3000. What could cause this? Is it a numeric overflow?
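For reference, the metrics dict in a log line like this can be extracted and scanned for negative values with a short script. This is a minimal sketch: the helper names `extract_metrics` and `negative_metrics` are hypothetical and not part of FederatedScope, and the sample line is shortened from the output above.

```python
import ast
import re


def extract_metrics(log_line):
    """Return the dict literal embedded in a log line, or None if absent."""
    # Greedy match from the first '{' to the last '}' on the line.
    match = re.search(r"\{.*\}", log_line)
    if match is None:
        return None
    # literal_eval parses Python literals only; it does not execute code.
    return ast.literal_eval(match.group(0))


def negative_metrics(metrics):
    """Keep only the numeric entries whose value is below zero."""
    return {
        key: value
        for key, value in metrics.items()
        if isinstance(value, (int, float)) and value < 0
    }


# Shortened version of the log line quoted in this issue.
line = (
    "(monitor:173) INFO: In worker #1, the system-related metrics are: "
    "{'id': 1, 'fl_end_time_minutes': 125.160295, 'total_flops': -3000, "
    "'total_upload_bytes': 0, 'total_download_bytes': 16470960}"
)

metrics = extract_metrics(line)
print(negative_metrics(metrics))  # {'total_flops': -3000}
```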

yxdyc commented 6 months ago

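This happens because the default FLOPs counting relies on fvcore and assumes ctx.data_batch = [x, y]. Your setup does not match that assumption, so an exception was raised. In this case FS sets the value to a negative number to signal that the FLOPs were not computed successfully, and you need to implement a suitable flops_per_sample function. You can refer to the implementation in general_torch_trainer and search the codebase for related usages.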

summer-f commented 5 months ago
Thank you for your reply. It cleared up my confusion.

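On 2024-02-29 14:30:09, "Daoyuan Chen" @.***> wrote:

This happens because the default FLOPs counting relies on fvcore and assumes ctx.data_batch = [x, y]; your setup does not match that assumption, so an exception was raised. FS sets the value to a negative number to signal that the FLOPs were not computed successfully, and you need to implement a suitable flops_per_sample function. You can refer to the implementation in general_torch_trainer and search the codebase for related usages.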
