fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
https://gitee.com/fastnlp/fastNLP
Apache License 2.0

On documentation improvements and callback compatibility #386

Open ccyousa opened 2 years ago

ccyousa commented 2 years ago

Describe the bug

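  1. DistTrainer and Trainer differ in more than distributed versus single-process training: their support for the various built-in callbacks also differs significantly. I suggest documenting which built-in callbacks each of the two trainers supports, because some callbacks only hit a bug after several epochs have already run.

  2. When a save path is specified, (Dist)Trainer saves the model, but there is no option to save only the parameters, and saving the full model easily triggers a pickling error.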
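For comparison, the usual PyTorch way to avoid pickling the whole model is to save `model.state_dict()` instead of the model object (`torch.save(model.state_dict(), path)` rather than `torch.save(model, path)`), since `torch.save` relies on pickle. The sketch below illustrates why the parameters-only route avoids the error, using only the standard-library `pickle` module so it runs without PyTorch. `ToyModel`, `save_full_model`, and `save_params_only` are hypothetical names, not fastNLP or PyTorch APIs, and the `threading.Lock` attribute is only an assumed stand-in for whatever unpicklable object a real model might hold.

```python
import pickle
import threading


class ToyModel:
    """Hypothetical stand-in for a model; not a fastNLP or PyTorch class."""

    def __init__(self):
        # Parameters are plain, picklable data.
        self.params = {"linear.weight": [0.5, -0.25], "linear.bias": [0.1]}
        # Assumed unpicklable attribute (a lock, open file, handle, ...).
        self._lock = threading.Lock()

    def state_dict(self):
        # Copy only the parameters, leaving unpicklable attributes behind.
        return {name: list(values) for name, values in self.params.items()}

    def load_state_dict(self, state):
        self.params = {name: list(values) for name, values in state.items()}


def save_full_model(model):
    """Pickle the whole object graph; fails if any attribute is unpicklable."""
    return pickle.dumps(model)


def save_params_only(model):
    """Pickle just the parameter dict."""
    return pickle.dumps(model.state_dict())


if __name__ == "__main__":
    model = ToyModel()
    try:
        save_full_model(model)
    except (pickle.PicklingError, TypeError, AttributeError) as err:
        print("full-model save failed:", err)

    blob = save_params_only(model)
    restored = ToyModel()
    restored.load_state_dict(pickle.loads(blob))
    print(restored.params == model.params)  # True
```

The trade-off is that a parameters-only file needs the model class to be constructed in code before loading, which is exactly what keeps the saved file free of arbitrary Python objects.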

To Reproduce

For example, using SaveModelCallback with DistTrainer fails:

  ....
  File "/path_to_env/.conda/envs/pt19/lib/python3.8/site-packages/fastNLP/core/callback.py", line 1089, in on_valid_end
    self._save_this_model(metric_value)
  File "/path_to_env/.conda/envs/pt19/lib/python3.8/site-packages/fastNLP/core/callback.py", line 1112, in _save_this_model
    save_pair, delete_pair = self._insert_into_ordered_save_models((metric_value, name))
  File "//path_to_env/.conda/envs/pt19/lib/python3.8/site-packages/fastNLP/core/callback.py", line 1098, in _insert_into_ordered_save_models
    if not self.trainer.increase_better and _pair[0]<=pair[0]:
AttributeError: 'DistTrainer' object has no attribute 'increase_better'
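The traceback shows `_insert_into_ordered_save_models` reading `self.trainer.increase_better`, an attribute `DistTrainer` does not define. One way to decouple such bookkeeping from the trainer class is to store the metric direction on the callback itself. Below is a minimal sketch of that top-k logic, assuming a hypothetical `TopKCheckpoints` helper (not part of fastNLP) that, like the `save_pair, delete_pair` in the traceback, reports which checkpoint to save and which to delete.

```python
from typing import List, Optional, Tuple

# (metric value, checkpoint name), mirroring the pairs in the traceback.
Pair = Tuple[float, str]


class TopKCheckpoints:
    """Hypothetical helper (not fastNLP API): track the k best checkpoints.

    The metric direction lives on the helper, so it never reads
    trainer.increase_better and behaves the same for any trainer class.
    """

    def __init__(self, k: int, increase_better: bool = True) -> None:
        self.k = k
        self.increase_better = increase_better
        self.pairs: List[Pair] = []  # sorted so that pairs[0] is the worst kept

    def _better(self, a: float, b: float) -> bool:
        # Strictly better: higher for accuracy-like metrics, lower for losses.
        return a > b if self.increase_better else a < b

    def _resort(self) -> None:
        # Ascending when higher is better, descending when lower is better,
        # so the worst kept checkpoint is always first.
        self.pairs.sort(key=lambda p: p[0], reverse=not self.increase_better)

    def insert(self, pair: Pair) -> Tuple[Optional[Pair], Optional[Pair]]:
        """Return (pair_to_save, pair_to_delete); either may be None."""
        if self.k <= 0:
            return None, None
        if len(self.pairs) < self.k:
            self.pairs.append(pair)
            self._resort()
            return pair, None
        worst = self.pairs[0]
        if self._better(pair[0], worst[0]):
            self.pairs[0] = pair
            self._resort()
            return pair, worst
        return None, None
```

For example, with `k=2` and `increase_better=True`, inserting scores 0.5, 0.7, then 0.6 keeps the 0.7 and 0.6 checkpoints and asks for the 0.5 one to be deleted; a later 0.4 is neither saved nor triggers a deletion.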

Also, are there any plans to update fastNLP? It looks like it has not been updated in a long time.

yhcc commented 2 years ago

Yes, we are also planning to merge DistTrainer and Trainer, so that single-machine and multi-GPU training are launched from the same code.