Error when running GAMF: DefaultCPUAllocator: can't allocate memory

YCaigogogo commented 1 year ago

Hello developers, and thank you for your excellent work. I have recently been running your GAMF code to perform model fusion on two VGG-11 models, but the CPU appears to run out of memory. The error output is:

Relu Inplace is False
Loaded parameters (file 0): [features.8.weight, features.16.weight, features.3.weight, features.0.weight, features.6.weight, features.18.weight, classifier.weight, features.11.weight, features.13.weight]
Traceback (most recent call last):
  File "/data/yic/LAMDA-ZhiJian/main.py", line 15, in <module>
    trainer = prepare_trainer(args)
  File "/data/yic/LAMDA-ZhiJian/zhijian/trainers/base.py", line 36, in prepare_trainer
    return get_class_from_module(f'zhijian.trainers.{args.training_mode}', 'Trainer')(args, *kwargs)
  File "/data/yic/LAMDA-ZhiJian/zhijian/trainers/model_merging.py", line 134, in __init__
    self.model = core_fn(self.model, merging_models_list)
  File "/data/yic/LAMDA-ZhiJian/zhijian/models/model_merging/method/gamf.py", line 27, in core
    K, params = self.graph_matching_fusion(merge_models_list)
  File "/data/yic/LAMDA-ZhiJian/zhijian/models/model_merging/method/gamf.py", line 53, in graph_matching_fusion
    affinity = torch.zeros([n1 * n2, n1 * n2]).cuda()
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 233797861202500 bytes. Error code 12 (Cannot allocate memory)

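I printed the values of n1 and n2, and both are 2765, so the affinity matrix appears to be far too large (roughly 7.6e6 × 7.6e6, since 2765 × 2765 = 7,645,225). Is this expected, and how can I work around it?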
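The requested size can be checked by hand. The sketch below assumes the matrix holds 4-byte float32 values, which is the default dtype of `torch.zeros`; the helper name `affinity_bytes` is made up for illustration and is not part of GAMF or pygmtools. Because `torch.zeros(...)` is called without a `device` argument, the tensor is first allocated in host memory and only then copied by `.cuda()`, which is consistent with the failure coming from `DefaultCPUAllocator` rather than from CUDA.

```python
def affinity_bytes(n1: int, n2: int, bytes_per_element: int = 4) -> int:
    """Bytes needed for a dense (n1*n2) x (n1*n2) matrix.

    bytes_per_element=4 assumes float32, the default dtype of torch.zeros.
    """
    side = n1 * n2  # rows (and columns) of the affinity matrix
    return side * side * bytes_per_element


requested = affinity_bytes(2765, 2765)
print(requested)  # 233797861202500, the figure in the error message
print(f"{requested / 1024**4:.1f} TiB")
```

The result matches the byte count in the error message exactly, so the allocation is simply a dense matrix of this shape, well over 200 TiB, rather than a corrupted size.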

only-changer commented 1 year ago

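Hi, I could not find the statement from your traceback in my code, so it looks like you have made some modifications to it. In our design, n1 and n2 should be the number of channels in a single layer of the network, typically something like 512 or 1024, which can indeed be somewhat large. Your value of 2765 is odd, though. Perhaps the code you modified uses our global matching method? We do have a method that models all the channels in the network (rather than the channels of each layer) as one large graph and matches that. Because the graph can become too large this way, we later developed the layer-wise matching version. Please check your modified code again.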
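To make the size difference between the two strategies concrete, here is a rough sketch. It assumes a standard VGG-11 layout for a 10-class, 3-channel input task (3 input channels, convolution widths 64, 128, 256, 256, 512, 512, 512, 512, and 10 output classes); these widths are an assumption, not something stated in this thread, although they do happen to sum to the reported 2765. The helper `affinity_bytes` is hypothetical and only counts the bytes of a dense float32 affinity matrix; it says nothing about how the GAMF repository actually stores or computes the affinity.

```python
# Assumed node counts: input channels, VGG-11 conv widths, output classes.
node_counts = [3, 64, 128, 256, 256, 512, 512, 512, 512, 10]


def affinity_bytes(n1: int, n2: int, bytes_per_element: int = 4) -> int:
    """Bytes for a dense (n1*n2) x (n1*n2) float32 matrix (hypothetical helper)."""
    side = n1 * n2
    return side * side * bytes_per_element


# Global matching: every channel of the network is a node in one big graph.
total_nodes = sum(node_counts)
global_bytes = affinity_bytes(total_nodes, total_nodes)

# Layer-wise matching: each layer is matched on its own, so the largest
# single problem is bounded by the widest layer.
largest_layer_bytes = max(affinity_bytes(c, c) for c in node_counts)

print(total_nodes)
print(f"global:     {global_bytes / 1024**3:,.0f} GiB")
print(f"layer-wise: {largest_layer_bytes / 1024**3:,.0f} GiB (widest layer)")
```

Under these assumptions the layer-wise problem is several hundred times smaller than the global one, but a dense matrix for a 512-channel layer would still be very large, so a practical layer-wise implementation presumably avoids materializing it in full.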

YCaigogogo commented 1 year ago

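Hi, I was trying to get this example from your tutorial running: https://pygmtools.readthedocs.io/en/latest/auto_examples/pytorch/plot_model_fusion_pytorch.html . It seems that this example uses the global matching method. If I want to use layer-wise matching instead, how should I modify it?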

rogerwwww commented 1 year ago

> If I want to use layer-wise matching instead, how should I modify it?

You can refer to the implementation in this repository: https://github.com/Thinklab-SJTU/GAMF

Thinklab-SJTU / pygmtools, issue #67