alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0

Distributed and single-machine training produce different model accuracy under identical parameters #273

Open LucasTsui0725 opened 1 year ago

LucasTsui0725 commented 1 year ago

Using the reference code from graph-learn v1.1.0, I moved the model-training part of train_supervised into the worker task of dist_train to test a distributed supervised learning job. The dataset is ogbn-arxiv; for distributed training the node and edge tables were each split evenly into two files, and the cluster was configured as 2 PS + 2 Workers, with all other code and model hyperparameters unchanged. In distributed training the loss drops to around 1.6 and then oscillates (single-machine training reaches around 1). How can this be resolved?
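For reference, the even two-way split of the node and edge tables mentioned above could be produced with a round-robin sharder along these lines. The shard file naming (`node_table0`, `node_table1`, ...) matches the later snippets; the presence of a header row is an assumption, not part of the graph-learn data format:

```python
# Hypothetical sketch: split a node or edge table into per-worker
# shards ("node_table0", "node_table1", ...) by assigning data rows
# round-robin. Assumes the first line is a schema header that every
# shard should keep; set has_header=False if the files have none.
def split_table(path, num_workers, has_header=True):
    with open(path) as f:
        lines = f.readlines()
    header, rows = (lines[:1], lines[1:]) if has_header else ([], lines)
    for i in range(num_workers):
        with open(path + str(i), "w") as out:
            out.writelines(header)                # repeat the header in each shard
            out.writelines(rows[i::num_workers])  # every num_workers-th data row
```

With `num_workers=2` this yields the per-worker files that the snippets below load via `node_table + str(task_index)`.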

LucasTsui0725 commented 1 year ago

I have since tried distributed training with a single PS and multiple Workers, loading the full dataset on each worker:

    # every worker loads the full, unsplit node and edge tables
    train_table = os.path.join(data_folder, 'train_table')
    test_table = os.path.join(data_folder, 'test_table')
    valid_table = os.path.join(data_folder, 'valid_table')
    node_table = os.path.join(data_folder, node_table)
    edge_table = os.path.join(data_folder, edge_table)
    g = gl.Graph() \
          .node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
          .edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
          .node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
          .node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
          .node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)

The results were essentially the same as single-machine training. In multi-PS, multi-Worker distributed training, I used the following instead:

    train_table = os.path.join(data_folder, 'train_table')
    test_table = os.path.join(data_folder, 'test_table')
    valid_table = os.path.join(data_folder, 'valid_table')
    # each worker loads only its own shard of the node and edge tables
    node_table = os.path.join(data_folder, node_table + str(task_index))
    edge_table = os.path.join(data_folder, edge_table + str(task_index))
    g = gl.Graph() \
          .node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
          .edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
          .node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
          .node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
          .node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)

which produced the discrepancy described above. Do train_table, valid_table, and test_table also need to be pre-split per worker, like the edge-table data, i.e.:

    # train/valid/test tables pre-split per worker as well
    train_table = os.path.join(data_folder, 'train_table_' + str(task_index))
    test_table = os.path.join(data_folder, 'test_table_' + str(task_index))
    valid_table = os.path.join(data_folder, 'valid_table_' + str(task_index))
    node_table = os.path.join(data_folder, node_table + str(task_index))
    edge_table = os.path.join(data_folder, edge_table + str(task_index))
    g = gl.Graph() \
          .node(node_table, node_type=node_type, decoder=gl.Decoder(labeled=True, attr_types=['float'] * args.input_dim, attr_delimiter=":")) \
          .edge(edge_table, edge_type=(node_type, node_type, edge_type), decoder=gl.Decoder(weighted=True), directed=False) \
          .node(train_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TRAIN) \
          .node(valid_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.VAL) \
          .node(test_table, node_type=node_type, decoder=gl.Decoder(weighted=True), mask=gl.Mask.TEST)
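Whichever split scheme is used, a quick sanity check is to confirm that the per-worker shards together cover the original table exactly once, since a missing or duplicated row in the split would itself explain an accuracy gap. A minimal sketch, assuming the shard naming used in the snippets above:

```python
from collections import Counter

# Hypothetical sanity check: do the per-worker shards of a table
# cover the full table exactly once (no rows lost or duplicated)?
def shards_cover_table(full_path, num_workers, has_header=True):
    def rows(path):
        with open(path) as f:
            lines = f.readlines()
        return lines[1:] if has_header else lines  # skip the schema header
    merged = Counter()
    for i in range(num_workers):
        merged.update(rows(full_path + str(i)))  # shard files: path + "0", "1", ...
    return merged == Counter(rows(full_path))
```

Running this against the unsplit tables and their shards before training would rule the data split in or out as the source of the loss gap.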