Qihoo360 / dgl-operator

The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes
Apache License 2.0

No such file or directory: '/etc/dgl/hostfile' #15

Closed: leepengcheng closed this issue 3 years ago

leepengcheng commented 3 years ago

After deploying examples/v1alpha1/GraphSAGE_dist.yaml:

Error from the dgl-graphsage-launcher pod:

Phase 3/5: dispatch partitions
----------
Traceback (most recent call last):
File "tools/dispatch.py", line 102, in 
main()
File "tools/dispatch.py", line 44, in main
with open(args.ip_config) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/etc/dgl/hostfile'
----------
Phase 3/5 error raised

Error from the dgl-operator pod:

2021-08-14T09:45:48.716Z    INFO    controllers.DGLJob  Finished reconciling job    {"dgljob": "dgl-operator/dgl-graphsage", "dgl-operator/dgl-graphsage": "80.81µs"}
2021-08-14T09:45:48.722Z    ERROR   controllers.DGLJob  unable to fetch DGLJob  {"dgljob": "dgl-operator/dgl-graphsage", "error": "DGLJob.qihoo.net \"dgl-graphsage\" not found"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132
github.com/Qihoo360/dgl-operator/controllers.(*DGLJobReconciler).Reconcile
/workspace/controllers/dgljob_controller.go:115
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.2/pkg/internal/controller/controller.go:297
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.2/pkg/internal/controller/controller.go:252
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.2/pkg/internal/controller/controller.go:215
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.20.5/pkg/util/wait/wait.go:99
ryantd commented 3 years ago

@leepengcheng Sorry for the late reply. Could you provide more info, such as:

  1. Did the dgl-graphsage-partitioner Pod work as expected, and was its final status Completed?
  2. Please provide the dgl-graphsage-partitioner Pod log.
  3. Were the dgl-graphsage-worker Pods created as expected at first, but then failed or got evicted before the main container of dgl-graphsage-launcher started running?

The regular lifecycle looks like this (a status-checking sketch follows the list):

  1. Partitioning phase
    1. The dgl-graphsage-partitioner Pod has a Running status, which means the main container of the Pod is running.
    2. An initContainer (the partitioning finish tracker) of dgl-graphsage-launcher is running.
  2. Partitioning finished
    1. The dgl-graphsage-partitioner Pod has a Completed status.
    2. The initContainer (partitioning finish tracker) is completed.
    3. The dgl-graphsage-worker Pods are expected to be created.
    4. An initContainer (the worker readiness tracker) of dgl-graphsage-launcher is running.
  3. Training phase
    1. The dgl-graphsage-launcher Pod has a Running status, which means the main container of the Pod is running.
    2. All dgl-graphsage-worker Pods have a Running status.
    3. The distributed training is launched.
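
For quick triage against this lifecycle, one option is to print each job Pod's phase and initContainer state with the official `kubernetes` Python client. This is only a minimal sketch: the `dgl-operator` namespace and the `dgl-graphsage` name prefix come from the example manifest, and matching Pods by name prefix rather than label selectors is an assumption.

```python
# triage_pods.py: print the phase of every Pod belonging to the dgl-graphsage
# job, plus the state of any initContainers (the trackers described above).
# Minimal sketch; namespace and name prefix are assumptions, not operator API.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a Pod
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("dgl-operator").items:
    name = pod.metadata.name
    if not name.startswith("dgl-graphsage"):
        continue
    # phase is one of Pending / Running / Succeeded / Failed / Unknown
    print(f"{name}: {pod.status.phase}")
    for ic in pod.status.init_container_statuses or []:
        print(f"  initContainer {ic.name}: ready={ic.ready}")
```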
leepengcheng commented 3 years ago

@ryantd

Downloading the ogbn dataset is too slow, so I replaced it with the cora_v2 dataset. The dgl-graphsage-partitioner pod has an Error status.

dgl-graphsage-partitioner pod log:

Phase 1/5: load and partition graph
----------
[14:07:14] /opt/dgl/src/graph/transform/metis_partition_hetero.cc:73: Partition a graph with 2708 nodes and 10556 edges into 2 parts and get 323 edge cuts
Using backend: pytorch
/usr/local/lib/python3.6/site-packages/dgl/data/utils.py:285: UserWarning: Property dataset.num_labels will be deprecated, please use dataset.num_classes instead.
  warnings.warn('Property {} will be deprecated, please use {} instead.'.format(old, new))
Partition arguments: Namespace(balance_edges=True, balance_train=True, dataset_url='http://snap.stanford.edu/ogb/data/nodeproppred/products.zip', graph_name='graphsage', num_parts=2, output='/dgl_workspace/dataset', part_method='metis', rel_data_path='dataset', undirected=False, workspace='/dgl_workspace')
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done loading data from cached files.
load 'ogbn-products' takes 0.141 seconds
|V|=2708, |E|=10556
train: 140, valid: 500, test: 1000
Convert a graph into a bidirected graph: 0.003 seconds
Construct multi-constraint weights: 0.006 seconds
Metis partitioning: 0.003 seconds
Reshuffle nodes and edges: 0.041 seconds
Split the graph: 0.001 seconds
Construct subgraphs: 0.005 seconds
part 0 has 1611 nodes and 1394 are inside the partition
part 0 has 5353 edges and 5353 are inside the partition
part 1 has 1513 nodes and 1314 are inside the partition
part 1 has 5203 edges and 5203 are inside the partition
Save partitions: 0.193 seconds
There are 10556 edges in the graph and 0 edge cuts for 2 partitions.
############# partition_graph ###############
----------
Phase 1/5 finished
Phase : 2 seconds
Total : 2 seconds
----------
Phase 2/5: deliver partitions
----------
Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
Traceback (most recent call last):
  File "tools/launch.py", line 278, in 
    main()
  File "tools/launch.py", line 250, in main
    run_cp_container(args)
  File "tools/launch.py", line 98, in run_cp_container
    for pod_info in get_ip_host_pairs(args.ip_config):
  File "tools/launch.py", line 62, in get_ip_host_pairs
    raise RuntimeError("Format error of ip_config.")
RuntimeError: Format error of ip_config.
----------
Phase 2/5 error raised
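
For context, the RuntimeError is raised by get_ip_host_pairs in tools/launch.py when a line of the leadfile fails to parse. The following is only an illustrative sketch of that kind of check, assuming each line should carry an "<ip> <hostname>" pair; the repo's actual format check may differ:

```python
# Illustrative only: not the repo's actual implementation.
def get_ip_host_pairs(ip_config):
    pairs = []
    with open(ip_config) as f:
        for line in f:
            fields = line.strip().split()
            if len(fields) != 2:
                # A half-written or malformed line lands here, which is
                # consistent with the leadfile being read too early (the
                # race described in the next comment).
                raise RuntimeError("Format error of ip_config.")
            pairs.append((fields[0], fields[1]))
    return pairs
```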
leepengcheng commented 3 years ago

@ryantd

I solved this problem with `time.sleep(10)`, because the partitioner pod finished too quickly (I mounted a local cora dataset) 😱
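
A fixed sleep works around the race but can still be too short on a slow cluster. A more defensive variant is to poll until the hostfile exists and is non-empty before reading it. A minimal sketch follows; the path is taken from the traceback above, while the timeout and interval values are arbitrary assumptions:

```python
import os
import time

def wait_for_file(path, timeout=120.0, interval=1.0):
    """Block until `path` exists and is non-empty; raise if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return
        time.sleep(interval)
    raise TimeoutError(f"{path} was not populated within {timeout}s")

# e.g. before tools/dispatch.py opens its ip_config:
wait_for_file("/etc/dgl/hostfile")
```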

allendred commented 2 years ago

@leepengcheng

How did you mount the data? I'm currently mounting the data and code via NFS, but they can't be found on the partitioner node. How do I mount them on the partitioner node?

leepengcheng commented 2 years ago

> @leepengcheng
>
> How did you mount the data? I'm currently mounting the data and code via NFS, but they can't be found on the partitioner node. How do I mount them on the partitioner node?

apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  partitionMode: DGL-API
  cleanPodPolicy: Running
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: k3d-sfai-registry:39203/examples:graphsage-dist
            name: dgl-graphsage
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                ephemeral-storage: 10Gi
              limits:
                ephemeral-storage: 15Gi
            command:
            - dglrun
            args:
            - --graph-name
            - graphsage
            # partition
            - --partition-entry-point
            - code/load_and_partition_graph.py
            - --num-partitions
            - "2"
            - --balance-train
            - --balance-edges
            - --dataset-url
            - http://snap.stanford.edu/ogb/data/nodeproppred/products.zip
            # training
            - --train-entry-point
            - code/train_dist.py
            - --num-epochs
            - "1"
            - --batch-size
            - "1000"
            - --num-trainers
            - "1"
            - --num-samplers
            - "4"
            - --num-servers
            - "1"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: k3d-sfai-registry:39203/examples:graphsage-dist
            name: dgl-graphsage
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                memory: 15Gi
                cpu: "2"
                ephemeral-storage: 10Gi
              limits:
                memory: 20Gi
                cpu: "4"
                ephemeral-storage: 15Gi
            volumeMounts:
              - mountPath: /root/.dgl/ogbn_products
                name: ogbn
              - mountPath: /dgl_workspace/code
                name: code

          volumes:
          - name: ogbn
            hostPath:
              path: /home/dgl/data/ogbn_products
              type: Directory
          - name: code
            hostPath:
              path: /home/dgl/dgl/examples/GraphSAGE_dist/code
              type: Directory

You can use this as a reference. I set my cluster up with K3s, so everything has to be mounted twice; pay attention to the difference between the file path on the host machine and the path inside the container.