Closed leepengcheng closed 3 years ago
@leepengcheng Sorry for the late reply. Could you provide more info, like:
- Did the dgl-graphsage-partitioner Pod work as expected, and was its final status Completed?
- The dgl-graphsage-partitioner Pod log.
- Were the dgl-graphsage-worker Pods created at first as expected, but failed or evicted before the main container of dgl-graphsage-launcher was running?

The regular lifecycle is like:
1. dgl-graphsage-partitioner Pod has a running status, which means the main container of the Pod is running. dgl-graphsage-launcher is running.
2. dgl-graphsage-partitioner Pod has a completed status. dgl-graphsage-worker Pods are intended to be created. dgl-graphsage-launcher is running.
3. dgl-graphsage-launcher Pod has a running status, which means the main container of the Pod is running. dgl-graphsage-worker Pods have a running status.

@ryantd
The ogbn dataset is too slow, so I replaced it with the cora_v2 dataset. The dgl-graphsage-partitioner Pod has an Error status.
dgl-graphsage-partitioner pod log:
Phase 1/5: load and partition graph
----------
[14:07:14] /opt/dgl/src/graph/transform/metis_partition_hetero.cc:73: Partition a graph with 2708 nodes and 10556 edges into 2 parts and get 323 edge cuts
Using backend: pytorch
/usr/local/lib/python3.6/site-packages/dgl/data/utils.py:285: UserWarning: Property dataset.num_labels will be deprecated, please use dataset.num_classes instead.
warnings.warn('Property {} will be deprecated, please use {} instead.'.format(old, new))
Partition arguments: Namespace(balance_edges=True, balance_train=True, dataset_url='http://snap.stanford.edu/ogb/data/nodeproppred/products.zip', graph_name='graphsage', num_parts=2, output='/dgl_workspace/dataset', part_method='metis', rel_data_path='dataset', undirected=False, workspace='/dgl_workspace')
NumNodes: 2708
NumEdges: 10556
NumFeats: 1433
NumClasses: 7
NumTrainingSamples: 140
NumValidationSamples: 500
NumTestSamples: 1000
Done loading data from cached files.
load 'ogbn-products' takes 0.141 seconds
|V|=2708, |E|=10556
train: 140, valid: 500, test: 1000
Convert a graph into a bidirected graph: 0.003 seconds
Construct multi-constraint weights: 0.006 seconds
Metis partitioning: 0.003 seconds
Reshuffle nodes and edges: 0.041 seconds
Split the graph: 0.001 seconds
Construct subgraphs: 0.005 seconds
part 0 has 1611 nodes and 1394 are inside the partition
part 0 has 5353 edges and 5353 are inside the partition
part 1 has 1513 nodes and 1314 are inside the partition
part 1 has 5203 edges and 5203 are inside the partition
Save partitions: 0.193 seconds
There are 10556 edges in the graph and 0 edge cuts for 2 partitions.
############# partition_graph ###############
----------
Phase 1/5 finished
Phase : 2 seconds
Total : 2 seconds
----------
Phase 2/5: deliver partitions
----------
Launch arguments: Namespace(cmd_type='copy_batch_container', container='watcher-loop-partitioner', ip_config='/etc/dgl/leadfile', num_parts=None, num_samplers=0, num_server_threads=1, num_servers=None, num_trainers=None, part_config=None, source_file_paths='/dgl_workspace/dataset', target_dir='/dgl_workspace', worker_chief_index=0, workspace='/dgl_workspace'), []
Traceback (most recent call last):
  File "tools/launch.py", line 278, in <module>
    main()
  File "tools/launch.py", line 250, in main
    run_cp_container(args)
  File "tools/launch.py", line 98, in run_cp_container
    for pod_info in get_ip_host_pairs(args.ip_config):
  File "tools/launch.py", line 62, in get_ip_host_pairs
    raise RuntimeError("Format error of ip_config.")
RuntimeError: Format error of ip_config.
----------
Phase 2/5 error raised
@ryantd
I solved this problem by adding 'time.sleep(10)', because the partitioner Pod finished too quickly (I mounted a local cora dataset) 😱
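A fixed sleep works but depends on timing. As a rough sketch of an alternative, the partitioner could poll until the ip_config file has content before starting phase 2. This assumes the "Format error of ip_config." error appears because the file is not yet populated when the partitioner finishes early; the thread does not confirm that. The helper name `wait_for_nonempty_file` and the "non-empty" readiness check are hypothetical and are not part of dgl-operator or DGL.

```python
import os
import time


def wait_for_nonempty_file(path, timeout=60.0, interval=0.5):
    """Poll until `path` exists and is non-empty, or raise TimeoutError.

    Hypothetical helper: "non-empty" is only a stand-in for whatever
    readiness condition the real ip_config file has to satisfy.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if os.path.getsize(path) > 0:
                return
        except OSError:
            pass  # the file does not exist yet
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{path} was not populated within {timeout} seconds"
            )
        time.sleep(interval)
```

For example, calling `wait_for_nonempty_file("/etc/dgl/leadfile")` (the ip_config path shown in the launch arguments in the log) before the delivery step would replace the fixed 10-second wait with a bounded one.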
leepengcheng
How did you mount the data? I am currently mounting the data and code via NFS, but they cannot be found on the partitioner node. How can I mount them on the partitioner node?
apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  partitionMode: DGL-API
  cleanPodPolicy: Running
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: k3d-sfai-registry:39203/examples:graphsage-dist
            name: dgl-graphsage
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                ephemeral-storage: 10Gi
              limits:
                ephemeral-storage: 15Gi
            command:
            - dglrun
            args:
            - --graph-name
            - graphsage
            # partition
            - --partition-entry-point
            - code/load_and_partition_graph.py
            - --num-partitions
            - "2"
            - --balance-train
            - --balance-edges
            - --dataset-url
            - http://snap.stanford.edu/ogb/data/nodeproppred/products.zip
            # training
            - --train-entry-point
            - code/train_dist.py
            - --num-epochs
            - "1"
            - --batch-size
            - "1000"
            - --num-trainers
            - "1"
            - --num-samplers
            - "4"
            - --num-servers
            - "1"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: k3d-sfai-registry:39203/examples:graphsage-dist
            name: dgl-graphsage
            imagePullPolicy: IfNotPresent
            resources:
              requests:
                memory: 15Gi
                cpu: "2"
                ephemeral-storage: 10Gi
              limits:
                memory: 20Gi
                cpu: "4"
                ephemeral-storage: 15Gi
            volumeMounts:
            - mountPath: /root/.dgl/ogbn_products
              name: ogbn
            - mountPath: /dgl_workspace/code
              name: code
          volumes:
          - name: ogbn
            hostPath:
              path: /home/dgl/data/ogbn_products
              type: Directory
          - name: code
            hostPath:
              path: /home/dgl/dgl/examples/GraphSAGE_dist/code
              type: Directory
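You can use this as a reference. I set my cluster up with K3S, so the data has to be mounted twice; note the difference between the file paths on the host and the paths inside the container.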
error of dgl-graphsage-launcher pod
error of dgl-operator pod