alibaba / clusterdata

cluster data collected from production clusters in Alibaba for cluster management research

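In the PAI data, do instances in the instance table that have the same start_time and ran on the same machine mean that multiple tasks were running on one machine at the same time? #185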

Closed Jackjiayou closed 1 year ago

Jackjiayou commented 1 year ago
image

As shown in the figure, for the data with inst_id = 46442990f9c5da07bb4c399cb5e4e8ab3372ec4c995eccd8063af98a9ef6, is this the GPU sharing described in the paper?

qzweng commented 1 year ago

Hi @Jackjiayou ,

Yes, instances with the same machine value ran on the same machine; if their start_time to end_time intervals overlap, they shared the machine during the overlapping period.
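
The overlap check described above can be sketched in pandas. This is only an illustration on toy data: the column names inst_id, machine, start_time and end_time are the ones mentioned in this thread, while the function name `overlapping_pairs`, the helper column `row`, and the toy values are made up for the example. It assumes start_time and end_time are numeric timestamps.

```python
import pandas as pd

def overlapping_pairs(instances: pd.DataFrame) -> pd.DataFrame:
    """Return pairs of rows on the same machine whose run intervals overlap."""
    cols = ["inst_id", "machine", "start_time", "end_time"]
    # Add a hypothetical "row" column so every record has a unique key,
    # even if an inst_id appears on more than one row.
    rows = instances[cols].reset_index(drop=True).rename_axis("row").reset_index()
    # Pair every row with every other row on the same machine.
    pairs = rows.merge(rows, on="machine", suffixes=("_a", "_b"))
    # Keep each unordered pair once and drop self-pairs.
    pairs = pairs[pairs["row_a"] < pairs["row_b"]]
    # Two intervals overlap when each one starts before the other ends.
    overlap = (pairs["start_time_a"] < pairs["end_time_b"]) & (
        pairs["start_time_b"] < pairs["end_time_a"]
    )
    return pairs[overlap].reset_index(drop=True)

# Toy data (not taken from the trace).
toy = pd.DataFrame(
    {
        "inst_id": ["i1", "i2", "i3", "i4"],
        "machine": ["m1", "m1", "m1", "m2"],
        "start_time": [0, 50, 200, 0],
        "end_time": [100, 150, 300, 100],
    }
)
result = overlapping_pairs(toy)
print(result[["machine", "inst_id_a", "inst_id_b"]])
```

The self-merge grows quadratically with the number of rows per machine, so on the full trace it may be worth filtering to the machines of interest first.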

Regarding GPU sharing, it additionally requires the gpu_name to be the same on that machine. The gpu_name can be retrieved from the table pai_sensor_table; the following entries are for your reference:

In [10]: dfs[dfs.inst_id==inst_id_4644][['worker_name','machine','gpu_name']].sort_values('machine')
Out[10]: 
                                              worker_name                   machine      gpu_name
346867  68557bf274d2d0ad2ad97b1c73c223082bdb377411a6f1...  07a757904c2974820f7f9dce  /dev/nvidia3
346871  9ecd28b62b77cf3c52a9ab4218f18f237f7741a91c9678...  081398694cba03a36ebd1280  /dev/nvidia0
346856  787cace50526cef10c985fac21f27800a0fd8e10395519...  0a77ce47d2dc5f1a13fa9075  /dev/nvidia1
346869  c92b1fadc19f7df8c390807e3e772df030acf8d32c3f26...  1276c88236bd5b94e9d0021a  /dev/nvidia5
346854  e41c40c968860d4fcb5e81470ecf3b5ef2804a89df1e32...  12bcc4fceea93a30d7d0f324  /dev/nvidia5
346870  8953bcfaa1ae98467552e54d45c369f98aaeb3e811b036...  1465a37f156f80e0687d8fff  /dev/nvidia2
346841  87c15f2fc929e1d6e5234adbd179b83895e8698838ec28...  16b3cec68193e8b041dcd447  /dev/nvidia5
346864  b69f3fa916aace7329a530e5f64b9cb07a4e12db8a2869...  2031d5d4fcebfbd4fbab58c9  /dev/nvidia2
346843  554f4f40e33a8332c8a17ebc2f55bae4c99c5ec6da69bd...  2daafba3a48984f15d4f1325  /dev/nvidia1
......
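
To go from the listing above to GPU-sharing candidates, one possible approach is to look for workers that have the same machine and the same gpu_name and whose run intervals overlap. The sketch below assumes start_time and end_time have already been joined onto the sensor rows (that join is not shown) and that gpu_name holds a single device string per row. The function name `gpu_sharing_pairs`, the helper column `row`, and the toy values are hypothetical.

```python
import pandas as pd

def gpu_sharing_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Return worker pairs on the same machine and gpu_name with overlapping runs."""
    cols = ["worker_name", "machine", "gpu_name", "start_time", "end_time"]
    # Hypothetical "row" column gives every record a unique key.
    rows = df[cols].reset_index(drop=True).rename_axis("row").reset_index()
    # Pair up rows that report the same device on the same machine.
    pairs = rows.merge(rows, on=["machine", "gpu_name"], suffixes=("_a", "_b"))
    # Keep each unordered pair once and drop self-pairs.
    pairs = pairs[pairs["row_a"] < pairs["row_b"]]
    # Keep only pairs whose run intervals overlap in time.
    overlap = (pairs["start_time_a"] < pairs["end_time_b"]) & (
        pairs["start_time_b"] < pairs["end_time_a"]
    )
    out_cols = ["machine", "gpu_name", "worker_name_a", "worker_name_b"]
    return pairs.loc[overlap, out_cols].reset_index(drop=True)

# Toy data (not taken from the trace).
toy_gpu = pd.DataFrame(
    {
        "worker_name": ["w1", "w2", "w3", "w4", "w5"],
        "machine": ["m1", "m1", "m1", "m2", "m1"],
        "gpu_name": [
            "/dev/nvidia0",
            "/dev/nvidia0",
            "/dev/nvidia1",
            "/dev/nvidia0",
            "/dev/nvidia0",
        ],
        "start_time": [0, 50, 0, 0, 200],
        "end_time": [100, 150, 100, 100, 300],
    }
)
shared = gpu_sharing_pairs(toy_gpu)
print(shared)
```

In the toy data only w1 and w2 qualify: w3 uses a different device, w4 is on a different machine, and w5 uses the same device but at a later time.
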
Jackjiayou commented 1 year ago

@qzweng In the PAI data, are tasks or instances executed in a particular order? Are there dependencies between tasks or between instances?