ValHayot / sea_fuse


Data locality #2

Open glatard opened 5 years ago

glatard commented 5 years ago

Hi @ValHayot, two questions about data locality:

  1. Do you think that your file system would also track the node where a given file has a local or in-memory copy? That would be super useful to schedule tasks where the data is, but it could also be a performance hassle since you would have to keep this registry consistent. Or maybe not, if inconsistencies only lead to extra data transfers rather than failures. In any case, that would be a critical registry to design (see the registry sketch after this list).
  2. What do you think of a poor-person's implementation of locality that would just pre-process the current task graph and cluster tasks with shared file dependencies as much as possible (see the clustering sketch after this list)? It might work in 80% of the cases and might remove the need for an overlay cluster. Expired walltimes wouldn't lead to complete recomputations, since intermediate results wouldn't have to be recomputed.
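
A rough sketch of what the registry in point 1 could look like, assuming a simple in-memory mapping from file path to the nodes believed to hold a copy (the class and method names here are made up, not part of sea_fuse). Stale entries would only cost an extra transfer from the shared dir, never a failure:

```python
from collections import defaultdict

class LocalityRegistry:
    """Best-effort map from file path to the nodes believed to hold a copy."""

    def __init__(self):
        self._copies = defaultdict(set)  # path -> set of node names

    def record_copy(self, path, node):
        """Called when a node writes or caches a file locally or in memory."""
        self._copies[path].add(node)

    def drop_copy(self, path, node):
        """Called when a node evicts or deletes its copy."""
        self._copies[path].discard(node)

    def candidate_nodes(self, path):
        """Answer used by the scheduler; may be stale, which is tolerable."""
        return set(self._copies[path])


registry = LocalityRegistry()
registry.record_copy("/sea/tmp/file1", "node-a")
print(registry.candidate_nodes("/sea/tmp/file1"))  # {'node-a'}
```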
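
And a minimal sketch of the "poor-person's locality" in point 2, assuming each task is described by the set of files it touches: a union-find pass over the task graph groups tasks connected through a shared file, so each group can be submitted to the same node (task and file names are made up):

```python
def cluster_by_files(tasks):
    """tasks: dict mapping task name -> set of file dependencies."""
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    def union(a, b):
        parent[find(a)] = find(b)

    # Link any two tasks that touch the same file.
    file_owner = {}
    for task, files in tasks.items():
        for f in files:
            if f in file_owner:
                union(task, file_owner[f])
            else:
                file_owner[f] = task

    clusters = {}
    for t in tasks:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())


example = {
    "t1": {"file1"},
    "t2": {"file1", "file2"},
    "t3": {"file3"},
}
print(cluster_by_files(example))  # [['t1', 't2'], ['t3']]
```
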
ValHayot commented 5 years ago
  1. was part of the plan, but I don't know if it'd deteriorate performance. It shouldn't be too hard to keep it up-to-date. Anyway, you'd technically have to know where the data is: if one node is processing data that is found on two distinct nodes and the data is not found on the shared dir, then it'd have to wait until the file gets asynchronously flushed. Maybe symlinks might be useful here.
  2. It's fine as a first draft, but I'm not sure it's a viable solution for the long term. I think you want to schedule tasks to the node where the largest amount of the needed data resides (see the sketch after the snippet below). Imagine you have a scenario like this:
    
    task_dosomething(file1, file2)
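
For illustration, a hedged sketch of the "largest amount of data" heuristic mentioned above, assuming we know (or can ask the registry) where each input currently lives and how big it is; the file names, sizes, and node names are invented:

```python
def pick_node(input_files, locations, sizes):
    """locations: file -> node, sizes: file -> bytes; returns the node holding
    the most input bytes, or None when no input is cached anywhere."""
    bytes_per_node = {}
    for f in input_files:
        node = locations.get(f)
        if node is not None:
            bytes_per_node[node] = bytes_per_node.get(node, 0) + sizes.get(f, 0)
    return max(bytes_per_node, key=bytes_per_node.get) if bytes_per_node else None


locations = {"file1": "node-a", "file2": "node-b"}
sizes = {"file1": 5 * 2**30, "file2": 100 * 2**20}  # 5 GiB vs 100 MiB
print(pick_node(["file1", "file2"], locations, sizes))  # node-a
```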