ValHayot / sea_fuse


Data locality #2

Open glatard opened 5 years ago

glatard commented 5 years ago

Hi @ValHayot, two questions about data locality:

  1. Do you think that your file system would also track the node where a given file has a local or in-memory copy? That would be super useful to schedule tasks where the data is, but it could also be a performance hassle since you would have to keep this registry consistent. Or maybe not, if inconsistencies only lead to extra data transfers rather than failures. In any case, that would be a critical registry to design (see the registry sketch after this list).
  2. What do you think of a poor-person's implementation of locality that would just pre-process the current task graph and cluster tasks with shared file dependencies as much as possible (see the clustering sketch after this list)? It might work in 80% of the cases and might remove the need for an overlay cluster. Expired walltimes wouldn't lead to complete recomputations, since intermediate results wouldn't have to be recomputed.
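
A rough sketch of what the registry in point 1 could look like, assuming a simple in-memory mapping from file path to the nodes believed to hold a copy (the class and method names here are made up, not part of sea_fuse). Stale entries would only cost an extra transfer from the shared dir, never a failure:

```python
from collections import defaultdict

class LocalityRegistry:
    """Best-effort map from file path to the nodes believed to hold a copy."""

    def __init__(self):
        self._copies = defaultdict(set)  # path -> set of node names

    def record_copy(self, path, node):
        """Called when a node writes or caches a file locally or in memory."""
        self._copies[path].add(node)

    def drop_copy(self, path, node):
        """Called when a node evicts or deletes its copy."""
        self._copies[path].discard(node)

    def candidate_nodes(self, path):
        """Answer used by the scheduler; may be stale, which is tolerable."""
        return set(self._copies[path])


registry = LocalityRegistry()
registry.record_copy("/sea/tmp/file1", "node-a")
print(registry.candidate_nodes("/sea/tmp/file1"))  # {'node-a'}
```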
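
And a minimal sketch of the "poor-person's locality" in point 2, assuming each task is described by the set of files it touches: a union-find pass over the task graph groups tasks connected through a shared file, so each group can be submitted to the same node (task and file names are made up):

```python
def cluster_by_files(tasks):
    """tasks: dict mapping task name -> set of file dependencies."""
    parent = {t: t for t in tasks}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    def union(a, b):
        parent[find(a)] = find(b)

    # Link any two tasks that touch the same file.
    file_owner = {}
    for task, files in tasks.items():
        for f in files:
            if f in file_owner:
                union(task, file_owner[f])
            else:
                file_owner[f] = task

    clusters = {}
    for t in tasks:
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())


example = {
    "t1": {"file1"},
    "t2": {"file1", "file2"},
    "t3": {"file3"},
}
print(cluster_by_files(example))  # [['t1', 't2'], ['t3']]
```
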
ValHayot commented 5 years ago
  1. was part of the plan, but I don't know if it'd deteriorate performance. It shouldn't be too hard to keep it up-to-date. Anyway, you'd technically have to know where the data is: if one node is processing data that is found on two distinct nodes and the data is not found on the shared dir, then it'd have to wait until the file gets asynchronously flushed. Maybe symlinks might be useful here.
  2. It's fine as a first draft, but I'm not sure it's a viable solution for the long term. I think you want to schedule tasks to the node where the largest amount of the needed data resides (see the sketch after the snippet below). Imagine you have a scenario like this:
    
    task_dosomething(file1, file2)
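
For illustration, a hedged sketch of the "largest amount of data" heuristic mentioned above, assuming we know (or can ask the registry) where each input currently lives and how big it is; the file names, sizes, and node names are invented:

```python
def pick_node(input_files, locations, sizes):
    """locations: file -> node, sizes: file -> bytes; returns the node holding
    the most input bytes, or None when no input is cached anywhere."""
    bytes_per_node = {}
    for f in input_files:
        node = locations.get(f)
        if node is not None:
            bytes_per_node[node] = bytes_per_node.get(node, 0) + sizes.get(f, 0)
    return max(bytes_per_node, key=bytes_per_node.get) if bytes_per_node else None


locations = {"file1": "node-a", "file2": "node-b"}
sizes = {"file1": 5 * 2**30, "file2": 100 * 2**20}  # 5 GiB vs 100 MiB
print(pick_node(["file1", "file2"], locations, sizes))  # node-a
```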