Closed zhihaozheng closed 8 months ago
Makes sense! Now the script only skips the already-processed pairs if it is processing the entirety of the data on one node.
Can you help me understand this change? I have just discovered that my jobs have been recomputing lots of matches that were already done. In my understanding, `pairnames` is only filtered and split up when the slurm jobs start (I say slurm job to distinguish from the jobs launched within a node by the process pool). So in your example, Zhihao (4 pairs, 2 nodes): after `arg_indx`, node1 would get pair1 and pair3, and node2 would get pair2 and pair4. node1 would complete pair1 and then move on to pair3; independently, node2 would start with pair2 and then move on to pair4. Or, if workers > 1, node1 would work on pair1 and pair3 simultaneously, and similarly for node2. I don't understand why we should not still filter out already-done pairs if we are splitting up the work across nodes; the "already done" calculation only happens once per node, at the start.
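To make the split concrete, here is a minimal sketch (not feabas's actual code; `split_pairs` is a hypothetical stand-in for the `arg_indx` slicing described above) of stride-based round-robin assignment, where each of N nodes takes every N-th pair:

```python
def split_pairs(pairnames, node_idx, num_nodes):
    """Round-robin assignment: node node_idx gets pairnames[node_idx::num_nodes]."""
    return pairnames[node_idx::num_nodes]

pairs = ["pair1", "pair2", "pair3", "pair4"]
# With 2 nodes: node 0 gets pair1 & pair3, node 1 gets pair2 & pair4.
print(split_pairs(pairs, 0, 2))  # ['pair1', 'pair3']
print(split_pairs(pairs, 1, 2))  # ['pair2', 'pair4']
```

As long as every node slices the same full list, the assignment is deterministic and disjoint, which is the behavior the comment above assumes.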
I think the issue is that the slurm jobs may not start simultaneously even when submitted together. If some jobs start early and finish some pairs before the rest of the jobs start, then the later jobs do not see the same list of unfinished pairs, potentially causing some pairs to be skipped. Ryan, you mentioned recomputing matches that were already done, but that was not by design and they should be skipped. Are you seeing a bug there?
Oh, I see now: the existing match is checked for again later, so no, I don't think I am recomputing matches; I had thought the `pairnames` filtering was the only check. I also see what you mean about the jobs not starting simultaneously and therefore getting different `pairnames` lists. OK, makes sense to me now, thank you!
https://github.com/YuelongWu/feabas/blob/6993c20b131e1002e83083c0a29afa78746bb459/scripts/thumbnail_main.py#L416
Here a given index in `pairnames` depends on what has already been done (.h5), and it got really confusing when many thumbnail matching jobs were running at the same time. I thought it would be more consistent if every command indexed into the same `pairnames` list, built directly from `img_list`.
For example, suppose there are 4 pairs to compute (one pair per job: job_1, 2, 3, 4), each pair on one node, across 2 nodes. job_1 and job_2 go to the 2 nodes first; job_3 and job_4 wait. By the time job_3 starts running, there are only 2 pairs left on the list, and there is no longer a pair_3 at that index to compute, because `pairnames` is a list of what hasn't been done yet (the original pair_3 and pair_4).
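The failure mode above can be sketched as follows. This is a hypothetical illustration (names like `pairs_for_job` are made up, not feabas's API): each job slices the list *after* filtering out finished pairs, so jobs that start later see a shorter list and their index can fall off the end.

```python
all_pairs = ["pair1", "pair2", "pair3", "pair4"]

def pairs_for_job(done, job_idx, num_jobs):
    """Filter out finished pairs, THEN take this job's stride slice (the buggy order)."""
    remaining = [p for p in all_pairs if p not in done]
    return remaining[job_idx::num_jobs]

# Jobs 1 and 2 start first, nothing is done yet:
early = [pairs_for_job(set(), i, 4) for i in (0, 1)]
# By the time jobs 3 and 4 start, pair1 and pair2 have already finished,
# so the filtered list has only 2 entries and indices 2 and 3 select nothing:
late = [pairs_for_job({"pair1", "pair2"}, i, 4) for i in (2, 3)]
print(early)  # [['pair1'], ['pair2']]
print(late)   # [[], []]  -> pair3 and pair4 are never assigned to any job
```

Indexing every job into the same unfiltered list (and checking per-pair for an existing .h5 later) avoids this shift, at the cost of late-starting jobs briefly considering pairs that are already done.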