The (kind of) good news is that scavenge does appear to work, though not optimally when the cache is shared: all instances of scr_copy will flush the same set of files from the cache to the parallel file system.
Here is a conversation with good notes between me and @adammoody concerning the topic:
[10:36 AM] McFadden, Marty I was able to confirm that each rank that is using a shared cache happily re-copies all of the files from the cache to the shared file system. Besides the performance cost, things can probably get mixed up when/if copied files are deleted from the cache after they are copied.
[10:40 AM] McFadden, Marty Is there a specific reason for using pdsh instead of MPI with scr_copy? I understand that there would be changes required to make scr_copy an MPI program and that there would need to be additional work to distribute work if/when the number of ranks being run is less than the number of ranks used in the initial allocation, but are there other reasons? I suppose that another option would be to pass along an index number to each of the instantiations of scr_copy and still use pdsh (I'm not sure if this is possible, but it might be less work)
[12:21 PM] Moody, Adam T. The reason for using pdsh now is the assumption that we typically run scavenge due to a node failure, and some systems don't allow one to launch MPI jobs in an allocation after a node has failed. Also there is a decent chance that if one node has failed that we know about, there could be other failed nodes that we haven't yet discovered as failed. Either of those situations would cause problems for an MPI launch.
Having said that, I think we should actually look at making scr_copy an MPI job. It would be really helpful in this case. Also, systems today are more tolerant of running MPI in an allocation with failed nodes, and our node failure detection is pretty decent.
I think I'd like to keep the pdsh + serial scr_copy method as a plan B in our back pocket in case there is some situation where we need it. Maybe that just means keeping a branch with the code somewhere if not actually installing it. That wouldn't help much with shared cache, but it'd be a nice fallback option for node-local storage users in case they are on a system where the above limitations show up.
[12:58 PM] McFadden, Marty
I assume that the rank # is associated somehow with each file that is in the cache. Is that true? Here is what I see from a 3 node scavenge:
```
$ tree -a scr.dataset.6
scr.dataset.6
|-- .scr
|   |-- filemap_0
|   |-- filemap_1
|   |-- filemap_2
|   |-- reddesc.er.0.redset
|   |-- reddesc.er.0.single.grp_1_of_3.mem_1_of_1.redset
|   |-- reddesc.er.1.redset
|   |-- reddesc.er.1.single.grp_2_of_3.mem_1_of_1.redset
|   |-- reddesc.er.2.redset
|   |-- reddesc.er.2.single.grp_3_of_3.mem_1_of_1.redset
|   |-- reddesc.er.er
|   |-- reddesc.er.shuffile
|   |-- reddescmap.er.0.redset
|   |-- reddescmap.er.0.single.grp_1_of_3.mem_1_of_1.redset
|   |-- reddescmap.er.1.redset
|   |-- reddescmap.er.1.single.grp_2_of_3.mem_1_of_1.redset
|   |-- reddescmap.er.2.redset
|   |-- reddescmap.er.2.single.grp_3_of_3.mem_1_of_1.redset
|   |-- reddescmap.er.er
|   `-- reddescmap.er.shuffile
|-- rank_0.ckpt
|-- rank_1.ckpt
`-- rank_2.ckpt
```
Currently, each scr_copy will copy all of these from the shared cache.
I wonder whether there is anything that scr_copy can read in from the pattern of filemap_X to indicate whether the .scr directory is on a shared cache?
I think that we want to avoid having N copies of scr_copy each copy over the same set of files from a shared cache, even when it is being run serially.
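A rough sketch of the heuristic being floated here, assuming the filemap_N naming shown in the tree above (illustrative Python, not part of scr_copy):

```python
import glob
import os
import socket

def cache_looks_shared(cache_dir, ranks_on_this_node):
    """Heuristic: in a node-local cache, a node only sees the filemap_N
    files for the ranks it hosted.  If this node can see filemap files
    for more ranks than it ran, the cache is probably on shared storage."""
    filemaps = glob.glob(os.path.join(cache_dir, ".scr", "filemap_*"))
    return len(filemaps) > ranks_on_this_node

# Example: this node ran 1 rank but sees filemap_0, filemap_1, filemap_2.
if cache_looks_shared("scr.dataset.6", ranks_on_this_node=1):
    print(f"{socket.gethostname()}: cache appears to be shared")
```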
[1:10 PM] Moody, Adam T. Right, we don't want every node to copy every file.
[1:11 PM] McFadden, Marty I mean even the serial scr_copy. Is there any way that the script could detect the pattern and then only launch one scr_copy?
[1:12 PM] McFadden, Marty It would be harder with caches that are only partially shared. But in the case of a globally shared cache, it seems like there might be some way?
[1:13 PM] McFadden, Marty (note: I am only talking about the serial scr_copy at this point)
[1:16 PM] Moody, Adam T. I can't recall offhand if there is something in the filemap that could be used. The other file that comes into play is the "flush" file. That's the file that SCR uses to track which datasets are in cache. We could perhaps add a note about the cache being global in there. However, that starts to special case the global cache, so I think we'd just want to look at that as a temporary solution.
[1:16 PM] McFadden, Marty true. hmm
It seems like we would need to have a new (MPI) program that will build a globally viewable file map
With file names and rank lists for each file, or maybe that is what scr_copy+MPI would do
But what you've said about MPI (sometimes) not working within a broken allocation concerns me. This is the only time that this stuff will be run, right?
[1:22 PM] Moody, Adam T. That and in any case where the user's job stopped before calling SCR_Finalize, e.g., bad use of SCR API or software bugs like segfault, divide by zero, etc.
Something a bit more general than just listing that the cache is global would be to describe the storage topology somehow, say in the flush file or somewhere else. For example, we could list the nodes that are designated as "storage leaders", i.e., nodes that had a process that was rank=0 of its store descriptor communicator. For node local storage, we'd list every compute node. For global cache, we'd just list the first node. Then we could modify scavenge to only launch scr_copy processes on the storage leader nodes. Or taking that a step further for higher copy performance, we could spell out full storage groups, launch scr_copy on all nodes, and then divide files up among members of each storage group.
Taking the long view, ultimately we want all/most/many of the surviving compute nodes to help during the copy operation in the scavenge. When writing a checkpoint, each compute node will have saved some data so that the dataset size scales as O(N), where N = number of compute nodes that were used in the job. We want the scavenge to scale similarly. For example, launching a single scr_copy on a single node would not be scalable since then we have a single compute node responsible for copying O(N) data, like 1 node copying data saved by 4000 nodes.
The problem with a global cache is that we have to figure out a way to coordinate that effort
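One way to picture the "storage leader" idea above: if the flush file (or some other metadata) recorded a leader and member list per storage group, the scavenge script could launch scr_copy only on the leaders. A hypothetical sketch, with made-up field names:

```python
# Hypothetical topology record that the flush file (or similar metadata)
# might carry.  For node-local storage every node is its own group and
# leader; for a global cache there is one group led by the first node.
storage_groups = [
    {"leader": "node1", "members": ["node1", "node2", "node3", "node4"]},
]

def scavenge_targets(groups, failed_nodes):
    """Pick the nodes that should run scr_copy: the leader of each
    storage group, falling back to another surviving member when the
    leader is known to have failed."""
    targets = []
    for group in groups:
        others = [n for n in group["members"] if n != group["leader"]]
        for node in [group["leader"]] + others:
            if node not in failed_nodes:
                targets.append(node)
                break
    return targets

print(scavenge_targets(storage_groups, failed_nodes={"node1"}))  # ['node2']
```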
[1:47 PM] McFadden, Marty Would associating a string in the globally managed flush.scr file that contained the storage leader node name and the names of all member nodes for each cached file be scalable?
If it could be, then perhaps this could be made to work with pdsh/scr_copy as each scr_copy could open a read-only copy of the flush.scr file and could do its work based upon which node it was running on
Or perhaps the scr_scavenge script could create a copy of the flush.scr file that it does some post-job processing on. Assigns nodes to files and such.
I say this because scr_scavenge knows which nodes are left to do any work
[1:52 PM] Moody, Adam T. Yeah, I think something like that would be good.
[1:54 PM] McFadden, Marty The only scalability piece that might be left on the table is for REALLY large files. But this might be an enhancement we can add to cause the post processing to state that a set of nodes should flush a specific file.
[2:02 PM] Moody, Adam T. Yep, I think we should figure out how to segment up shared files as part of this. The goal is to find something that works for anything from a file-per-process to a single shared file. Or if we have to pick one to start with, we don't want to come up with a design that prevents one from extending the solution to work on shared files. It's easy enough to assign segments of a shared file based on rank. I think the harder part is knowing when it's safe for a rank to start writing its portion into the file.
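The segment-assignment arithmetic mentioned above is simple enough to spell out; a sketch (not SCR code):

```python
def segment_for_rank(file_size, rank, num_ranks):
    """Split file_size bytes into num_ranks contiguous segments,
    spreading any remainder over the first ranks.
    Returns (offset, length) for this rank."""
    base, extra = divmod(file_size, num_ranks)
    length = base + (1 if rank < extra else 0)
    offset = rank * base + min(rank, extra)
    return offset, length

# Example: a 10 GiB shared file split across 3 nodes.
for r in range(3):
    print(r, segment_for_rank(10 * 1024**3, r, 3))
```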
[2:04 PM] McFadden, Marty I think that the only tricky part is creating the file, right? If so, would it be okay to have the script itself create the file? (I've no idea how well that would scale).
I also have a silly HPC question. How does one deal with 4000 nodes if they are uniquely named with various string sizes and each has very few (if any) similar characters (like "Freddy2" and "Bill3")? Or is that simply not allowed and administrators would NEVER let that happen?
I'm assuming that they are named in a sane way like "rzgenie01-NNN"
[2:08 PM] Moody, Adam T. I don't think we want the main scavenge script to create the file(s), because that wouldn't scale well in the other extreme of a file-per-process. That script could end up needing to create tens of thousands of files. I was thinking we might run two phases of pdsh commands. The first would launch a job that just creates/truncates files, and the second would copy data. That would allow for multiple nodes to help create the files if needed.
Another challenge that comes to mind here is what happens if there is some undetected failed node that is responsible for some portion of the file. We'll need a way to know that portion failed to copy. I guess we could gather and process return codes from all nodes that we launch on, or maybe we have each of those processes write some additional known data somewhere else that we can check later.
LLNL uses well-formed node names that can be grouped together. However, there are HPC centers that use node names that really look like random strings (no simple way to compress them).
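A sketch of the two-phase pdsh idea from a driver script's point of view; the scr_copy flags shown here (--create-only, --prefix) are placeholders invented for illustration, not the real interface:

```python
import subprocess

hosts = "node1,node2,node3"          # surviving nodes, comma delimited
prefix = "/p/lustre/user/prefix"     # placeholder prefix directory

# Phase 1: create/truncate the destination files so later writers have
# targets to write into.  "--create-only" is a hypothetical flag.
create = subprocess.run(
    ["pdsh", "-w", hosts, f"scr_copy --create-only --prefix {prefix}"],
    capture_output=True, text=True)

# Phase 2: copy data; each node writes only its assigned files/segments.
copy = subprocess.run(
    ["pdsh", "-w", hosts, f"scr_copy --prefix {prefix}"],
    capture_output=True, text=True)

# Inspect exit codes and output so a node that silently failed to copy
# its portion does not go unnoticed.
if create.returncode != 0 or copy.returncode != 0:
    print("scavenge incomplete; inspect pdsh output:")
    print(create.stderr)
    print(copy.stderr)
```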
[2:11 PM] McFadden, Marty Agreed. These are good notes. I will be adding them to the ticket associated with this work.
Whoa! Really!?!? (node names). Yikes!! Do I need to worry about that now? Do our instantiations of pdsh work with 10's of 1000's of those?
[2:13 PM] Moody, Adam T. We used to assume LLNL SLURM-range syntax in SCR, but we dropped that to just list hosts as comma delimited strings.
[2:14 PM] McFadden, Marty wow! that becomes a shell limitation at some point right?
[2:14 PM] Moody, Adam T. Yeah, pdsh itself works with long lists. Though I suppose we might need to invoke it with xargs or something from our side.
[2:15 PM] McFadden, Marty Okay. I'll take your word for it. But yowsa
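If a host list ever becomes too long for a single command line, the scripts could also chunk it themselves rather than going through xargs; a sketch:

```python
import subprocess

def run_pdsh_chunked(hosts, command, chunk_size=1024):
    """Invoke pdsh in batches so no single -w argument grows without
    bound.  hosts is a list of node names; returns the worst exit code."""
    worst = 0
    for i in range(0, len(hosts), chunk_size):
        batch = ",".join(hosts[i:i + chunk_size])
        rc = subprocess.run(["pdsh", "-w", batch, command]).returncode
        worst = max(worst, rc)
    return worst
```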
Yeah, I've run into odd names in SCR and in feedback on other projects, like:
https://github.com/LLNL/scr/pull/95
https://github.com/LLNL/mpiGraph/issues/5
Rather than just a cluster name and number like we do, some sites encode physical location information into their node names, like row number, rack number, rack height, slot number, etc. That helps their ops team find a node just given the name, but it leads to names like "row5-rack6-slot10"
The scavenge operation assumes files are cached in node local storage. After the run stops, the scavenge script launches the scr_copy executable on each compute node via pdsh. On each node, this executable runs as a serial process. It reads the SCR cache index in the control directory (/dev/shm) on each node to get the cache location for a specified dataset. It then scans the cache directory and copies every application file and every SCR metadata/redundancy file from cache to the prefix directory.
When cache is a global file system, this will need to be fixed. I'm not sure offhand if this will only copy files through a single compute node, or whether every compute node will try to copy every file and step on each other. Either case is bad. Ideally, we'll want to parallelize this copy operation, so that multiple compute nodes help in the copy, but they each copy a distinct set of files.
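One simple way to give each surviving node a distinct subset of the cached files would be a deterministic assignment like the sketch below (illustrative only; this is not how scr_copy currently behaves):

```python
def files_for_node(all_files, node_index, num_nodes):
    """Deterministically partition the file list so every node copies a
    distinct subset and the union covers everything.  Sorting plus
    modulo keeps the assignment stable across nodes with no
    communication between them."""
    return [f for i, f in enumerate(sorted(all_files))
            if i % num_nodes == node_index]

files = ["rank_0.ckpt", "rank_1.ckpt", "rank_2.ckpt"]
for node in range(2):
    print(node, files_for_node(files, node, 2))
```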
If the redundancy scheme is SINGLE, perhaps we can skip copying the redundancy data files?
Also, the application file may be a large, single file. To be efficient, we'd need to transfer that in parallel by assigning different segments to different compute nodes. Is there a way to do that with a bunch of independent scr_copy tasks or do we need to turn scavenge into an MPI job?
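For the last question, independent tasks could in principle split a large file by byte range if each knows its node index and the total node count. A sketch that reuses the segment arithmetic from earlier and assumes the destination file has already been created at full size in a prior phase:

```python
import os

def copy_segment(src_path, dst_path, node_index, num_nodes,
                 bufsize=16 * 1024 * 1024):
    """Copy only this node's byte range of src_path into dst_path.
    Assumes dst_path already exists at full size (e.g., created in an
    earlier phase) so independent writers do not race on creation."""
    size = os.path.getsize(src_path)
    base, extra = divmod(size, num_nodes)
    length = base + (1 if node_index < extra else 0)
    offset = node_index * base + min(node_index, extra)

    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        src.seek(offset)
        dst.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = src.read(min(bufsize, remaining))
            if not chunk:
                break
            dst.write(chunk)
            remaining -= len(chunk)
```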