Closed trws closed 1 month ago
@garlick, @jameshcorbett, @milroy, have a look at this. The code path that causes the crash in #1298 should be completely gone, and it should make things both faster and a bit more self-validating since we'll catch unknown resource types at jobspec parsing time rather than later. Unfortunately I haven't been able to repro the issue from #1298, and my attempts to get onto elcap to do it have been foiled by some really gnarly latency from Perth (at pawsey it's sometimes 5 seconds to see a character in ssh 😬 ). If someone could give it a poke and verify I'd really appreciate it.
I think this looks reasonable, what does testing it out on elcap look like? ie what would I need to do
I'm not 100% sure other than build it and run some bulksubmit jobs. @grondo did you ever manage to repro it?
I was never able to reproduce it. It seemed to occur 10% of the time at 128 nodes. The jobspec and batch script are in my homedir at fluxion-crash
.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 75.3%. Comparing base (
d11a3d5
) to head (852543f
). Report is 4 commits behind head on master.
I'm still debating this change to be honest. The refcounted strings for resource types are showing up as a meaningful cost in comparisons and refcounting, it's not terrible but it's also not great. This solves the issue by making it so that dense storage interners have a notion of being "final" or placed in a state where new strings can't be added anymore. It also adds an interface to "open" a finalized interner for updates, on a per-thread basis, so that we can do things like database updates. To my surprise, this passed all but one of our tests without re-opening the interners, and that last one by re-opening them in update_resource.
The upside is that after this change, and without using it to optimize any of the type maps or anything like that, we get another 20-30% improvement in throughput for some of my tests. I've seen as much as 250 jobs/s. The downside is that the interface is a little bit odd for this, and at the moment the only way we have to signal the thing is closed is to throw an exception. Given that the only source of new resource types after init should be jobspec though, which already throws exceptions for any parse errors (which is basically what this would be, if the resource type doesn't exist it can't match) I'm not sure how bad that is. Would love input on this. Might be easier to use if it worked with a lock/unlock pair so lock_guard would work, not sure.
I think this fixes #1298