The various SplitCounters() method implementations have a loop that is susceptible to hangs given bad GPU data. For example, for the GPU I was adding, a goof in the auto-generated files left a counter group with a max-counters value of zero. When I ran the color cube app, it hung in GpaSplitCountersConsolidated::SplitSingleCounter() because CanCounterBeAdded() always returns false in that case, so the loop just keeps trying to add the counter to the next pass, which goes on for billions of iterations.
while (done_allocating_counter == false)
{
....
}
The hang was time-consuming to debug. I think we can probably pick a maximum pass count that is high enough to be impractical in the real world, and error out (with a log statement) if the loop exceeds it.
I've seen this problematic pattern in at least one other method: GpaContextCounterMediator::ScheduleCounters()
There may be more.
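To make the suggestion concrete, here is a rough sketch of what a bounded version of the loop could look like. This is not the actual GPA code: `kMaxPassCount`, `PassState`, the simplified `CanCounterBeAdded()` signature, and the `fprintf` logging are all stand-ins I made up for illustration, and the real cap value would need to be chosen by someone who knows the hardware limits.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical cap on the number of passes; the value 1000 is an assumption.
constexpr unsigned int kMaxPassCount = 1000;

// Hypothetical stand-in for the per-pass bookkeeping of one counter group.
struct PassState
{
    unsigned int counters_in_group = 0;
};

// Simplified stand-in for the real check: a counter fits only if the group
// has not yet reached its maximum. With a maximum of zero it never fits.
static bool CanCounterBeAdded(const PassState& pass, unsigned int max_counters_in_group)
{
    return pass.counters_in_group < max_counters_in_group;
}

// Returns the index of the pass the counter was placed in, or -1 if no pass
// could take it before the cap was reached.
static int SplitSingleCounter(std::vector<PassState>& passes, unsigned int max_counters_in_group)
{
    unsigned int pass_index              = 0;
    bool         done_allocating_counter = false;

    while (!done_allocating_counter)
    {
        if (pass_index >= kMaxPassCount)
        {
            // Bail out with a log message instead of looping forever.
            std::fprintf(stderr,
                         "Counter could not be scheduled within %u passes; "
                         "the counter group data is probably invalid.\n",
                         kMaxPassCount);
            return -1;
        }

        if (pass_index >= passes.size())
        {
            passes.emplace_back();  // Open a new pass on demand.
        }

        if (CanCounterBeAdded(passes[pass_index], max_counters_in_group))
        {
            ++passes[pass_index].counters_in_group;
            done_allocating_counter = true;
        }
        else
        {
            ++pass_index;  // Try the next pass.
        }
    }

    return static_cast<int>(pass_index);
}
```

With a max-counters value of zero, this version returns -1 after hitting the cap instead of spinning, which would have pointed straight at the bad generated data.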