L3tum / HardwareInformation

.NET Standard Cross-Platform Hardware Information Gatherer
MIT License
56 stars 11 forks source link

Possible deadlock/freeze in IntelInformationProvider.GetNumberOfPhysicalCores() #32

Closed oysteinkrog closed 1 year ago

oysteinkrog commented 2 years ago

We use your library in our commercial application. Recently we received a bug report that the application was freezing on startup. In our investigation we found that the process is stuck in the IntelInformationProvider.GetNumberOfPhysicalCores() method: https://github.com/L3tum/HardwareInformation/blob/c3fbe44d09bd52b219db2dd4efe6c181205e2ff5/HardwareInformation/Providers/IntelInformationProvider.cs#L143 The customer is using a very recent CPU (Intel Alder Lake), so I am wondering if it is possible that there is a bug in this method that is now only exposed due to the new CPU? Please see the attached CPU-Z report: DESKTOP-63MBB0H.txt I can provide a dump of the process if needed :) (Please contact me at oystein@initialforce.com)

oysteinkrog commented 2 years ago

Screenshot of VS2022 showing the dump at the deadlock site: bilde

L3tum commented 2 years ago

Hey, thank you for the report.

I've been slowly rewriting this library to be more flexible but I'm probably going to make a simple update if I get ADL to work. Do you have any test systems of ADL? I don't have a hybrid system myself. I'm a bit curious if Lakefield just simply never hit many customers so it didn't expose bugs like this or if it's just ADL that does some fuckery.

L3tum commented 2 years ago

Hey, can you test with v5.0.1? I think it should work, and I've added a guard to stop after iterating for int.MaxValue so at the very least the freeze should be gone.

I'm still not quite sure what was wrong with the code though, in theory at some point the Intel CPU should return 0 as a sign that there's no more Cores to iterate through, and yet it somehow didn't I guess. I doubt it's an errata mind you, but it's weird nonetheless.

oysteinkrog commented 2 years ago

@L3tum Thanks! We will test on the affected customers system as soon as possible (we need to arrange for remote support session etc). I will update you on the results :)

oysteinkrog commented 2 years ago

@L3tum Our support team was able to test 5.0.1 on the affected customers system, and now it no longer deadlocks, but it takes a long time for our application to start. We have not had a chance to grab another dump yet, but I suspect the reason for the slow start is probably that it iterates through the entire int.MaxValue range (maxIterations variable). Did you consider a lower value for maxIterations?

L3tum commented 2 years ago

Thanks for the reply. I did consider a lower limit, but was hesitant as it essentially introduces a Y2K (etc.) dependency into it. That said, I don't think that there's probably going to be single-socket systems with that many cores. I think 1024 or something would be a good limit as well.

It's somewhat weird still that the code as is doesn't quit beforehand. I've read through the Intel Manual again and they specifically say this is the way (checking for type to be 0). It's also weird that it takes a long time, considering it's 2 instructions and an if-check (plus a bit of boilerplate). Can you give me a timeframe of how long the startup took?

Also, I'll be doing a 5.0.2 release later today or this week and limit it to a more reasonable 1024 or some such.

L3tum commented 2 years ago

@oysteinkrog I've released 5.0.2 now, limiting the iteration count to 256. I'm probably going to be reworking the library a bit, because the current model just simply can't really work with heterogeneous architectures. Hopefully this would fix the underlying issues