The-OpenROAD-Project / OpenSTA

OpenSTA engine
GNU General Public License v3.0
389 stars 170 forks source link

NaN in report_power results #60

Closed gadfort closed 3 years ago

gadfort commented 3 years ago

I've been running into an issue with the report_power command. Some instances appear to have NaN for the internal power (Combinational). It appears to be traceable to the Power.cc file, on line 852 there is a comparison with 0.0 (which is what the activity in the to_pin is), but that comparison appears to result in 0.0 != 0.0. This is not 100% unexpected since it's a float comparison. https://github.com/The-OpenROAD-Project/OpenSTA/blob/0aa454a01906266e379539acb1af49e354ccaafd/search/Power.cc#L852 If this comparison uses the fuzzyZero from Fuzzy.hh, the issue does away, as it correctly identifies the activity as 0 instead of taking the else branch causing a #/0.0.

There is a second place in the Power.cc file that also makes a comparison with 0.0 on line 817. https://github.com/The-OpenROAD-Project/OpenSTA/blob/0aa454a01906266e379539acb1af49e354ccaafd/search/Power.cc#L817

I'm not sure if this was your intended use of fuzzyZero, but I saw it was there, so it was easy to use.

Please let me know if you need more information.

Thanks, Peter

jjcherry56 commented 3 years ago

I would like a test case. Always. divide by zero only fails when the divisor is zero, and fuzzy isn't called for. I suspect the real issue is that the duty is also zero and is not being tested.

gadfort commented 3 years ago

I will try to recreate the issue in the nangate45 pdk (unless there is a different library you prefer to work with), I'm unable to share the exact case I'm working with at the moment. That might take me a little bit. To clarify a little more, the issue is that on line 855 there is a div by 0 (because to_activity.activity() is 0.0, but the if statement is saying false), which results in nan (which is what the if statement on line 852 is trying to protect from I assume).

turning on debugging (set_debug power 2) this is the output for a single cell, this is the output. power: internal output g379595/Y (3-input NOR) power: cap = 0.500 power: when act/ns duty wgt energy power power: A -> Y 0.00 1.00 -nan 1.97e-16 -nan VDD power: when act/ns duty wgt energy power power: A -> Y 0.00 1.00 -nan -3.92e-19 -nan VO power: when act/ns duty wgt energy power power: B -> Y 0.00 0.00 -nan 2.90e-16 -nan VDD power: when act/ns duty wgt energy power power: B -> Y 0.00 0.00 -nan -6.14e-18 -nan VO power: when act/ns duty wgt energy power power: C -> Y 0.00 -nan -nan 3.65e-16 -nan VDD power: when act/ns duty wgt energy power power: C -> Y 0.00 -nan -nan -1.17e-17 -nan VO

Tracing through the values, C->Y, is the where the first nan appears and it's from line 855.

jjcherry56 commented 3 years ago

I understand it you may not be able to share the test case. Since you seem like you know what you are doing, I am attaching a patch file (patch -p 1 < nan.txt) that may shed some light on the issue. set a breakpoint at the print in checkGarlicNan to find the first NaN that shows up. This assumes it is caused by divide by zero, which has been the case in the past (hence the ifs to protect divisions). nan.txt

gadfort commented 3 years ago

I wasn't able to create a test-case that replicated the issue. Thanks for the patch, I applied it and was able to trace the 4 "gotta nan" outputs. I've attached the output below. All originate from findInputDuty, I've called out where they are called from in the output below: power: internal output g379595/Y (3-input NOR) power: cap = 0.500 gotta nan / 0.0000000000 (call to findInputDuty on line 775) gotta nan / 0.0000000000 (call to findInputDuty on line 775) power: when act/ns duty wgt energy power gotta nan / -nan power: A -> Y 0.00 1.00 -nan 1.97e-16 -nan VDD power: when act/ns duty wgt energy power gotta nan / -nan power: A -> Y 0.00 1.00 -nan -3.92e-19 -nan VO power: when act/ns duty wgt energy power gotta nan / -nan power: B -> Y 0.00 0.00 -nan 2.90e-16 -nan VDD power: when act/ns duty wgt energy power gotta nan / -nan power: B -> Y 0.00 0.00 -nan -6.14e-18 -nan VO gotta nan / 0.0000000000 (call to findInputDuty on line 785) power: when act/ns duty wgt energy power gotta nan / -nan power: C -> Y 0.00 -nan -nan 3.65e-16 -nan VDD gotta nan / 0.0000000000 (call to findInputDuty on line 785) power: when act/ns duty wgt energy power gotta nan / -nan power: C -> Y 0.00 -nan -nan -1.17e-17 -nan VO

I took a look at the value of to_activity.activity(), which is 2.0302513946e-34 (basically 0.0, but not quite.). Looking a little closer at the calculation from_activity.activity() / to_activity.activity() * duty1, it turns out that from_activity.activity() / to_activity.activity() == inf and when multiplied by duty1 == 0.0 becomes the NaN.

I was able to prevent the inf/overflow by rearranging the calculation to (from_activity.activity() * duty1) / to_activity.activity(), here the calculation becomes (0.327 * 0.0) / 2e-34 == 0.0/2e-34 == 0.0, whereas before it was inf * 0.0 == NaN.

So if changing the if statement on line 852 check for fuzzyZero is the incorrect utilization of that, the other solution would be to rearrange the calculation to prevent the inf from appearing and the NaN's when computing the power. I tried this and reran the whole design and did not get any NaN's. I reset the search/Power.cc and created a new patch with the change I made so you can see the code I ran: reordered.txt

If there is anything you may need from me, please let me know.

jjcherry56 commented 3 years ago

Well, I learned something about the nastyness of floating point numbers. I will change it to multiply by duty1 before the divide.

This has me really wondering where on earth the 2.0302513946e-34 comes from. That looks a lot like an uninitialized variable value. One way to track it down is with "sta::set_debug power 3" but I am guessing your design is too big for that to be reasonable. I am attachiing another patch that should catch it if you break on the print statement in PwrResult::activity(). small.txt

jjcherry56 commented 3 years ago

Never mind chasing the small numbers; I can reproduce them locally.

jjcherry56 commented 3 years ago

I pushed a commit that should fix it. bdd74687 power nan-proofing

gadfort commented 3 years ago

Thanks for looking into this.