AcademySoftwareFoundation / OpenShadingLanguage

Advanced shading language for production GI renderers
BSD 3-Clause "New" or "Revised" License

Fix userdata binding corner case #1673

Closed aconty closed 1 year ago

aconty commented 1 year ago

Description

This includes two fixes:

1) When registering symbols that need user data, sort the entries in the set so that the layer number is ignored. Whether a userdata item is needed shouldn't depend on the layer, and keeping separate per-layer entries makes find_userdata_index() sometimes find an index with a different derivs status.

2) osl_bind_interpolated_param() is memcpy'ing derivs that might not be there, yielding corrupted derivs and possibly a crash.
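For fix (2), the idea can be sketched as follows. This is a hypothetical illustration, not the actual osl_bind_interpolated_param() code: the struct, function names, and layout are invented for the example; the point is simply that derivative bytes are only copied when the source actually provides them.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical userdata value: a 3-float base value plus optional derivatives.
struct UserDataValue {
    float val[3];     // base value
    float dx[3];      // derivative wrt x (valid only if has_derivs)
    float dy[3];      // derivative wrt y (valid only if has_derivs)
    bool  has_derivs; // did the renderer actually supply derivs?
};

// Bind a retrieved userdata value into symbol storage laid out as
// [val, dx, dy]. 'sym_derivs' says whether the destination symbol
// carries derivative slots at all.
void bind_userdata(float* dst, bool sym_derivs, const UserDataValue& src)
{
    std::memcpy(dst, src.val, sizeof(src.val));
    if (sym_derivs) {
        if (src.has_derivs) {
            std::memcpy(dst + 3, src.dx, sizeof(src.dx));
            std::memcpy(dst + 6, src.dy, sizeof(src.dy));
        } else {
            // No source derivs: zero the slots rather than memcpy'ing
            // bytes that were never written (the corruption/crash case).
            std::memset(dst + 3, 0, 6 * sizeof(float));
        }
    }
}
```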


lgritz commented 1 year ago

So this is the fix for the weird problem we were looking at last week?

Background: For those outside SPI, we had stumbled across a shader network that produced incorrect results when we DISABLED instance merging (when two shader instance nodes in a group have identical inputs and do identical computations, they are merged and the newly single node's outputs are sent to wherever the original ones' outputs went). We're accustomed to occasionally seeing some kind of logic error on our/my part where two nodes were merged when it wasn't actually safe to do so, but in this case, the merged result was correct but when merging was disabled, it was wrong. Real head scratcher.

Looks like Alex has tracked it down, which is awesome and an impressive feat of detective work.

Alex, I wonder if you could spell out for us exactly what was going wrong and how this fixes things? I think I get the gist -- there were two nodes that needed the same bit of named userdata, but one needed the derivs of the userdata and the other did not. Once merged, the derivatives would be retrieved by the merged node, since its output would be connected to the downstream sink that needed derivs. But when unmerged, if the one that didn't need derivs retrieved the userdata first, no derivs would be present (and by the time the one that needed derivs runs, the userdata is already computed and cached, deriv-free).

I think I get that part, but I am still a little hazy about how exactly this is fixed by removing the layer index from the sorting criteria of the list of userdata we will need. Won't you still have two entries in the list, one needing derivs and the other not, and so there is some kind of order dependence you are inadvertently relying on? Oh, is it a set of some kind, so there is not a second entry at all if the sort doesn't make them look different? But even in that case, if the first one added to the set doesn't need derivs but the second does, what is the mechanism by which the record already in the set becomes aware that it needs that userdata to compute derivs?

aconty commented 1 year ago

Exactly, it is a set<> built on an index that was taking the layer number into account. Now that it ignores it, both layers get the same entry in the set, and since one needs derivs, it promotes the entry. Without this change two entries are put in the set, and then find_userdata_index() will get the first one, which may not have derivs and will fail to retrieve them.
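The promotion mechanism can be sketched like this. Again, this is a hypothetical illustration rather than the actual OSL data structure: entries are keyed by name only (layer ignored), and a later registration that needs derivs OR-promotes the shared entry.

```cpp
#include <cassert>
#include <map>
#include <string>

// One shared entry per userdata name, regardless of which layer asked.
struct NeededUserData {
    int  index;   // slot later handed back by find_userdata_index()
    bool derivs;  // true if ANY layer needs derivatives of this value
};

using UserDataSet = std::map<std::string, NeededUserData>;  // keyed by name only

// Register a userdata need; the layer is deliberately ignored.
int register_userdata(UserDataSet& set, const std::string& name,
                      int /*layer*/, bool needs_derivs)
{
    auto [it, inserted] = set.try_emplace(
        name, NeededUserData{static_cast<int>(set.size()), needs_derivs});
    if (!inserted)
        it->second.derivs |= needs_derivs;  // promote the existing entry
    return it->second.index;
}
```

With this, two layers asking for the same name collapse to one entry whose derivs flag is the union of both requests, so find_userdata_index() can no longer land on a deriv-free duplicate.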

We had two identical layers asking for Pref [ lockgeom off ]. One needed derivs, the other one did not. When not merging we were unlucky and find_userdata_index() was always going for the one without derivs.