Closed JenusL closed 4 years ago
You can follow along here https://github.com/JenusL/sitoa/tree/feat/a5.3.0
Ok so I just added the GPU functionality and hit a problem...
closure
shader isn't supported on the GPU!
# WARNING : [arnold] [gpu] shader name unknown to OptixProgramCache: closure
# WARNING : [arnold] [gpu] ignoring shader Sources.Materials.DefaultLib.Material.closure.SItoA.1000.1, node not supported on the GPU
Even before I encountered this, I had plans to skip the shader from being exported to Arnold, so that SItoA scenes would be more interchangeable with other packages. So I guess I have to fix that first. Unless there's a way to get custom shaders to render on the GPU. @caron and @sjannuz What do you think?
sounds like we need to skip it... so if there are no custom nodes supported on the GPU then we got many other sitoa shaders which aren't going to work too.
Yes, we need to skip it. It's just a passthrough. Custom (SItoA-specific) c++ can't work right now. In the other plugins, we have ways to signal the user a given node is not supported.
Ok thanks for the info. I've also seen that the other plugins now have translators from DCC shaders to Arnold shader trees. That would be something to consider as well an much of the code can probably be lifted from MtoA. But I'll just skip the closure as a first step so we can get a new version out the door.
That said, I've run in to some other issues as well… When I added Optix Denoiser I had some real struggle to get the render region to show the updates from Optix between each pass. Sort of the same problem is still there and can be seen on the lower IPR levels when rendering with just CPU. It's like Arnold is too fast and doesn't get Softimage enough breathing room to actually display the buckets that just got rendered. When rendering with GPU nothing gets showed at all until later in the rendering or if it gets stopped somehow, like just rotating the camera. Then it will show all the rendered pixels from before I rotated the camera. With CPU renderer, decreasing bucket size makes it worse, while increasing bucket size makes it better. This leads me to think that it's somehow related to the speed / number of calls to NewFragment() that's bogging down Softimage. Maybe we could use a timer or something to limit the calls or probably a better way is when current IPR / progressive frame is done, wait a tiny bit so that Softimage can display the result before going to the next progression. We're talking milliseconds here so shouldn't be noticible. I would really like some help in figuring this out the best way. Any relevant code is in DisplayDriver.cpp and RenderTile.cpp and maybe in RenderInstance.cpp as well.
@sjannuz I have a question… One of the new features in 5.3 is this:
Denoising albedo: A built-in AOV with an albedo optimized for denoising has been added as denoise_albedo. Both the Arnold (noice) and OptiX™ denoisers will use the new AOV if present. (#6955, #8168, #8069)
Right now we export diffuse_albedo
, N
and Z
for use in noice. Should I skip diffuse_albedo
and export the new denoise_albedo
instead? Or do noice want both?
MtoA hasn't changed this yet so I don't have any reference to go on.
Just ran the testsuite for the first time. I think every test with a shape instanced on a strand is failing. Two other tests crashed. Total of 10 failed tests. Will investigate tomorrow.
For noice
, just replace diffuse_albedo
with denoise_albedo
The executable accepts both, and will use denoise_albedo
if found, and fall back to the older diffuse_albedo
otherwise.
Ok so I'm closing in on the reason why some strand tests fail.
Bug seems to be in varying
data on strands.
https://github.com/Autodesk/sitoa/blob/52a0a4b35f51bccb6223a8206d04b40edc80071c/plugins/sitoa/loader/ICEHelpers.cpp#L2372-L2402
First AiArrayAllocate is called. Then we fill the array with data but we skip some array indices: A strand of 3 segments has 4 values in Arnold. I don't know why that is yet. Anyway, that 4th value is never set with AiArraySetX, but just left unset. Arnold 5.2 must have handled that ok. Maybe AiArrayAllocate in 5.2 filled it with zeroes, but in 5.3 I get random values sometimes, zeroes other times. I have to investigate further, just wanted to give a heads up. @sjannuz Anything changed with Arrays in 5.3?
After further investigation, I found that there's nothing really wrong in SItoA or Arnold. The strand scene I'm looking at now (test_0025) has 4 strand segments on each strand. Then the "first segment fix" is added only to StrandPosition. So now StrandPosition has 5 segments but all other Strand attributes has only 4 segments. If all 4 Strand values are the same, it's considered uniform and exported as a constant to Arnold. Problem is with varying data. Then SItoA only exports the 4 values, skips the 5th because it doesn't exist in Softimage, then continue on to the next Strand. That leaves every 5th value in the varying array uninitilazed in Arnold, producing random values, NaNs, etc in that 5th value. My idea of a fix here is to abort the render and log an informative error message if the Strand* attribute exported as varying doesn't have the same array length as StrandPosition. Test scenes will probably fail after this so those have to be updated. Another way to fix it would be to just repeat the last value, or interpolate the array to the larger size. @sjannuz and @caron What do you think? Which fix would you prefer?
@JenusL i haven't had a chance to wrap my head around the issue and i am not sure i will for a few more days.
in the mean time, can we cross reference the old trac ticket database to see if there is more detail as to what this code is doing and if/when it was "fixed" to resolve some bug or previous version arnold behavior?
also, when i ran the testsuite last i was passing all but 8 tests and only 1 of the strand tests failed (test_0195) but as a crash. i would hate to have to edit this area of the code if we don't have to.
I don't think anything has changed in Arnold related to arrays or curves. Unfortunately, I'll be fairly busy for the next week too, to be able to help you. Maybe if you could export a simple ascii .ass and compare kicking it with 5.2 and 5.3, we could ask the core guys about that.
@caron Well the code is exporting IceAttributes to Arnold. Either way, it needs to be fixed. I will fire up old Mental Ray and see how that handles different sizes of the Strand* arrays.
@sjannuz I don't think I need your help, just your input on behavior when Strand attributes have different array sizes. I will se how Mental Ray does it and get back here. An .ass wouldn't help here as the error is in the ass-file as well. To be clear, it's not a bug in Arnold! The "bug" is that we don't handle Strand attributes that has less samples than StrandPosition. We have just been very lucky that it hasn't been caught in the tests in earlier Arnold versions. Beacuse I can now replicate it in earlier Arnold versions as well, but it's not showing as often.
Here's a very simple python script that simulate the issue in SItoA. Run it several times, and you will see that the skipped values will be different in the .ass-file.
from arnold import *
AiBegin()
AiMsgSetConsoleFlags(AI_LOG_INFO)
c = AiNode('curves', 'test_curve')
# two strands, each with 5 points
strands = [5, 5]
AiNodeSetArray(c, 'num_points', AiArray(len(strands), 1, AI_TYPE_UINT, *strands))
# strand data that has only 4 samples per strand
stranddata = [list(range(1,5)) for strand in strands]
# we still allocate the array to have 5 samples per strand
strandattrib = AiArrayAllocate(sum(strands), 1, AI_TYPE_FLOAT)
# this loop will naturally skip the nonexisting 5th element leaving it uninitialized
for offset, strandsegments in enumerate(strands):
for i, data in enumerate(stranddata[offset]):
AiArraySetFlt(strandattrib, i+offset*strandsegments, data)
AiNodeDeclare(c, 'StrandAttribute', 'varying FLOAT')
AiNodeSetArray(c, 'StrandAttribute', strandattrib)
AiASSWrite('varying_strandattribute.ass', AI_NODE_ALL, False, False)
AiEnd()
Again, this behavior is totally fine and just makes sense. So nothing is wrong in Arnold. We just need to fix the IceAttributes exporter to handle varying data on Strands in the cases where a StrandAttribute doesn't have the correct number of samples.
@JenusL can you share a repo scene file with me? just send to my email.
test_0025 didn't fail for me previously and i ran it again on your `feat/a5.3.0' branch to confirm and it still doesn't fail. also, looking closer at test_0195, it is just failing on the noicon.pic/texture search path issue.
do we know exactly why StrandPosition
is getting the extra point in it? i have a guess but i think we need to be sure, then we can decide how or if we should change anything.
@caron Well test_0025 and the python script above is pretty good repos. It doesn't happen to me all the time either. You have to render, destroy, render, destroy several times for it to show up. Or in the case of the python script, run that several times as well. The uninitialized memory allocated by Arnold will change. Maybe it's Windows 10 that uses the memory differently than before. I have tested both on two separate machines and both Arnold 5.2.2.1 and Arnold 5.3 shows the same behavior. But in 5.2 I have to refresh more times for it to show up. So it's just a coincident that I caught this bug now when somehow most tests showed this behavior for me at the same time. Why StrandPosition has 5 samples while the other Strand attributes only have 4 is pretty clear in the scene of test_0025 here: So in other words, the test itself is a invalid scene. And all we might need to do is fix the tests. But still, SItoA doesn't do any checks at all and always assumes Strand attributes of same size as StrandPosition.
ok, this reminded me of what the reason for this was. mental ray and arnold have different behavior for the particle position. mental ray always connects the particle position to the first point in StrandPosition
from 10 years ago! https://trac.solidangle.com/sitoa/ticket/310
so let's talk about this in #70
Ok I've fixed the strand attributes issue now in #71
I had hoped that would fix all tests that failed for me but I'm afraid it didn't. It only fixed test_0025.
I still have tests 56, 77, 104, 105, 119, 120 and 236 failing. They all have instanced meshes on ICE strands or hair objects.
I've spent a long time trying to figure out what is wrong in the code but can't find anything wrong.
If I compile with 5.2.2.1 all tests works, with 5.3.0.1 they all fail.
The only thing I can think of is that the code for instanced shapes on strands and hair uses AiNodeClone()
and then replaces vlist
and nlist
on the clone to create the instances.
So I'm like 80% sure that the problem is there and that it's a bug in Arnold core. And since it's an API function, an .ass file wont help to figure out what's wrong.
@sjannuz Any known core issues about this?
I just need to chime in quickly and say those tests didn't fail for me earlier. We need to be sure we have the same failures. I am running the entire suite again using the latest branch.
Please run the tests several times or open the tests in Softimage and render them several times. They doesn't always fail for me, but most of the times they do. I've tested on the two machines I have available to test on. Both are running the latest Win 10 (1809).
ok i was able to get a lot of failures now with feat/a5.3.0
branch and 5.3.0.1 core.... and i also rolled the code back to master
branch with 5.3.0.1 then ran the testsuite and i also got a lot failures.
so glancing at the release notes i noticed this in 5.3...
and in this in the bugfix release 5.3.0.1...
so maybe you're right, there is a new behavior or issue with how sitoa is updating those vlist
and nlist
parameters. maybe we can reproduce the issue with a simple python script. i am going to play around with that and see if i can learn anything. i also just want to understand what the new change does in a isolated place.
Good to know you can reproduce as well. I don't think I've ran the tests with 5.3.0.0 so that could be done, just to see what happens. Also, let's not do any changes on this until we hear back from @sjannuz or anyone else at AD, so we know what's really going on.
I just compiled with 5.3.0.0 and it has the same issues. So it's not any of the changes made in 5.3.0.1 that's causing it.
Arnold 5.3.0.2 was released today and that is also failing the tests.
I noticed another issue as well... I can't get OptiX Denoiser to work with GPU renderer. AiOutputIteratorGetNext()
never gets RGBA_denoise
when rendering with GPU.
@sjannuz Is this a known limitation that isn't documented? Or is it a completely different mechanism to use OptiX Denoiser with GPU?
I could repro the cloning issue in a standalone with 5.3, I'll check with the core guys on Monday. The problem seems to derive from calling AiNodeSetArray and THEN modifying the array content. Doing the other ways around (fully setting the array and then assigning it) seems to work.
Here is a code snippet, cloning a box after loading it from a .ass:
void main(int argc, char **argv)
{
AiBegin(AI_SESSION_BATCH);
AiMsgSetConsoleFlags(AI_LOG_ALL);
AiASSLoad(argv[1]);
AtNode* box = AiNodeLookUpByName("/box");
AtNode* box2 = AiNodeClone(box);
AtArray* vlist = AiArrayCopy(AiNodeGetArray(box, "vlist"));
AiNodeSetArray(box2, "vlist", vlist);
uint32_t nb_elements = AiArrayGetNumElements(vlist);
for (uint32_t i = 0; i < nb_elements; i++)
{
AtVector v = AiArrayGetVec(vlist, i);
v.x += 2.0f;
AiArraySetVec(vlist, i, v);
}
AiRender();
AiEnd();
}
Thank you very much for taking time to test this @sjannuz! So let's wait until we hear back from the core team and then decide what to do.
Did you have any info about my last comment here? About the Optix Denoiser when rendering on the GPU?
About the above code, since 5 we should not modify an array after having set it on the node. The fact that it worked it was just kind of luck.
Ok good to know. I will fix it in SItoA then. Now when we know what the error is it should be pretty easy to fix.
@JenusL my operator code, which re uses LoadShaderParameter, uses this set, get, modify technique on arrays. i put in a detailed commit message outlining that i did that...
599d1d87baf224b645093264f9ac0a8f4941d83d
i didn't actually want to do that but the existing code kinda did that already and it's recursive nature required it.
@sjannuz is this modify array behavior specific to shape node types?
@JenusL, about the RGBA_denoise problem, I can't focus on the problem at the moment. Are you able to provide any simple repro code ?
@caron , if you get and set the whole array all the times, you should be ok. What you can't do is modifying an array after setting it (only once) on a node. For that, you now need to map/unmap the array, there are dedicated API calls.
@sjannuz aah, i see... i overlooked that in the simple repo program you posted. in your example code... get array, set array, then modify array little by little, instead of getting, modifying everything, then setting at the end.
@JenusL my operator code should be ok then.
Ok so reading the 5.0.0.0 release notes I can se two changes that seems to have caused these issues:
AtArray
is now opaque: Any operation involvingAtArray
objects has to be done through its new API, and not accessing its struct members which are now hidden. (#4082)
AiArrayAllocate()
does not initialize memory: When anAtArray
is created withAiArrayAllocate()
, its contents are no longer initialized to zero and are instead left uninitialized. This results in slightly faster code. (#4507)
The second one must have been what caused test_0025 to fail.
Now what I find strange about this is how come the code worked all the time until Arnold 5.3...
Fixed the AiNodeSetArray()
issues in https://github.com/Autodesk/sitoa/pull/68/commits/0205f7a645589d2ff8344b3745e25c28cab94ebb
I only ran the failed tests and they all pass now :)
@sjannuz No worries on the denoising. I have some ideas I'm gonna try and I can always talk to support about it if that doesn't work.
So the problem with Optix Denoiser and GPU might be the fact that GPU only supports one filter at a time for all AOVs. When rendering, SItoA sends both RGBA
and RGBA_denoise
with separate filters to the display driver.
This fails, probably because RGBA_denoise
will use the output filter instead of the optix filter.
So I tried to send only RGBA_denoise
to the display driver and then Optix Denoiser works correctly.
Will fix that right away...
Optix Denoiser with GPU added in https://github.com/Autodesk/sitoa/pull/68/commits/87d07f45cf8eb6d9b93f36b4ba4a5ad44ff3c91c
I'm going to run the complete test suite now and if everything is fine I feel like it's time for release maybe. What do you guys think?
Complete testsuite PASSED!
Testing on a older workstation where the GPU is of older model and not even supported in Arnold 5.3 anymore (Kepler based) and I get instant crash when rendering in Softimage. Investigating...
I also get an instant crash on my older laptop. I can kick, but any render run in Soft crashes immediately.
Yeah... I can't find anything wrong in the SItoA code. Kick, MtoA, and MAXtoA works fine. Here's a backtrace I got when running a test:
****
* Arnold 5.3.0.2 [f7602f75] windows icc-17.0.2 oiio-2.1.0 osl-1.11.0 vdb-4.0.0 clm-1.0.3.513 rlm-12.4.2 optix-6.0.0 2019/04/09 17:16:44
* CRASHED at 00:00:00, pixel (0, 0)
* signal caught: error C0000005 -- access violation
*
* backtrace:
* 0 0x00007ffb64cbbf4e [ai ]
* 1 0x00007ffb64cbb1ef [ai ]
* 2 0x00007ffbdd35667c [KERNELBASE] UnhandledExceptionFilter
* 3 0x00007ffbe091810b [ntdll ] memset
* 4 0x00007ffbe08ffd56 [ntdll ] _C_specific_handler
* 5 0x00007ffbe09146af [ntdll ] _chkstk
* 6 0x00007ffbe0874bef [ntdll ] RtlWalkFrameChain
* 7 0x00007ffbe091341e [ntdll ] KiUserExceptionDispatcher
*
* loaded modules:
* 0x00007ffb649a0000 ai
* 0x00007ffbdd2d0000 KERNELBASE
* 0x00007ffbe0870000 ntdll
****
@sjannuz So, any sugestion of how to proceed with this? Could I talk to core support about it or would I be ignored because Softimage is dead?
Do you have other plugins running on that same machine ?
I have. But I decided to try it again and now it just works.
So what I did differently between these tests is that I started with our in-house Launcher, that sets solidangle_LICENSE
env vars. Then it works fine.
What I had done before was just start Softimage from start menu, with no license env, and then it crashed. So maybe the bug is in how the core checks the DCC for a Arnold license...
Anyway, as far as I know, Softimage doesn't include a Arnold license like Max and Maya does because Soft was killed before AD bought SA, right? So my guess is that somewhere there in that code the core is crashing.
FYI, If you need to render without a license, Skip license check
seems to also solve the crash issue.
I also tried without other plugins loaded, and that crashes as well.
Done some more testing and it's definitely the Autodesk License Manager that crashes.
adlmint.dll
is included with Arnold and in Softimage install. I guess it's the one in Softimage that gets used because it's already loaded into memory. And that causes Arnold to crash if it tries to connect to a Autodesk license server (flexLM).
RLM license connection works fine.
I tried to put the new adlmint.dll
from Arnold in the Softimage folder but then Softimage wont even start.
Is this it? The final nail in the coffin for Softimage? 😢
So I tried SItoA 5.2.0 and that has the same problem. Crashes on my workstation if I try to use Autodesk License Manager. So it's not a new thing in 5.3. I also tried with only SItoA loaded and no other plugins but that didn't work either.
Now I'm home again and can't recreate it at all on my home machine. Can't make it crash. This is very weird.
Good thing with that is that maybe it isn't adlmint.dll
after all but something else in the configuration.
Since it was already there in 5.2.0 and went totally unnoticed I think I will just let this be. I'll just have to figure out what's wrong with my workstation.
I opened a new ticket for the licensing crash. #74
Ticket to track Arnold 5.3 support. I will start to work on this soon.