Autodesk / sitoa

Arnold plugin for Softimage
Apache License 2.0
33 stars 16 forks source link

Support Arnold 5.3 #63

Closed JenusL closed 4 years ago

JenusL commented 5 years ago

Ticket to track Arnold 5.3 support. I will start to work on this soon.

JenusL commented 5 years ago

You can follow along here https://github.com/JenusL/sitoa/tree/feat/a5.3.0

JenusL commented 5 years ago

Ok so I just added the GPU functionality and hit a problem... closure shader isn't supported on the GPU!

# WARNING : [arnold] [gpu] shader name unknown to OptixProgramCache: closure
# WARNING : [arnold] [gpu] ignoring shader Sources.Materials.DefaultLib.Material.closure.SItoA.1000.1, node not supported on the GPU

Even before I encountered this, I had plans to skip the shader from being exported to Arnold, so that SItoA scenes would be more interchangeable with other packages. So I guess I have to fix that first. Unless there's a way to get custom shaders to render on the GPU. @caron and @sjannuz What do you think?

caron commented 5 years ago

sounds like we need to skip it... so if there are no custom nodes supported on the GPU then we got many other sitoa shaders which aren't going to work too.

sjannuz commented 5 years ago

Yes, we need to skip it. It's just a passthrough. Custom (SItoA-specific) c++ can't work right now. In the other plugins, we have ways to signal the user a given node is not supported.

JenusL commented 5 years ago

Ok thanks for the info. I've also seen that the other plugins now have translators from DCC shaders to Arnold shader trees. That would be something to consider as well an much of the code can probably be lifted from MtoA. But I'll just skip the closure as a first step so we can get a new version out the door.

That said, I've run in to some other issues as well… When I added Optix Denoiser I had some real struggle to get the render region to show the updates from Optix between each pass. Sort of the same problem is still there and can be seen on the lower IPR levels when rendering with just CPU. It's like Arnold is too fast and doesn't get Softimage enough breathing room to actually display the buckets that just got rendered. When rendering with GPU nothing gets showed at all until later in the rendering or if it gets stopped somehow, like just rotating the camera. Then it will show all the rendered pixels from before I rotated the camera. With CPU renderer, decreasing bucket size makes it worse, while increasing bucket size makes it better. This leads me to think that it's somehow related to the speed / number of calls to NewFragment() that's bogging down Softimage. Maybe we could use a timer or something to limit the calls or probably a better way is when current IPR / progressive frame is done, wait a tiny bit so that Softimage can display the result before going to the next progression. We're talking milliseconds here so shouldn't be noticible. I would really like some help in figuring this out the best way. Any relevant code is in DisplayDriver.cpp and RenderTile.cpp and maybe in RenderInstance.cpp as well.

JenusL commented 5 years ago

@sjannuz I have a question… One of the new features in 5.3 is this:

Denoising albedo: A built-in AOV with an albedo optimized for denoising has been added as denoise_albedo. Both the Arnold (noice) and OptiX™ denoisers will use the new AOV if present. (#6955, #8168, #8069)

Right now we export diffuse_albedo, N and Z for use in noice. Should I skip diffuse_albedo and export the new denoise_albedo instead? Or do noice want both? MtoA hasn't changed this yet so I don't have any reference to go on.

JenusL commented 5 years ago

Just ran the testsuite for the first time. I think every test with a shape instanced on a strand is failing. Two other tests crashed. Total of 10 failed tests. Will investigate tomorrow.

rmv commented 5 years ago

For noice, just replace diffuse_albedo with denoise_albedo

The executable accepts both, and will use denoise_albedo if found, and fall back to the older diffuse_albedo otherwise.

JenusL commented 5 years ago

Ok so I'm closing in on the reason why some strand tests fail. Bug seems to be in varying data on strands. https://github.com/Autodesk/sitoa/blob/52a0a4b35f51bccb6223a8206d04b40edc80071c/plugins/sitoa/loader/ICEHelpers.cpp#L2372-L2402

First AiArrayAllocate is called. Then we fill the array with data but we skip some array indices: A strand of 3 segments has 4 values in Arnold. I don't know why that is yet. Anyway, that 4th value is never set with AiArraySetX, but just left unset. Arnold 5.2 must have handled that ok. Maybe AiArrayAllocate in 5.2 filled it with zeroes, but in 5.3 I get random values sometimes, zeroes other times. I have to investigate further, just wanted to give a heads up. @sjannuz Anything changed with Arrays in 5.3?

JenusL commented 5 years ago

After further investigation, I found that there's nothing really wrong in SItoA or Arnold. The strand scene I'm looking at now (test_0025) has 4 strand segments on each strand. Then the "first segment fix" is added only to StrandPosition. So now StrandPosition has 5 segments but all other Strand attributes has only 4 segments. If all 4 Strand values are the same, it's considered uniform and exported as a constant to Arnold. Problem is with varying data. Then SItoA only exports the 4 values, skips the 5th because it doesn't exist in Softimage, then continue on to the next Strand. That leaves every 5th value in the varying array uninitilazed in Arnold, producing random values, NaNs, etc in that 5th value. My idea of a fix here is to abort the render and log an informative error message if the Strand* attribute exported as varying doesn't have the same array length as StrandPosition. Test scenes will probably fail after this so those have to be updated. Another way to fix it would be to just repeat the last value, or interpolate the array to the larger size. @sjannuz and @caron What do you think? Which fix would you prefer?

caron commented 5 years ago

@JenusL i haven't had a chance to wrap my head around the issue and i am not sure i will for a few more days.

in the mean time, can we cross reference the old trac ticket database to see if there is more detail as to what this code is doing and if/when it was "fixed" to resolve some bug or previous version arnold behavior?

also, when i ran the testsuite last i was passing all but 8 tests and only 1 of the strand tests failed (test_0195) but as a crash. i would hate to have to edit this area of the code if we don't have to.

sjannuz commented 5 years ago

I don't think anything has changed in Arnold related to arrays or curves. Unfortunately, I'll be fairly busy for the next week too, to be able to help you. Maybe if you could export a simple ascii .ass and compare kicking it with 5.2 and 5.3, we could ask the core guys about that.

JenusL commented 5 years ago

@caron Well the code is exporting IceAttributes to Arnold. Either way, it needs to be fixed. I will fire up old Mental Ray and see how that handles different sizes of the Strand* arrays.

@sjannuz I don't think I need your help, just your input on behavior when Strand attributes have different array sizes. I will se how Mental Ray does it and get back here. An .ass wouldn't help here as the error is in the ass-file as well. To be clear, it's not a bug in Arnold! The "bug" is that we don't handle Strand attributes that has less samples than StrandPosition. We have just been very lucky that it hasn't been caught in the tests in earlier Arnold versions. Beacuse I can now replicate it in earlier Arnold versions as well, but it's not showing as often.

Here's a very simple python script that simulate the issue in SItoA. Run it several times, and you will see that the skipped values will be different in the .ass-file.

from arnold import *

AiBegin()
AiMsgSetConsoleFlags(AI_LOG_INFO)

c = AiNode('curves', 'test_curve')

# two strands, each with 5 points
strands = [5, 5]
AiNodeSetArray(c, 'num_points', AiArray(len(strands), 1, AI_TYPE_UINT, *strands))

# strand data that has only 4 samples per strand
stranddata = [list(range(1,5)) for strand in strands]

# we still allocate the array to have 5 samples per strand
strandattrib = AiArrayAllocate(sum(strands), 1, AI_TYPE_FLOAT)

# this loop will naturally skip the nonexisting 5th element leaving it uninitialized
for offset, strandsegments in enumerate(strands):
    for i, data in enumerate(stranddata[offset]):
        AiArraySetFlt(strandattrib, i+offset*strandsegments, data)

AiNodeDeclare(c, 'StrandAttribute', 'varying FLOAT')
AiNodeSetArray(c, 'StrandAttribute', strandattrib)
AiASSWrite('varying_strandattribute.ass', AI_NODE_ALL, False, False)

AiEnd()

Again, this behavior is totally fine and just makes sense. So nothing is wrong in Arnold. We just need to fix the IceAttributes exporter to handle varying data on Strands in the cases where a StrandAttribute doesn't have the correct number of samples.

caron commented 5 years ago

@JenusL can you share a repo scene file with me? just send to my email.

test_0025 didn't fail for me previously and i ran it again on your `feat/a5.3.0' branch to confirm and it still doesn't fail. also, looking closer at test_0195, it is just failing on the noicon.pic/texture search path issue.

do we know exactly why StrandPosition is getting the extra point in it? i have a guess but i think we need to be sure, then we can decide how or if we should change anything.

JenusL commented 5 years ago

@caron Well test_0025 and the python script above is pretty good repos. It doesn't happen to me all the time either. You have to render, destroy, render, destroy several times for it to show up. Or in the case of the python script, run that several times as well. The uninitialized memory allocated by Arnold will change. Maybe it's Windows 10 that uses the memory differently than before. I have tested both on two separate machines and both Arnold 5.2.2.1 and Arnold 5.3 shows the same behavior. But in 5.2 I have to refresh more times for it to show up. So it's just a coincident that I caught this bug now when somehow most tests showed this behavior for me at the same time. Why StrandPosition has 5 samples while the other Strand attributes only have 4 is pretty clear in the scene of test_0025 here: arnold_first_segment So in other words, the test itself is a invalid scene. And all we might need to do is fix the tests. But still, SItoA doesn't do any checks at all and always assumes Strand attributes of same size as StrandPosition.

caron commented 5 years ago

ok, this reminded me of what the reason for this was. mental ray and arnold have different behavior for the particle position. mental ray always connects the particle position to the first point in StrandPosition

from 10 years ago! https://trac.solidangle.com/sitoa/ticket/310

so let's talk about this in #70

JenusL commented 5 years ago

Ok I've fixed the strand attributes issue now in #71 I had hoped that would fix all tests that failed for me but I'm afraid it didn't. It only fixed test_0025. I still have tests 56, 77, 104, 105, 119, 120 and 236 failing. They all have instanced meshes on ICE strands or hair objects. I've spent a long time trying to figure out what is wrong in the code but can't find anything wrong. If I compile with 5.2.2.1 all tests works, with 5.3.0.1 they all fail. The only thing I can think of is that the code for instanced shapes on strands and hair uses AiNodeClone() and then replaces vlist and nlist on the clone to create the instances. So I'm like 80% sure that the problem is there and that it's a bug in Arnold core. And since it's an API function, an .ass file wont help to figure out what's wrong. @sjannuz Any known core issues about this?

caron commented 5 years ago

I just need to chime in quickly and say those tests didn't fail for me earlier. We need to be sure we have the same failures. I am running the entire suite again using the latest branch.

JenusL commented 5 years ago

Please run the tests several times or open the tests in Softimage and render them several times. They doesn't always fail for me, but most of the times they do. I've tested on the two machines I have available to test on. Both are running the latest Win 10 (1809).

caron commented 5 years ago

ok i was able to get a lot of failures now with feat/a5.3.0 branch and 5.3.0.1 core.... and i also rolled the code back to master branch with 5.3.0.1 then ran the testsuite and i also got a lot failures.

so glancing at the release notes i noticed this in 5.3...

and in this in the bugfix release 5.3.0.1...

so maybe you're right, there is a new behavior or issue with how sitoa is updating those vlist and nlist parameters. maybe we can reproduce the issue with a simple python script. i am going to play around with that and see if i can learn anything. i also just want to understand what the new change does in a isolated place.

JenusL commented 5 years ago

Good to know you can reproduce as well. I don't think I've ran the tests with 5.3.0.0 so that could be done, just to see what happens. Also, let's not do any changes on this until we hear back from @sjannuz or anyone else at AD, so we know what's really going on.

JenusL commented 5 years ago

I just compiled with 5.3.0.0 and it has the same issues. So it's not any of the changes made in 5.3.0.1 that's causing it.

JenusL commented 5 years ago

Arnold 5.3.0.2 was released today and that is also failing the tests.

JenusL commented 5 years ago

I noticed another issue as well... I can't get OptiX Denoiser to work with GPU renderer. AiOutputIteratorGetNext() never gets RGBA_denoise when rendering with GPU. @sjannuz Is this a known limitation that isn't documented? Or is it a completely different mechanism to use OptiX Denoiser with GPU?

sjannuz commented 5 years ago

I could repro the cloning issue in a standalone with 5.3, I'll check with the core guys on Monday. The problem seems to derive from calling AiNodeSetArray and THEN modifying the array content. Doing the other ways around (fully setting the array and then assigning it) seems to work.

Here is a code snippet, cloning a box after loading it from a .ass:

void main(int argc, char **argv)
{
    AiBegin(AI_SESSION_BATCH);
    AiMsgSetConsoleFlags(AI_LOG_ALL);
    AiASSLoad(argv[1]);

    AtNode* box = AiNodeLookUpByName("/box");
    AtNode* box2 = AiNodeClone(box);

    AtArray* vlist = AiArrayCopy(AiNodeGetArray(box, "vlist"));
    AiNodeSetArray(box2, "vlist", vlist);

    uint32_t nb_elements = AiArrayGetNumElements(vlist);
    for (uint32_t i = 0; i < nb_elements; i++)
    {
        AtVector v = AiArrayGetVec(vlist, i);
        v.x += 2.0f;
        AiArraySetVec(vlist, i, v);
    }

    AiRender();
    AiEnd();
}
JenusL commented 5 years ago

Thank you very much for taking time to test this @sjannuz! So let's wait until we hear back from the core team and then decide what to do.

Did you have any info about my last comment here? About the Optix Denoiser when rendering on the GPU?

sjannuz commented 5 years ago

About the above code, since 5 we should not modify an array after having set it on the node. The fact that it worked it was just kind of luck.

JenusL commented 5 years ago

Ok good to know. I will fix it in SItoA then. Now when we know what the error is it should be pretty easy to fix.

caron commented 5 years ago

@JenusL my operator code, which re uses LoadShaderParameter, uses this set, get, modify technique on arrays. i put in a detailed commit message outlining that i did that...

599d1d87baf224b645093264f9ac0a8f4941d83d

i didn't actually want to do that but the existing code kinda did that already and it's recursive nature required it.

@sjannuz is this modify array behavior specific to shape node types?

sjannuz commented 5 years ago

@JenusL, about the RGBA_denoise problem, I can't focus on the problem at the moment. Are you able to provide any simple repro code ?

sjannuz commented 5 years ago

@caron , if you get and set the whole array all the times, you should be ok. What you can't do is modifying an array after setting it (only once) on a node. For that, you now need to map/unmap the array, there are dedicated API calls.

caron commented 5 years ago

@sjannuz aah, i see... i overlooked that in the simple repo program you posted. in your example code... get array, set array, then modify array little by little, instead of getting, modifying everything, then setting at the end.

@JenusL my operator code should be ok then.

JenusL commented 5 years ago

Ok so reading the 5.0.0.0 release notes I can se two changes that seems to have caused these issues:

  • AtArray is now opaque: Any operation involving AtArray objects has to be done through its new API, and not accessing its struct members which are now hidden. (#4082)

  • AiArrayAllocate() does not initialize memory: When an AtArray is created with AiArrayAllocate(), its contents are no longer initialized to zero and are instead left uninitialized. This results in slightly faster code. (#4507)

The second one must have been what caused test_0025 to fail.

Now what I find strange about this is how come the code worked all the time until Arnold 5.3...

JenusL commented 5 years ago

Fixed the AiNodeSetArray() issues in https://github.com/Autodesk/sitoa/pull/68/commits/0205f7a645589d2ff8344b3745e25c28cab94ebb I only ran the failed tests and they all pass now :)

@sjannuz No worries on the denoising. I have some ideas I'm gonna try and I can always talk to support about it if that doesn't work.

JenusL commented 5 years ago

So the problem with Optix Denoiser and GPU might be the fact that GPU only supports one filter at a time for all AOVs. When rendering, SItoA sends both RGBA and RGBA_denoise with separate filters to the display driver. This fails, probably because RGBA_denoise will use the output filter instead of the optix filter. So I tried to send only RGBA_denoise to the display driver and then Optix Denoiser works correctly. Will fix that right away...

JenusL commented 5 years ago

Optix Denoiser with GPU added in https://github.com/Autodesk/sitoa/pull/68/commits/87d07f45cf8eb6d9b93f36b4ba4a5ad44ff3c91c

I'm going to run the complete test suite now and if everything is fine I feel like it's time for release maybe. What do you guys think?

JenusL commented 5 years ago

Complete testsuite PASSED!

JenusL commented 5 years ago

Testing on a older workstation where the GPU is of older model and not even supported in Arnold 5.3 anymore (Kepler based) and I get instant crash when rendering in Softimage. Investigating...

sjannuz commented 5 years ago

I also get an instant crash on my older laptop. I can kick, but any render run in Soft crashes immediately.

JenusL commented 5 years ago

Yeah... I can't find anything wrong in the SItoA code. Kick, MtoA, and MAXtoA works fine. Here's a backtrace I got when running a test:

****
* Arnold 5.3.0.2 [f7602f75] windows icc-17.0.2 oiio-2.1.0 osl-1.11.0 vdb-4.0.0 clm-1.0.3.513 rlm-12.4.2 optix-6.0.0 2019/04/09 17:16:44
* CRASHED at 00:00:00, pixel (0, 0)
* signal caught: error C0000005 -- access violation
*
* backtrace:
*  0 0x00007ffb64cbbf4e [ai        ]
*  1 0x00007ffb64cbb1ef [ai        ]
*  2 0x00007ffbdd35667c [KERNELBASE] UnhandledExceptionFilter
*  3 0x00007ffbe091810b [ntdll     ] memset
*  4 0x00007ffbe08ffd56 [ntdll     ] _C_specific_handler
*  5 0x00007ffbe09146af [ntdll     ] _chkstk
*  6 0x00007ffbe0874bef [ntdll     ] RtlWalkFrameChain
*  7 0x00007ffbe091341e [ntdll     ] KiUserExceptionDispatcher
*
* loaded modules:
*    0x00007ffb649a0000  ai
*    0x00007ffbdd2d0000  KERNELBASE
*    0x00007ffbe0870000  ntdll
****
JenusL commented 5 years ago

@sjannuz So, any sugestion of how to proceed with this? Could I talk to core support about it or would I be ignored because Softimage is dead?

sjannuz commented 5 years ago

Do you have other plugins running on that same machine ?

JenusL commented 5 years ago

I have. But I decided to try it again and now it just works. So what I did differently between these tests is that I started with our in-house Launcher, that sets solidangle_LICENSE env vars. Then it works fine. What I had done before was just start Softimage from start menu, with no license env, and then it crashed. So maybe the bug is in how the core checks the DCC for a Arnold license... Anyway, as far as I know, Softimage doesn't include a Arnold license like Max and Maya does because Soft was killed before AD bought SA, right? So my guess is that somewhere there in that code the core is crashing.

JenusL commented 5 years ago

FYI, If you need to render without a license, Skip license check seems to also solve the crash issue.

JenusL commented 5 years ago

I also tried without other plugins loaded, and that crashes as well.

JenusL commented 5 years ago

Done some more testing and it's definitely the Autodesk License Manager that crashes. adlmint.dll is included with Arnold and in Softimage install. I guess it's the one in Softimage that gets used because it's already loaded into memory. And that causes Arnold to crash if it tries to connect to a Autodesk license server (flexLM). RLM license connection works fine.

I tried to put the new adlmint.dll from Arnold in the Softimage folder but then Softimage wont even start. Is this it? The final nail in the coffin for Softimage? 😢

JenusL commented 5 years ago

So I tried SItoA 5.2.0 and that has the same problem. Crashes on my workstation if I try to use Autodesk License Manager. So it's not a new thing in 5.3. I also tried with only SItoA loaded and no other plugins but that didn't work either.

Now I'm home again and can't recreate it at all on my home machine. Can't make it crash. This is very weird. Good thing with that is that maybe it isn't adlmint.dll after all but something else in the configuration. Since it was already there in 5.2.0 and went totally unnoticed I think I will just let this be. I'll just have to figure out what's wrong with my workstation.

JenusL commented 5 years ago

I opened a new ticket for the licensing crash. #74