Closed jonahfriedman closed 6 years ago
We're having the exact same issue here: Arnold 5.1.1.0 rendering archives exported via HtoA 3.0.2 from Houdini 16.5.473. Using Cryptomatte release 2.0.0 beta 4.
R111| 00:00:03 276MB WARNING | crypto_asset_filter00: could not set STRING parameter "mode"
R113| 00:00:03 276MB WARNING | crypto_asset_filter01: could not set STRING parameter "mode"
R114| 00:00:03 276MB WARNING | crypto_asset_filter02: could not set STRING parameter "mode"
R116| 00:00:03 276MB WARNING | crypto_object_filter00: could not set STRING parameter "mode"
R117| 00:00:03 276MB WARNING | crypto_object_filter01: could not set STRING parameter "mode"
R118| 00:00:03 276MB ERROR | Cryptomatte not set up correctly, crypto_object_filter02 rank not set.
R119| 00:00:03 276MB | Cryptomatte manifest created - 0.004000 seconds
R120| 00:00:03 276MB WARNING | [kick] render aborted due to earlier errors
Ironically, we just started having the same issue here too. I don't know what's causing it and can't reproduce it without sending a whole shot to the farm a few times.
First however, the warning about the "mode" parameter was fixed in a recent pull request and that seems to have mitigated it somewhat. After that, I realized that you could infer the rank from the name of the filter so once that's added in, it takes care of almost all of the remaining cases. While it's a dirty solution, I'm not sure what else to do for the time being.
What I'd really like is a reliable reproduction to submit to Solid Angle, which so far has remained elusive. I can kick off ass files in a loop for hours without encountering it. Maybe I just haven't found the right ass file for this. The error happens before bucket rendering starts, so rendering a tiny number of black pixels should theoretically show the defect.
I've pushed up the hacky fix to this in 8aae29f in my fork. If you'd like to test it @mheberlein, please feel free. I have seen the error on on line 86 on our farm, but only once. In that case the name of the filter was empty as well.
I'm beginning to suspect this is caused by parallel node init.
Thanks Jonah, I compiled your fork and am testing it on our renderfarm now, with and without parallel node init.
Thanks Michael, I'm very interested in the results.
The recent Arnold 5.1.1.1 seems to have fixed some parallel init bugs, so that might be another thing to test.
Good news! After rendering a few thousand frames with your fix and using Arnold 5.1.1.1, I didn't see a single Crypeomatte "rank not set" or "mode parameter" error.
Ah great. Is that with parallel node init on?
Also if you are able to search the logs, have you seen the message, "Cryptomatte Filter: Rank not set, inferred from name instead. "?
Just today I just found this (parallel node init on) - it appeared only once and the same client renders other jobs with the same configuration without errors.
| Arnold 5.1.1.1 [3849b993] windows icc-17.0.2 oiio-1.7.17 osl-1.9.0 vdb-4.0.0 clm-1.0.3.513 rlm-12.2.2 2018/06/26 21:12:06
| 1 x Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (6 cores, 12 logical) with 65373MB
[...]
| loading plugins from [...]\hou_htoa\htoa-3.0.3_houdini-16.5.496\arnold\plugins ...
| alembic_proc.dll: alembic uses Arnold 5.1.1.1
| cryptomatte.dll: cryptomatte uses Arnold 5.1.1.1
| cryptomatte.dll: cryptomatte_filter uses Arnold 5.1.1.1
| cryptomatte.dll: cryptomatte_manifest_driver uses Arnold 5.1.1.1
| htoa_mandelbulb_proc.dll: mandelbulb uses Arnold 5.1.1.1
| loaded 5 plugins from 3 lib(s) in 0:00.00
[...]
| there are 6 lights and 4 objects:
| 1 persp_camera
| 4 quad_light
| 2 skydome_light
| 7 image
| 1 utility
| 3 standard_surface
| 1 bump2d
| 1 normal_map
| 17 driver_exr
| 1 gaussian_filter
| 2 closest_filter
| 1 variance_filter
| 2 polymesh
| 1 list_aggregate
| 1 alembic
| 1 cryptomatte
[...]
| node initialization done in 0:00.04 (multithreaded)
| updating 49 nodes ...
| /obj/ENV_DIFF: skydome_light using 1 sample, 2 volume samples
| /obj/QUAD_BACK: quad_light using 1 sample, 2 volume samples
| /obj/QUAD_SPEC_DROP: disabled quad_light
| /obj/QUAD_BACK1: quad_light using 1 sample, 2 volume samples
| /obj/ENV_TRANS: skydome_light using 1 sample, disabled for volumes
ERROR | Cryptomatte Filter: rank not set, and could not be inferred from name.
| /obj/QUAD_TOP_PLANTS: disabled quad_light
| Cryptomatte manifest created - 0.000000 seconds
| scene bounds: (-1.36345279 0.043862462 -2.38756108) -> (1.35473156 0.991993845 1.43071795)
| node update done in 0:00.00 (multithreaded)
[...]
WARNING | [kick] render aborted due to earlier errors
WARNING | failed to launch bucket workers: received abort signal
I haven't really looked into the Cryptomatte internals yet and don't know what the rank is about or which name it's trying to derive it from. Anything I could/should test with that scene to help you debug it?
Sure, let me explain that. Each Cryptomatte is made of 3 other AOVs, which each have their own filters. These have "Ranks" set. So if you have "crypto_object", it creates:
crypto_object00
- with crypto_object_filter00
set to rank 0crypto_object01
- with crypto_object_filter01
set to rank 2crypto_object02
- with crypto_object_filter02
set to rank 4So each of these filters needs to have its "rank" set, so that it knows what ranks to output, an integer parameter. For some reason, rarely, these ranks just do not get set. This was reported, so I set the default rank value to -1 on the filter, and created an error condition if the filter starts filtering with that rank.
Then this started happening on Psyop's farm around the same time as parallel node init became default on (suspicious). So I updated things such that if the rank is not set, it tries to infer it from the name. It logs a warning in that case , and continues on.
By explaining all this, I'm suddenly realizing that maybe when the filter is created, the filter's node_update is running before the name and other properties on it are actually set, and maybe if the rank == -1 condition is detected the best thing to do would be to sleep a fraction of a second and then check again, or perhaps use the same lock used when setting up cryptomatte to synchronize running node_update on the filter?
Sounds good, or wait until the evaluate function is called the first time and do a lazy init? Pretty sure Solidangle can help figure out why they're sometimes not set?
I'm not sure about the lazy init, I think it'd be ok as far as data races go but it would be an un-synchronized write.
I wanted to figure out a reproduction before reporting it, so far the best reproduction involves a render farm. However I think this is a plausible theory of why this is happening, so maybe it's time.
I wonder if the best way might be to call AiNodeSetLocalData
where the filter is created rather than setting parameters on it.
Hi @mheberlein, after consulting with Angel at Solid Angle, I have updated my fork of the repo. It now just shares the same lock with Cryptomatte. I think this should totally fix it, however, I'm not actually able to test this with a render farm right now.
Michael, do you think you could help test it?
An update here: We're moving the repo to Psyop/CryptomatteArnold, this issue should be fixed, and there are builds posted for Linux and Windows. OS X should be up soon, at which time we'll replace this repo's readme.md with a note that it has moved.
I think this issue should now be solved. If not, feel free to reopen it (perhaps over on the other repo).
It's been reported to me by Dan Jahnel that occasionally (~3 times in a 250 frame sequence) it happens that frames come out wrong. I've seen these frames and what happens is, for example, that crypto_asset02 comes out the same as crypto_asset00, which means the rank in the filter was not set correctly. This is supposed to happen here when the filter is created.
This is happening in Arnold 5.1. I can't reproduce the issue myself, but I've added error detection for this state in 3d41ef5, in the PR for beta 4. If the filter's rank is left default (this should not happen), it'll error.
I'm putting this issue here to see if other users have it as well. Pre-beta 4 it will come up as a cryptomatte where selecting something results in double alpha values, and in Beta 4 it'll come up as an error message in your Arnold log that says something like: