CGRU / cgru

CGRU - AFANASY
http://cgru.info/
GNU Lesser General Public License v3.0

Blocks with no tasks crash afanasy #545

Closed sjt-rvx closed 2 years ago

sjt-rvx commented 2 years ago

Hi, I'm running v3.2.2 and ran into this fun issue:

I am creating some blocks via Python and was testing adding tasks manually (instead of using setNumeric), but I made an error, so some blocks got neither setNumeric nor any tasks appended. The resulting job.send() crashed the server. Running job.output() gave me this JSON (slightly modified).

{
    "blocks": [
        {
            "capacity": 100,
            "depend_mask": "|RVXGenerateDailies-307012_1210|SystemCommand-307012_1210",
            "environment": {
                "ARNOLD_ROOT": "/path/to/arnold"
            },
            "flags": 1,
            "frame_first": 1,
            "frame_last": 1,
            "frames_inc": 1,
            "frames_per_task": 1,
            "name": "RENDER",
            "parser": "generic",
            "service": "generic",
            "working_directory": "/render/gaffer/dispatcher/afanasy/render_file/000020"
        },
        {
            "capacity": 100,
            "depend_mask": "|ImageWriterCrypto-307012_1210",
            "environment": {
                "ARNOLD_ROOT": "/path/to/arnold"
            },
            "flags": 1,
            "frame_first": 1,
            "frame_last": 1,
            "frames_inc": 1,
            "frames_per_task": 1,
            "name": "RVXGenerateDailies-307012_1210",
            "parser": "generic",
            "service": "generic",
            "working_directory": "/render/gaffer/dispatcher/afanasy/render_file/000020"
        },
        {
            "capacity": 100,
            "depend_mask": "|ImageWriter-307012_1210",
            "environment": {
                "ARNOLD_ROOT": "/path/to/arnold"
            },
            "flags": 0,
            "name": "ImageWriterCrypto-307012_1210",
            "parser": "generic",
            "service": "generic",
            "working_directory": "/render/gaffer/dispatcher/afanasy/render_file/000020"
        },
        {
            "capacity": 100,
            "depend_mask": "|FrameMask-307012_1210",
            "environment": {
                "ARNOLD_ROOT": "/path/to/arnold"
            },
            "flags": 0,
            "name": "ImageWriter-307012_1210",
            "parser": "generic",
            "service": "generic",
            "working_directory": "/render/gaffer/dispatcher/afanasy/render_file/000020"
        },
        {
            "capacity": 100,
            "depend_mask": "|arnold_render-307012_1210",
            "environment": {
                "ARNOLD_ROOT": "/path/to/arnold"
            },
            "flags": 1,
            "frame_first": 1,
            "frame_last": 1,
            "frames_inc": 1,
            "frames_per_task": 1,
            "name": "FrameMask-307012_1210",
            "parser": "generic",
            "service": "generic",
            "working_directory": "/render/gaffer/dispatcher/afanasy/render_file/000020"
        },
        {
            "capacity": 1000,
            "environment": {
                "ARNOLD_ROOT": "/path/to/arnold"
            },
            "flags": 0,
            "name": "arnold_render-307012_1210",
            "parser": "arnold",
            "service": "arnold",
            "tasks": [
                {
                    "command": "/path_to_gaffer execute -script /render/gaffer/dispatcher/afanasy/render_file/000020/render_file.gfr -nodes node_OUTPUT.arnold_render -frames 1001-1001 -context -wedge:index \"0\" -dispatcher:scriptFileName \"'/render/gaffer/dispatcher/afanasy/render_file/000020/render_file.gfr'\" -dispatcher:jobDirectory \"'/render/gaffer/dispatcher/afanasy/render_file/000020'\" -user \"'sveinbjorn'\"",
                    "name": "frame 1001"
                },
                {
                    "command": "/path_to_gaffer execute -script /render/gaffer/dispatcher/afanasy/render_file/000020/render_file.gfr -nodes node_OUTPUT.arnold_render -frames 1002-1002 -context -wedge:index \"0\" -dispatcher:scriptFileName \"'/render/gaffer/dispatcher/afanasy/render_file/000020/render_file.gfr'\" -dispatcher:jobDirectory \"'/render/gaffer/dispatcher/afanasy/render_file/000020'\" -user \"'sveinbjorn'\"",
                    "name": "frame 1002"
                }
            ],
            "tickets": {
                "ARNOLD": 1
            },
            "working_directory": "/render/gaffer/dispatcher/afanasy/render_file/000020"
        },
        {
            "capacity": 100,
            "environment": {
                "ARNOLD_ROOT": "/path/to/arnold"
            },
            "flags": 0,
            "name": "SystemCommand-307012_1210",
            "parser": "generic",
            "service": "generic",
            "working_directory": "/render/gaffer/dispatcher/afanasy/render_file/000020"
        }
    ],
    "host_name": "hraun",
    "hosts_mask": "",
    "name": "render_file",
    "offline": true,
    "pools": {
        "/iceland/renderfarm": 100,
        "/iceland/workstations": 100
    },
    "priority": 99,
    "time_creation": 1655992173,
    "user_name": "sveinbjorn"
}

So I'm guessing that, to reproduce the issue, you could run:

import af

# A single non-numeric block with no tasks appended
block = af.Block('block', 'generic')
job = af.Job('thejob')
job.blocks.append(block)
job.send()

Although I haven't tested this, since I don't want to bring our production Afanasy server down.

lithorus commented 2 years ago

The above doesn't actually crash the server. This will:

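import af

job = af.Job('thejob')

# Non-numeric block with no tasks appended
nonNumBlock = af.Block('non-num block', 'generic')

# Numeric block covering a single frame
numBlock = af.Block('num block', 'generic')
numBlock.setNumeric(1, 1, 1)

# Order matters: the empty non-numeric block comes first
job.blocks.append(nonNumBlock)
job.blocks.append(numBlock)

job.send(verbose=True)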

The crash seems to need a specific combination: a non-numeric block with no tasks placed before a numeric block. The server also complains about blocks with no tasks in other cases, but it doesn't crash unless that combination occurs.

The crash seems to come from parsing the JSON input.

Tip: set Restart=always in your systemd service, which will restart the server after a crash. If you don't want to edit the .service file, look into creating a .service.d/override.conf drop-in file for service overrides.
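
A minimal sketch of such a drop-in follows. The unit name afserver.service and the 5-second delay are assumptions for illustration; substitute whatever unit your installation actually uses.

```ini
# /etc/systemd/system/afserver.service.d/override.conf
# (the unit name "afserver.service" is an assumption)
[Service]
# Restart the server whenever it exits, including after a crash
Restart=always
# Wait a few seconds between restarts (arbitrary value)
RestartSec=5
```

Running `systemctl edit <unit>` creates this file and reloads systemd for you; if you create the file by hand, run `systemctl daemon-reload` afterwards.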

I'm currently looking into adding a check to the af module that runs before the job is sent to the server.

timurhai commented 2 years ago

It should be fixed on the server side too, as jobs can be sent without the Python API.

lithorus commented 2 years ago

I totally agree. However, with a checkJob() in the Python module it's easier to report why a job failed, and more checks can be added later.
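
As a rough illustration of what such a client-side check could look like, here is a sketch that works on the plain dictionary form of a job, using the same keys as the JSON posted above. The function names find_empty_blocks and check_job are hypothetical and are not the actual checkJob() from the af module; treating a block as numeric when it has both frame_first and frame_last is also an assumption based on that JSON.

```python
def find_empty_blocks(job_data):
    """Return names of blocks that have neither a frame range nor any tasks."""
    empty = []
    for block in job_data.get("blocks", []):
        # Assumption: numeric blocks carry frame_first/frame_last keys
        is_numeric = "frame_first" in block and "frame_last" in block
        # A missing or empty "tasks" list both count as "no tasks"
        has_tasks = bool(block.get("tasks"))
        if not is_numeric and not has_tasks:
            empty.append(block.get("name", "<unnamed>"))
    return empty


def check_job(job_data):
    """Raise ValueError naming every block that would be sent with no tasks."""
    empty = find_empty_blocks(job_data)
    if empty:
        raise ValueError("Blocks with no tasks: " + ", ".join(empty))


# Dictionary mirroring the crashing combination described in this thread
job_data = {
    "name": "thejob",
    "blocks": [
        {"name": "non-num block", "service": "generic", "flags": 0},
        {"name": "num block", "service": "generic", "flags": 1,
         "frame_first": 1, "frame_last": 1, "frames_inc": 1},
    ],
}

print(find_empty_blocks(job_data))
```

Calling check_job(job_data) before sending would raise an error that names the offending block instead of letting the job reach the server. This only covers clients that run the check, so it does not replace a server-side fix.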

sjt-rvx commented 2 years ago

Ok, cool. Good to know - I was afraid of testing this any further since I didn't have the chance to work on a non-production server - and for some reason people get annoyed if you keep stopping their work :)