afserver breaks job, which causes 'Segmentation fault (core dumped)' crashes

eberrippe commented 1 year ago

Hello @timurhai,

We recently updated to version 3.3.0 and encountered an issue with our "testjob" that was run in January. The job was left on the farm and two days ago the Afserver crashed with the message "Segmentation fault (core dumped)". Despite our attempts to restart the server, it kept crashing.

Upon further investigation, we discovered that the "testjob" was causing the issue, and moving it out of the /usr/tmp/afanasy/jobs/0 directory allowed the server to start normally. Additionally, we found the following information:

The server can be started with the testing job inside, but it will crash as soon as a render connects to the Afserver.
Removing the block content from the attached data.json file resolves the issue.
Changing the block flag to 33 also prevents the server from crashing.
Copying a block.json file from another block also resolves the issue.

We are concerned about this issue and would appreciate your help in resolving it. We have developed an "afjobsFixer" script that checks all jobs in the Afanasy job structure. If the job has no tasks folder, no block.json file, and the data.json file has a flag of 32, we change it to flag 33. This fixes any broken jobs before we restart the server.
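The check described above could look roughly like the following. This is only a sketch, not the actual afjobsFixer script: the function names are hypothetical, and it assumes that each job lives in its own directory containing a data.json, that tasks data would appear as a "tasks" folder or a "block.json" file somewhere below it, and that bit 0 of "flags" is the numeric bit (so 32 becomes 33).

```python
import json
import os

NUMERIC_BIT = 1  # assumption: bit 0 of "flags" marks a numeric block


def has_tasks_data(job_dir):
    """Return True if a 'tasks' folder or a 'block.json' file exists under job_dir."""
    for _root, dirs, files in os.walk(job_dir):
        if "tasks" in dirs or "block.json" in files:
            return True
    return False


def fix_job(job_dir):
    """Set the numeric bit on all blocks of a job that has no tasks data.

    Returns True if data.json was rewritten, False otherwise.
    """
    data_path = os.path.join(job_dir, "data.json")
    if not os.path.isfile(data_path) or has_tasks_data(job_dir):
        return False
    with open(data_path) as f:
        job = json.load(f)
    changed = False
    for block in job.get("blocks", []):
        flags = block.get("flags", 0)
        if not flags & NUMERIC_BIT:
            block["flags"] = flags | NUMERIC_BIT  # e.g. 32 -> 33
            changed = True
    if changed:
        with open(data_path, "w") as f:
            json.dump(job, f)
    return changed
```

It would be run over every job directory in the store while the server is stopped, for example `fix_job("/usr/tmp/afanasy/jobs/0/<job dir>")`, ideally after backing up the store.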

We have attached all the information we have regarding this issue. Thank you in advance for your assistance.

Afserver log:
Fri 24 Feb 10:10.42: INFO Listening 51000 port...
Fri 24 Feb 10:10.56: Render: stationxxxx@eberrippe[7] unix linux 64 3.3.0 127.0.0.1 ON
Segmentation fault (core dumped)

Structure of data.json: [image: jobStructure]

Content of data.json:

{
    "name":"test_py_cmd-1",
    "id":3,
    "priority":201,
    "custom_data":"{\"xx\": \"xxx\", \"xx\": \"xxx\"}",
    "solve_method":"solve_order",
    "solve_need":"solve_tasksnum",
    "serial":2,
    "user_name":"xxx",
    "host_name":"stationxxx",
    "st":1,
    "state":" RDY",
    "ignorenimby":true,
    "ignorepaused":true,
    "branch":"/",
    "user_list_order":0,
    "time_creation":1673028727,
    "time_started":1676974801,
    "blocks":[
        {
            "command":"python -c 'print(1)'",
            "tasks_name":"",
            "parser":"generic",
            "working_directory":"/tmp",
            "name":"test_py",
            "service":"generic",
            "capacity":1000,
            "flags":32,
            "tasks_num":1,
            "frame_first":0,
            "frame_last":0,
            "frames_per_task":1,
            "frames_inc":1,
            "time_started":1673028727,
            "time_done":1673028734,
            "hosts_mask":"stationxxxx.*",
            "block_num":0
        }
    ]
}

timurhai commented 1 year ago

Hi! By playing with the first flags bit, you switch the block to numeric. https://github.com/CGRU/cgru/blob/master/afanasy/src/libafanasy/blockdata.h#L53 Yes, non-numeric jobs should contain tasks information. Numeric ones should not; their tasks are generated on demand (on the fly). So such jobs really are broken. How can such jobs appear? After what?
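To make the bit arithmetic concrete, here is a small sketch. It assumes, based on the comment above, that the numeric flag is bit 0 of the block's "flags" value; the constant and function names are made up for illustration and are not taken from blockdata.h.

```python
NUMERIC_BIT = 1 << 0  # assumption: the "first flags bit" is the numeric flag


def is_numeric(flags):
    """Return True if the numeric bit is set in a block's flags value."""
    return bool(flags & NUMERIC_BIT)


print(is_numeric(32))    # False: 0b100000, numeric bit not set
print(is_numeric(33))    # True:  0b100001, numeric bit set
print(32 | NUMERIC_BIT)  # 33: setting the bit turns 32 into 33
```

This is why changing the flag from 32 to 33 in data.json makes the server treat the block as numeric and stop looking for stored tasks data.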

eberrippe commented 1 year ago

Hi @timurhai , thanks for your answer! :)

We have no clue how this happened. The job had already been on the farm for a month, and then suddenly: "Segmentation fault".

Could you patch Afanasy at some point so that it checks for such broken jobs when it loads them?

timurhai commented 1 year ago

Strange situation, I have never heard of it. So you send a job with a numeric block. Then for some reason the server stops (crashes), and its store becomes broken? Maybe it happens only with certain kinds of jobs? Can you check your store on a working server to see when the store breaks? Maybe some kinds of jobs are always stored broken (and that is the bug).

Yes, I can write a patch that forces a job to be numeric if it has no tasks data.

But it would be better to find the cause; there is probably some other bug.

eberrippe commented 1 year ago

Awesome Timur, thanks.

Yeah I can totally understand that we have to find the actual reason.

It was a generic job, just echoing a system environment variable. As mentioned, it had already run 50 times before.

Also it never happened before and since then it never happened again.

You can reproduce the error as mentioned above super easily, by copying the content of the data.json.

Honestly, my best guess as to how we broke it is that we might have restarted it too often from different monitors, causing some kind of multiple-access collision. Do you think that could be possible?

How can I check the store, to give you the information?

As mentioned, if I copy the job into the jobs directory /usr/tmp/afanasy/jobs/0, I can start the server. But as soon as I connect a render to the server, the segmentation fault happens. So I would guess it's a scheduler problem?

Best, Jan

sebastianelsner commented 1 year ago

It seems this was a one time thing. So if it is not happening again "soon", just let it go (like: here)

timurhai commented 1 year ago

Hi!

I have just added a validity check that runs when a job is read from the store. The check was originally written for incoming new jobs, which can be constructed invalidly (a mismatch between the numeric flag and the tasks data). Now such jobs will be deleted from the store.

But I still do not know how a new job can pass this test on registration and later fail it when read from the store. The job was initially numeric, so this is not a case where tasks data was not stored for some reason. Nor is it a case of the server hanging while storing, as the JSON file has no syntax errors. This is a case where the numeric flag was somehow lost. Either there is some bug in the server, or its store was modified by something else.

Anyway, stored job checking is needed.
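As an illustration only, the kind of consistency rule discussed in this thread can be sketched in Python. This is not the server's actual C++ check: it assumes the numeric flag is bit 0 of "flags", that a non-numeric block carries its tasks in a "tasks" list in the job JSON, and that a numeric block carries none. The function names are hypothetical.

```python
NUMERIC_BIT = 1 << 0  # assumption: bit 0 of "flags" marks a numeric block


def block_is_consistent(block):
    """A block is consistent if it is numeric XOR it carries tasks data."""
    numeric = bool(block.get("flags", 0) & NUMERIC_BIT)
    has_tasks = bool(block.get("tasks"))
    return numeric != has_tasks


def job_is_valid(job):
    """A job is valid only if every one of its blocks is consistent."""
    return all(block_is_consistent(b) for b in job.get("blocks", []))


# The broken block from this issue: not numeric (flags 32) and no tasks data.
broken = {"blocks": [{"flags": 32, "tasks_num": 1}]}
# The same block after switching the flag to 33 (numeric).
fixed = {"blocks": [{"flags": 33, "tasks_num": 1}]}

print(job_is_valid(broken))  # False
print(job_is_valid(fixed))   # True
```

Under these assumptions, a store reader could skip or delete any job for which `job_is_valid` returns False instead of handing it to the scheduler.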

CGRU / cgru

afserver breaks job, which causes 'Segmentation fault (core dumped)' crashes #563