LNST-project / lnst

Linux Network Stack Test
GNU General Public License v2.0
73 stars 33 forks source link

Fix job cleanup on agents #355

Closed Kuba314 closed 9 months ago

Kuba314 commented 10 months ago

Description

See commit message bodies for more info.

Tests

Normal SimpleNetworkRecipe execution passes, interrupting with ^C during iperf test also finishes well and doesn't crash agents. Not sure if this is enough testing though...

Closes: #354

Kuba314 commented 10 months ago

@olichtne The cleanup() method was on JobContext, which is only used in RemoteMethods and cannot actually be called via RPC, or have I misunderstood?

Btw, https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Agent.py#L1059 get_cmd and del_cmd don't actually exist anymore, so the whole "exception" handler can be removed? Which is weird because we are probably sometimes sending "exception" messages at:

olichtne commented 10 months ago

@olichtne The cleanup() method was on JobContext, which is only used in RemoteMethods and cannot actually be called via RPC, or have I misunderstood?

ah, you're right, i misread that...


Btw, https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Agent.py#L1059 get_cmd and del_cmd don't actually exist anymore, so the whole "exception" handler can be removed? Which is weird because we are probably sometimes sending "exception" messages at:

so there are 2 exception type messages:

  1. what you've linked is exception messages that are provided to send_data_to_ctl method, this sends a message from an Agent process directly to the Controller process
  2. there are exception messages from Jobs: https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Job.py#L114 this is sent from a Forked Agent process running a Job._run method to the main Agent process... there it should be handled by the exception message handler you linked.

So I think this should be kept as this has a different, still existing use case.

However, I'm not sure why the get_cmd and del_cmd method calls are still there and why we haven't seen any problems with them recently.

This should be testable by runing a test module linke this:

class MyTestModule(BaseTestModule):
    def run():
        sleep(2) # simulate runtime
        raise Exception("exception from Job process")
Kuba314 commented 10 months ago

@olichtne I'm seeing the following in host1's DEBUG output:

    Traceback (most recent call last):
      File "/lnst/lnst/Agent/Job.py", line 263, in run
        self._result["passed"] = self._what["module"].run()
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/.cache/pypoetry/virtualenvs/lnst-tHnIm-F_-py3.12/bin/cache/1e90595123b49a02d0c371edd1d0e9c9d778da4db94c6657ddc1c4484b1e255e", line 12, in run
        raise Exception("exception from Job process")
    Exception: exception from Job process

This is where the test module gets run: https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Job.py#L262. Any exception gets caught right here and its traceback is printed to agent's debug. There's however another try-catch which I don't think will ever trigger at https://github.com/LNST-project/lnst/blob/master/lnst/Agent/Job.py#L110.

olichtne commented 9 months ago

got it. makes sense... i'd still keep these except branch blocks and the exception message handling in the Agent for now. It may look useless right now but i'm not 100% sure and i think it at least doesn't have to be part of this PR.

@Kuba314 anything else that you want to do for this PR since you're keeping it in Draft status? or can we proceed with merging this in?

Kuba314 commented 9 months ago

Oops, no, sorry. It was just a proposed fix at the time of creation but now it's tested and approved by you so I'm marking it ready.