VOLTTRON / volttron

VOLTTRON Distributed Control System Platform
https://volttron.readthedocs.io/

System Test leads to unlabelled agent that then must be manually removed (non-auto) #2510

Open MachinesJesus opened 3 years ago

MachinesJesus commented 3 years ago

Description of Issue

System Test leads to unlabelled agent that then must be manually removed (non-auto)

Affected Version

7.0 but can test any.

Screenshots

Expected

time.sleep(1)
os.system('vctl remove --tag aws* --force')

Actual

About 20% of the time this fails. Then I have to go and manually remove the agent.

Steps to Reproduce

Use: os.system('vctl remove --tag aws* --force')

Change aws* to your agent prefix. Repeat the system remove, re-build, run cycle. 20% of the time it will fail in the removal process, requiring intervention.

The system should have at least 3 agents that are automatically removed and rebuilt.
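To put a number on the "about 20%" figure, the remove / re-build cycle can be driven from a small harness. This is only a sketch: `shell_step` and `failure_rate` are hypothetical helper names, and it assumes that each command signals failure through a non-zero exit code, which has not been confirmed for `vctl remove`.

```python
import subprocess
from typing import Callable, List


def shell_step(command: str) -> Callable[[], bool]:
    """Wrap a shell command as a step that returns True on exit code 0."""
    def step() -> bool:
        # Assumption: a non-zero exit code means the step failed.
        return subprocess.run(command, shell=True).returncode == 0
    return step


def failure_rate(steps: List[Callable[[], bool]], cycles: int) -> float:
    """Run the steps in order for `cycles` rounds and return the failed fraction.

    A round counts as failed as soon as one step returns False; the
    remaining steps of that round are skipped.
    """
    failures = 0
    for _ in range(cycles):
        if not all(step() for step in steps):
            failures += 1
    return failures / cycles
```

A run might look like `failure_rate([shell_step("vctl remove --tag aws* --force"), rebuild], cycles=20)`, where `rebuild` is whatever callable re-installs the agents in your setup.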

Additional Details

(volttron) ubuntu@ubuntu:~/te/vtrn$ vctl status
  AGENT               IDENTITY              TAG      STATUS  HEALTH
0 awspagent-0.1       awspagent-0.1_1       awsp1
1 awssagent-0.1       awssagent-0.1_2       awss1
e awssagent-0.1       awssagent-0.1_1
9 ec2marketagent-0.1  ec2marketagent-0.1_1  aws_ec2

This shows me manually removing...

(volttron) ubuntu@ubuntu:~/te/vtrn$ vctl remove e
Removing ebacd847-6e6a-4c8e-aad3-b6701dc96055 awssagent-0.1

MachinesJesus commented 3 years ago

Really I just need a 'remove all agents' that works reliably. For my system tests I re-build all the agents each time (only take ~20 secs).
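Until the removal itself is reliable, one workaround is to check the status listing after the tag-based remove and pick out any leftover row that has no tag. The sketch below only parses text; `untagged_agents` is a hypothetical helper, and it assumes the agents are stopped so that STATUS and HEALTH are empty (as in the listing above). With running agents the column count changes and this simple split would not work.

```python
from typing import List


def untagged_agents(status_output: str) -> List[str]:
    """Return the UUID prefixes of rows that have no TAG column.

    Assumes stopped agents, so each row is
    `<uuid-prefix> <agent> <identity> [<tag>]`.
    """
    prefixes = []
    for line in status_output.splitlines():
        fields = line.split()
        # Skip blank lines and the column header row.
        if len(fields) < 3 or fields[0] == "AGENT":
            continue
        # Exactly three fields: uuid prefix, agent, identity, and no tag.
        if len(fields) == 3:
            prefixes.append(fields[0])
    return prefixes
```

Each returned prefix can then be passed to `vctl remove <prefix>`, which is the same manual step shown earlier in this thread.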

craig8 commented 3 years ago

@MachinesJesus can you tell me how many agents you are targeting? From what I see you have 4. I have a feeling there are some timeouts happening.

Using the scripts/install-agent.py --force script you can rebuild/install without replacing the payloads or changing vip identities.

I will take a look and see if I can reproduce on my system and go from there as far as figuring out the issue.

MachinesJesus commented 3 years ago

Yeah I think you may have said add time delays on Slack. So I have 1 sec between every remove.

I guess the question is: how does the agent get created, or half deleted, and end up tagless?

I believe the tagless nature of the ghost agent may indicate where the code needs to be more robust.

craig8 commented 3 years ago

That is truly the question. One thing I do know: the normal time.sleep in Python is probably not going to help because of the coroutines (though it may), whereas gevent.sleep allows a context switch until the command is done. I will look into why this is not atomic, because I believe that is truly what's going on here.
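A fixed one-second pause only guesses at how long a removal takes. An alternative is to poll for the expected state with a deadline. The sketch below uses only the standard library; `wait_until` is a hypothetical helper, and the sleep function is a parameter so that code running under gevent could pass `gevent.sleep` instead of `time.sleep`. Whether that changes anything for a test script that shells out with `os.system` is untested here.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 30.0,
               interval: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if the deadline expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The condition could be, for example, a function that runs `vctl status` and returns True once no `aws*` rows remain.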

MachinesJesus commented 3 years ago

Here is the data:

Starting VOLTTRON verbosely in the background with VOLTTRON_HOME=/home/ubuntu/.volttron
Waiting for VOLTTRON to startup..
VOLTTRON startup complete

Before Remove:

8 awspagent-0.1 awspagent-0.1_1 awsp1
c awssagent-0.1 awssagent-0.1_1 awss1
a ec2marketagent-0.1 ec2marketagent-0.1_1 aws_ec2
AGENT IDENTITY TAG PRI

Removing agents...

remove: error: agent not found: bat
remove: error: agent not found: listener
remove: error: agent not found: zig
remove: error: agent not found: plc
Removing c9b7df8b-16e5-4a27-a2b2-8afa2f31f554 awssagent-0.1
Removing a4e8e681-02cc-4683-93fb-87ccb2b6edb6 ec2marketagent-0.1
Removing 89a6e12c-cfe4-42b4-b4bd-eaea6017a3a9 awspagent-0.1
remove: error: agent not found: test
remove: error: agent not found: stupid

After Remove: (After full remove should be empty below)

No installed Agents found

Re-installing...

{ "agent_uuid": "99f74afc-d5f9-4c4c-a29b-21eed9a70ee7" }
ERROR:install-agent.py: Error installing agent: Below Command failed with non zero exit code.
Command:['/home/ubuntu/te/vtrn/env/bin/volttron-ctl', 'install', '/home/ubuntu/.volttron/packaged/awspagent-0.1-py3-none-any.whl', '--tag', 'awsp1']
Stderr: b'2020-11-11 10:41:39,775 () volttron.platform.vip.agent.core ERROR: No response to hello message after 10 seconds.\n2020-11-11 10:41:39,776 () volttron.platform.vip.agent.core ERROR: Type of message bus used zmq\n2020-11-11 10:41:39,776 () volttron.platform.vip.agent.core ERROR: A common reason for this is a conflicting VIP IDENTITY.\n2020-11-11 10:41:39,776 () volttron.platform.vip.agent.core ERROR: Another common reason is not having an auth entry onthe target instance.\n2020-11-11 10:41:39,776 () volttron.platform.vip.agent.core ERROR: Shutting down agent.\n2020-11-11 10:41:39,776 () volttron.platform.vip.agent.core ERROR: Possible conflicting identity is: control.connection\ninstall: operation timed out\n'
NoneType: None
Traceback (most recent call last):
  File "scripts/install-agent.py", line 387, in <module>
    install_agent(opts, opts.package, opts.config)
  File "scripts/install-agent.py", line 148, in install_agent
    out = execute_command(cmds, env=env, logger=log,
  File "/home/ubuntu/te/vtrn/volttron/platform/agent/utils.py", line 776, in execute_command
    raise RuntimeError()
RuntimeError
{ "agent_uuid": "b9f900b4-8a34-410f-ab1c-f2fbc292fe82" }

---> Then the problem is that I terminate this thread manually, which leads to the ghost agent. However, why is this happening? Maybe I can code a try loop or build an error-checking wrapper for the agent install.
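The "try loop" idea could look roughly like the sketch below. `run_with_retries` is a hypothetical helper built on the standard-library `subprocess` module, not part of VOLTTRON. Note that `subprocess.run` kills the child process when its timeout expires, and killing an install part-way through is what seems to leave the ghost agent, so the timeout here should be generous enough for the command to exit on its own.

```python
import subprocess
import time
from typing import List, Optional


def run_with_retries(cmd: List[str], attempts: int = 3, timeout: float = 60.0,
                     delay: float = 2.0) -> Optional[subprocess.CompletedProcess]:
    """Run `cmd` until it exits with code 0.

    Returns the successful CompletedProcess, or None if every attempt
    failed or timed out.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout)
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child at this point.
            print(f"attempt {attempt}: timed out after {timeout}s")
        else:
            if result.returncode == 0:
                return result
            print(f"attempt {attempt}: exit code {result.returncode}\n{result.stderr}")
        if attempt < attempts:
            time.sleep(delay)  # pause before the next attempt
    return None
```

It could wrap the command from the log above, for example `run_with_retries(['volttron-ctl', 'install', '/home/ubuntu/.volttron/packaged/awspagent-0.1-py3-none-any.whl', '--tag', 'awsp1'])`. After a failed attempt it is probably worth checking `vctl status` for a half-installed agent before retrying.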

MachinesJesus commented 3 years ago

It always seems to be on the second agent install FYI.

MachinesJesus commented 3 years ago

Can I alter the 10 seconds? "No response to hello message after 10 seconds." Can I make that 2 seconds?

craig8 commented 3 years ago

https://github.com/VOLTTRON/volttron/blob/125d7a0a47d0f2a1eaf130016a5197a787f52ceb/volttron/platform/vip/agent/core.py#L588 is where that wait is located.

This is in the main branch, but should be very close on the others if you want to modify it.

Note the other thing you might try is --skip-requirements. Assuming your agent has a requirements.txt file, it will attempt to install those requirements on every install. It would be good to skip that during a force install, but that's not the way the code is written currently.

MachinesJesus commented 3 years ago

Hey @craig8, I gave that a go (above) and also changed the timeout=10.0 to 2.0 on line 600. Rebuilt and installed. Weirdly, I still think it took 10 secs. Also, I got 2 in a row for the first time. I'm gonna try a few more tests to see if I can pinpoint the phenomenon.

MachinesJesus commented 3 years ago

Yeah appears to always be the second agent install.