canonical / charmed-magma

Charmed Magma is an open source private mobile network operated with Juju.
https://canonical-charmed-magma.readthedocs-hosted.com/

AGW not checking into orchestrator. #22

Closed mounika-alavala closed 1 year ago

mounika-alavala commented 1 year ago

Hi, We have installed charmed magma orchestrator and AGW services on Openstack VMs with Ubuntu 20.04 OS, running behind proxies.

Orchestrator: microk8s version = 1.23 AGW: Version = 1.6.1

We are able to access the NMS UI and all the AGW services are in the active state, but there are error logs in the AGW services. Attached are screenshots of the same (MicrosoftTeams-image (8) through (17)).

Even though the orchestrator services come to the "Active" and "Idle" state after installation, they tend to go to the "maintenance" state after a day or so and remain there. Even though the same proxy values are used, it is not always the same services that go to the "maintenance" state. Screenshot (1015)

The orchestrator endpoints are accessible from the AGW; we used telnet to check this. MicrosoftTeams-image (19)

Following the debugging section of the documentation, we ran a few Python scripts to confirm that every prerequisite is satisfied. When we executed the "checkin_cli.py" script, we found that the gateway certificate and gateway key are missing. Restarting the magma services did not regenerate the certificate and key. MicrosoftTeams-image (18)

We tried checking the AGW in to the orchestrator with the correct hardware details, but it is not checking in and the status in the NMS UI is "Bad".

Any help will be appreciated. Thanks in advance.

sanchezfdezjavier commented 1 year ago

Hi @mounika-alavala, thanks for opening an issue. Would you mind providing the output of juju debug-log --replay, and letting us know how long it has been deployed?

mounika-alavala commented 1 year ago

Hi, thanks for the reply. We have had the setup for 5 days. Attached are the debug logs of the orchestrator. juju_replay.txt

gruyaume commented 1 year ago

There may be multiple issues here:

  1. The orchestrator applications going from active to maintenance
  2. Connectivity between AGW and Orchestrator / Status of AGW

@mounika-alavala Do you mind connecting to one of the Kubernetes pods associated with an application in Maintenance status and telling me if the workload service is running?

Example for orc8r-device

kubectl exec -ti orc8r-device-0 -c magma-orc8r-device -n <your model name> -- bash
ps -ef
gruyaume commented 1 year ago

I don't think that your bug is related to this but I also observed a bunch of error logs that shouldn't be there. Here's the PR to fix this.

gruyaume commented 1 year ago

Also, I see that the output of the debug-log command is cut short. Is there a way to get all the logs from the deployment? We may want to filter on errors: juju debug-log --replay --level ERROR.
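One way (just a suggestion, not something this thread requires) to capture everything to files for attaching would be:

juju debug-log --replay --no-tail > all-logs.txt
juju debug-log --replay --level ERROR > error-logs.txt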

DeepakalaB commented 1 year ago

Hi @gruyaume, I am a colleague of @mounika-alavala. Following your request above, please find attached a screenshot showing the connection to the pods in the maintenance state. Screenshot (1512)

mounika-alavala commented 1 year ago

Hi, thanks for the reply. Attached is the error log. error.txt

ghislainbourgeois commented 1 year ago

Hi @mounika-alavala, first of all, thank you for providing the logs. I have analyzed them and did not find any clues as to what is happening with the AGW. Can you please share the configuration that was used to deploy the AGW, and screenshots of its configuration inside the NMS?

Also, can you run this command and post its output while on the Juju model for the AGW: juju run magma-access-gateway-operator/0 post-install-checks.

Thanks!

mounika-alavala commented 1 year ago

Hi, thanks for the reply. Attached are screenshots of the information you requested. Please note that we deployed all of this on top of OpenStack VMs behind a proxy. Screenshot (1026)

Screenshot (1027)

Screenshot (1028)

post

ghislainbourgeois commented 1 year ago

I do not think the SGi addresses should be in the same range as the IP Block. The IP Block is used to give out addresses to the UEs.

I think in this version of Magma, the UI to configure the EPC is a bit difficult to follow. Here is what I think you should configure:

IP Block: private range of IPs that will be given out to the UEs
DNS Primary & Secondary: DNS servers that can be reached by the AGW through eth0
SGi network Gateway IP address: the default gateway that the AGW will use on eth0
SGi management interface IP address: the IP and netmask configured on eth0
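For illustration only, here is a hedged example with made-up values (assuming eth0 sits on a /24 SGi network; none of these addresses come from the deployment in this issue):

IP Block:                            192.168.128.0/24   # pool handed out to UEs
DNS Primary / Secondary:             8.8.8.8 / 8.8.4.4  # must be reachable via eth0
SGi network Gateway IP address:      10.0.2.1           # default gateway on eth0
SGi management interface IP address: 10.0.2.10/24       # address/netmask on eth0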

I will look into improving the documentation for this, and maybe see if we can change this page in charmed-magma.

mounika-alavala commented 1 year ago

Hi, thanks for the reply. Even when the IP block and SGi address are in different ranges, it still gives the same error. The DNS server can be reached through eth0. We did set the SGi network gateway and management IP correctly; we are sure about this because we have set up Magma on bare-metal servers before. Now we are trying on top of OpenStack VMs with proxies.

ghislainbourgeois commented 1 year ago

OK, in that case, are you able to take a network capture of the traffic between the AGW and ORC8R? Nothing in the logs you provided indicates any issues. Ideally, we would need a capture taken on the AGW and another on the ORC8R, but if you are only able to get one, let us start with the AGW.

mounika-alavala commented 1 year ago

Attached are the network captures. There is no traffic via eth0 on the AGW. tcpdump_agw_eth1_16feb_23.txt tcpdump_orc8r_16fec23.txt

ghislainbourgeois commented 1 year ago

Thank you very much for providing those. While looking into the AGW capture, I found that the issue is related to the proxy. The proxy is returning errors like this: "The following error was encountered while trying to retrieve the URL: https://bootstrapper-controller.5gmagmatest.com/*".

My guess is that the proxy does not know how to resolve the domain name 5gmagmatest.com. I am not sure how much control you have on that proxy, but a solution could be to add the domains to its hosts file. If this is not possible, I would suggest using a domain name backed by whatever DNS server the proxy is using.
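For example (purely illustrative: the IP is a placeholder, and only bootstrapper-controller, controller, and fluentd are hostnames mentioned in this thread), an entry in the proxy server's hosts file could look like:

# /etc/hosts on the proxy server (placeholder address)
10.0.0.5  bootstrapper-controller.5gmagmatest.com controller.5gmagmatest.com fluentd.5gmagmatest.com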

Let me know if this helps.

mounika-alavala commented 1 year ago

We already added the domains to the hosts file. Using telnet, we also verified access, and it was able to resolve the domain name. image

ghislainbourgeois commented 1 year ago

Telnet bypasses the proxy. You could try something like this to test through the proxy:

export https_proxy="<proxy url>"
curl https://bootstrapper-controller.5gmagmatest.com/

In this particular case, you could avoid using the proxy at all and it should work. However, if your goal is to test through the proxy, the proxy server itself needs to be able to resolve those domain names and connect to those ports.
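As a sketch to compare both paths side by side (the --noproxy flag is standard curl; the proxy URL is a placeholder):

# through the proxy
https_proxy="http://<proxy host>:<port>" curl -v https://bootstrapper-controller.5gmagmatest.com/
# direct, bypassing any proxy variables
curl -v --noproxy '*' https://bootstrapper-controller.5gmagmatest.com/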

mounika-alavala commented 1 year ago

Hi, we have tested using curl too. We actually added the domain (5gmagmatest.com) to "no_proxy".

ghislainbourgeois commented 1 year ago

If you add the domain to no_proxy before testing with curl, it will only test the direct connection. I am curious why the AGW is trying to connect through the proxy, as the capture clearly shows it does. Did you configure the proxy anywhere on the VM?

mounika-alavala commented 1 year ago

Yes, we have configured the proxy on the VM. Proxies are set in .bashrc, /etc/environment, and /etc/wgetrc.

ghislainbourgeois commented 1 year ago

OK, so the issue is that the proxy server does not know 5gmagmatest.com and cannot forward traffic to it. You can either disable the proxy globally on the VM, or go into the proxy server itself and configure its hosts file and network so that it can resolve and connect to 5gmagmatest.com.
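To see where a global proxy might be picked up from on the VM, a quick (non-exhaustive) check such as the following can help; the paths listed are just the usual suspects:

grep -ri proxy /etc/environment /etc/profile.d/ ~/.bashrc /etc/wgetrc 2>/dev/null
env | grep -i proxy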

mounika-alavala commented 1 year ago

We tested using curl too. (Screenshots attached: 100, 101, 102, 103, 104.)

ghislainbourgeois commented 1 year ago

Right now, your tests with curl set the no_proxy variable, so they only show direct connectivity. The Magma AGW, however, is picking up the globally configured proxy setting and sends its traffic through the proxy server.

You have 2 options to fix the issue:

  1. Remove all proxy configuration from the VM and restart the AGW.
  2. Contact the proxy administrator and ask them to configure the proxy server itself to connect to your domain name.
mounika-alavala commented 1 year ago

Hi, we set the proxies and no_proxy on the VM as shown below. We added these proxy settings in .bashrc, /etc/environment, and /etc/wgetrc (screenshot: proxy). We also made a host entry in /etc/hosts (screenshot: hosts). As mentioned above, requests from the AGW to the orchestrator services are not going via the proxy; they are connecting directly.

ghislainbourgeois commented 1 year ago

Unfortunately, it seems that the AGW is not taking no_proxy into account, because the capture you provided shows traffic going through the proxy.

agw_proxy

mounika-alavala commented 1 year ago

Hi, can you guide us on how to set the no_proxy variable so that the AGW picks it up?

ghislainbourgeois commented 1 year ago

Hi, can you try running:

sudo snap unset system proxy.http
sudo snap unset system proxy.https
sudo service magma@* stop
sudo service magma@magmad restart
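To confirm that the snap-level proxy settings are gone afterwards, a quick check like this (not part of the original instructions) should report them as unset:

sudo snap get system proxy.http
sudo snap get system proxy.https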

I have looked at snap settings, and there does not seem to be an option for setting no_proxy there. If this works, I can raise an issue to add this feature.

If it does not work, please share the exact content of /etc/environment.

DeepakalaB commented 1 year ago

Hi, thanks. I tried running the commands you mentioned, but the AGW is still not picking up the no_proxy settings. Please find attached the exact content of /etc/environment.

image

ghislainbourgeois commented 1 year ago

Where is the variable $PROXY_NO defined? It is not in /etc/environment, so that might be the issue. I see from previous comments that it is generated dynamically in a subshell, so my guess is that it is only set in .bashrc or similar, and that will not apply to the AGW.

Please define no_proxy directly in /etc/environment.
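For illustration (hypothetical values; substitute your actual proxy URL and domain), /etc/environment could contain entries like:

http_proxy="http://proxy.example.com:3128"
https_proxy="http://proxy.example.com:3128"
no_proxy="localhost,127.0.0.1,5gmagmatest.com,.5gmagmatest.com"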

DeepakalaB commented 1 year ago

Oh okay, you are right. I have defined no_proxy in .bashrc. Let me try adding it directly to /etc/environment.

DeepakalaB commented 1 year ago

Thanks. After setting no_proxy in /etc/environment and restarting the magmad services on the AGW, I can see that the gateway.crt and gateway.key files have been successfully generated under the /var/opt/magma/certs directory.
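For reference, a quick way to confirm the generated files (not something requested in the thread; the path is the one mentioned above):

ls -l /var/opt/magma/certs/gateway.crt /var/opt/magma/certs/gateway.key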

image

But I am still seeing magmad errors in the logs, and the AGW has still not checked in with the orc8r. image image

I ran checkin_cli.py to debug the issue; I have attached the screenshot below. image

Can you help me with what I am missing here?

ghislainbourgeois commented 1 year ago

Unfortunately, those logs do not tell us much. Can you provide a network capture on the AGW like the one provided before?

DeepakalaB commented 1 year ago

Sorry. Please find attached tcpdump from agw: tcpdump_agw_eth1_17feb_night.log

Hope this helps.

DeepakalaB commented 1 year ago

From the tcpdump on the AGW, I can still see the AGW trying to go via the proxy to connect to 5gmagmatest.com.

ghislainbourgeois commented 1 year ago

There is no indication in the documentation for upstream Magma 1.6 that installing the AGW behind an HTTP proxy is supported. I would like to better understand what you are trying to achieve with this setup.

I think removing the global configuration and testing that way would be the best way forward. Afterwards, if UEs going through the AGW need to go through an HTTP proxy, I think the Header Enrichment feature can be used for similar reasons.

Since the AGW behaves like a router, traffic from the UE would not go directly through the proxy with this setup even if we were able to make it check in to the orc8r. The proxy configuration would need to be done on the UE.

DeepakalaB commented 1 year ago

Hi, we are trying to set up a private 5G cloud on MAAS/OpenStack VMs. The OpenStack VMs do not have direct internet access for security reasons, so they have to use the MAAS proxy only.

For the AGW to communicate with the orc8r services, we have to make sure that all the orc8r services/nodes are part of the no_proxy list in the /etc/environment and .bashrc files. We are still trying to figure out why the AGW is not picking up the no_proxy settings from /etc/environment.

The AGW gets its proxy settings from /etc/environment, so why is it not picking up the no_proxy settings? That seems to be the root cause of this problem. If the AGW could pick up the no_proxy settings, it would skip the proxy and connect directly to the orc8r services.

Also, magma-access-gateway.configure ran successfully and generated the gateway.crt and gateway.key files. For that, I assume the AGW must have been able to access bootstrapper-controller.5gmagmatest.com while bypassing the proxy server. So if the AGW is able to access one service, namely the bootstrapper, why is it not doing the same with the other services (controller, fluentd)?

ghislainbourgeois commented 1 year ago

Hi, magma-access-gateway.configure does not require access to bootstrapper-controller; it basically creates the configuration files and places the certificates in the right place.

One thing we can do to validate is to see what proxy configuration Magma actually sees. You can find the PID of the MME service:

systemctl status magma@mme.service

The main PID will be in the output. You can then use the PID in this command:

cat /proc/<PID>/environ

With this, we will be at least able to see the environment view from the process point of view.
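Since /proc/<PID>/environ is NUL-separated, piping it through tr makes it easier to read; for example (the grep just filters for proxy-related variables):

cat /proc/<PID>/environ | tr '\0' '\n' | grep -i proxy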

mounika-alavala commented 1 year ago

Hi, it does have the "no_proxy" values. image

ghislainbourgeois commented 1 year ago

I will try to replicate the setup locally to see if I can reproduce the issue and debug further.

mounika-alavala commented 1 year ago

Ok Thank you.

DeepakalaB commented 1 year ago

Hi @ghislainbourgeois, thanks a lot! As you suggested, we removed all the proxy variables from /etc/environment after the install and it worked! The AGW is now able to check in to the orc8r.

Screenshot (1546) Screenshot (1547) image

ghislainbourgeois commented 1 year ago

I am glad that it worked. On my side, I recreated a similar setup and was able to reproduce the behaviour. The AGW is able to bootstrap, but it does not check in properly afterwards and never shows up as Good in the Orchestrator.

I will use this setup to dig a bit deeper and understand why it only partly works.

In your setup, do you connect an enodeB and some UEs for testing? Does the current setup let you do anything useful with the AGW?

mounika-alavala commented 1 year ago

Hi, we set up srsRAN on another OpenStack VM. It did set up a GTP tunnel, but it is not functional; uplink and downlink are not working. We did a few checks:

  1. Enabling ping in the "pipelined.yml" file on the AGW machine.
  2. Checking if "nat_iface" is reachable.
  3. Checking if a tunnel is getting created on the srsRAN VM.
  4. Checking the kernel version.

All of these are working and correct, but we still couldn't make it work. Attached are screenshots of the issue we are facing and the srsRAN details: UE MicrosoftTeams-image (35), eNodeB MicrosoftTeams-image (34), ovs dump-flows MicrosoftTeams-image (31) MicrosoftTeams-image (33), ping UE from AGW MicrosoftTeams-image (32), ping AGW from UE MicrosoftTeams-image (30). Thank you.

ghislainbourgeois commented 1 year ago

That is weird; I would expect ping to work in this case, but not much else. Can you provide the output of the following commands on the AGW, and also a network capture on the AGW:

ip -br -c a
ip route get 192.168.128.18
tcpdump -i any -s0 -w agw.pcap icmp

On my side, I have made some progress regarding the proxy setup. It turns out that magmad picks up the settings properly, but control-proxy, which runs nghttpx, is configured with the proxy and unfortunately does not support no_proxy. Its configuration is recreated automatically on each magmad startup, so there is no easy fix there.

I thus think that the recommendation when behind a proxy is to ensure that no proxy is configured in /etc/environment, and to configure it directly for other applications that require a proxy (e.g. apt for package upgrades).
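For example (an illustrative apt-specific configuration; the proxy URL is a placeholder and the file name is just a convention), the proxy for package upgrades could live in /etc/apt/apt.conf.d/95proxy instead of /etc/environment:

Acquire::http::Proxy "http://proxy.example.com:3128/";
Acquire::https::Proxy "http://proxy.example.com:3128/";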

mounika-alavala commented 1 year ago

Attached are the details you asked for. image image Also attached: a trace from srsenb while running ping in both the UL and DL directions MicrosoftTeams-image (36), the successful UE attach MicrosoftTeams-image (37), and the UE IP assigned in namespace ue1 MicrosoftTeams-image (38).

Note: agw.pcap was renamed to agw.log because *.pcap attachments are not supported. agw.log

ghislainbourgeois commented 1 year ago

I think in this case the problem is in the networking setup of the AGW. The traffic from the UE should come and go through the virtual network interface gtp_br0, and not directly out on the S1 interface (eth1).

I think the problem is that the IP Block configured in the orchestrator for this AGW is wrong. It should probably be 192.168.128.0/24. Can you try changing this setting, restarting the AGW, and trying again with srsRAN?

mounika-alavala commented 1 year ago

Hi, even when the IP block is 192.168.128.0/24, the issue remains the same.

ghislainbourgeois commented 1 year ago

I think you mentioned that you already set block_agw_local_ips=false in /etc/magma/pipelined.yml, right? If not, please set it and restart the magma@pipelined service.
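A minimal sketch of that step (the key and service names are the ones mentioned above; the exact placement of the key within pipelined.yml is assumed):

# after setting block_agw_local_ips: false in /etc/magma/pipelined.yml
sudo systemctl restart magma@pipelined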

Can you also provide a network capture directly on the gtp_br0 interface when the IP block is set to 192.168.128.0/24? It will help narrow down the problem.
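Something like the following should work for that capture (the interface name comes from the message above; the output file name is arbitrary):

sudo tcpdump -i gtp_br0 -s0 -w gtp_br0.pcap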

mounika-alavala commented 1 year ago

Yes we did set it to false.

We captured only for a little while. Now it is assigning addresses only in the 192.168.30 subnet.

 1   0.000000 192.168.30.128 → 10.250.110.36 ICMP 100 Echo (ping) request  id=0x2a2c, seq=1/256, ttl=64
 2   0.003793 10.250.110.36 → 192.168.30.128 ICMP 100 Echo (ping) reply    id=0x2a2c, seq=1/256, ttl=63 (request in 1)
 3  59.794959 192.168.128.1 → 192.168.30.22 ICMP 100 Echo (ping) request  id=0x2a2d, seq=1/256, ttl=64
 4  60.823803 192.168.128.1 → 192.168.30.22 ICMP 100 Echo (ping) request  id=0x2a2d, seq=2/512, ttl=64
 5  61.043090 192.168.30.128 → 10.250.110.36 ICMP 100 Echo (ping) request  id=0x2a2e, seq=1/256, ttl=64
 6  61.047283 10.250.110.36 → 192.168.30.128 ICMP 100 Echo (ping) reply    id=0x2a2e, seq=1/256, ttl=63 (request in 5)
 7  61.847800 192.168.128.1 → 192.168.30.22 ICMP 100 Echo (ping) request  id=0x2a2d, seq=3/768, ttl=64
 8  62.871741 192.168.128.1 → 192.168.30.22 ICMP 100 Echo (ping) request  id=0x2a2d, seq=4/1024, ttl=64
 9  63.899727 192.168.128.1 → 192.168.30.22 ICMP 100 Echo (ping) request  id=0x2a2d, seq=5/1280, ttl=64
10 122.085721 192.168.30.128 → 10.250.110.36 ICMP 100 Echo (ping) request  id=0x2a2f, seq=1/256, ttl=64
11 122.089748 10.250.110.36 → 192.168.30.128 ICMP 100 Echo (ping) reply    id=0x2a2f, seq=1/256, ttl=63 (request in 10)
12 183.125904 192.168.30.128 → 10.250.110.36 ICMP 100 Echo (ping) request  id=0x2a30, seq=1/256, ttl=64
13 183.130500 10.250.110.36 → 192.168.30.128 ICMP 100 Echo (ping) reply    id=0x2a30, seq=1/256, ttl=63 (request in 12)
14 244.166527 192.168.30.128 → 10.250.110.36 ICMP 100 Echo (ping) request  id=0x2a31, seq=1/256, ttl=64
15 244.171539 10.250.110.36 → 192.168.30.128 ICMP 100 Echo (ping) reply    id=0x2a31, seq=1/256, ttl=63 (request in 14)
16 254.801842 10.250.110.104 → 192.168.30.128 ICMP 104 Destination unreachable (Network unreachable)
17 254.803360 10.250.110.104 → 192.168.30.128 ICMP 104 Destination unreachable (Network unreachable)
18 257.848888 10.250.110.104 → 192.168.30.128 ICMP 104 Destination unreachable (Network unreachable)
19 260.888801 10.250.110.104 → 192.168.30.128 ICMP 104 Destination unreachable (Network unreachable)

ghislainbourgeois commented 1 year ago

I still think the network setup is not correct. I think you should have 192.168.128.0/24 set in the IP Block setting, then:

# Stop UE and enodeB
systemctl stop magma@*
systemctl start magma@magmad
# Start enodeB
# Start UE

Then, the UE should get an IP in the range 192.168.128.0/24. You should be able to ping the AGW from the UE with this command: ping 192.168.128.1. And you should be able to ping the UE from the AGW using this command: ping 192.168.128.x (replace x with the right number from the UE attach message).

In your last messages, you seem to be pinging between different networks.

DeepakalaB commented 1 year ago

Thanks @ghislainbourgeois. We are able to ping the AGW from the UE and vice versa :-) image

Screenshot (1587)