NationalSecurityAgency / ghidra

Ghidra is a software reverse engineering (SRE) framework
https://www.nsa.gov/ghidra
Apache License 2.0

Ghidra server operations over high-latency networks are prohibitively slow #2752

Open grant-h opened 3 years ago

grant-h commented 3 years ago

Describe the bug
When using Ghidra server to collaborate on large binaries, if the users are more than 100 ms (ping) away from the server (i.e. connecting over the internet), checkouts and check-ins take an extremely long time, on the order of 30 minutes, which is unusual for the binary we are analyzing. CPU usage for Java sits around 1%, Wireshark shows packets flowing but not quickly, and disk I/O is around 500 KB/s, far below the network line rate. Since the operation is bound by neither CPU, disk, nor bandwidth, the protocol appears to need optimization. I consider this a bug.

To Reproduce
Steps to reproduce the behavior:

  1. Host a Ghidra server
  2. Add over 100 ms of network delay between the server and client machine (artificially, using tc on Linux as sketched below, or real)
  3. Connect to the server and create a shared project on the client
  4. Add a new binary to the project (on the order of megabytes)
  5. Analyze it
  6. Save it
  7. Check it in
  8. Observe that the check-in runs very slowly
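
For step 2, a minimal sketch of adding artificial delay with Linux netem (eth0 is a placeholder; substitute the interface actually carrying the traffic):

```sh
# Add 100 ms of delay to all egress traffic on eth0 (run as root).
tc qdisc add dev eth0 root netem delay 100ms

# Remove the delay again when finished.
tc qdisc del dev eth0 root
```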

Expected behavior
Check-in and check-out should NOT be I/O bound, but instead be CPU bound.

Environment (please complete the following information):

ghidra1 commented 3 years ago

Sorry. While large blocks of data are compressed and passed over a separate data connection, a large portion of the interface utilizes Java RMI. The streaming buffer connection has been tested in high-latency environments and has performed very well; it was separated from the RMI interface for this very reason. We sympathize with your situation, but I'm not sure there is much we can do to help.
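
For a sense of scale (a back-of-envelope estimate, assuming one network round trip per synchronous RMI call): at a 100 ms RTT, 10,000 calls cost at least 10,000 × 0.1 s ≈ 17 minutes of pure waiting, independent of bandwidth. That is why the chatty RMI portion of the interface degrades with latency even though the streaming connection does not.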

grant-h commented 3 years ago

Interesting. So, is this separate connection used for the mass transfer of check-in/check-out data? My server is behind a NAT; is it possible that this separate connection is failing and it is then falling back to RMI?

ghidra1 commented 3 years ago

Note that we have encountered similar behavior when fragmentation is excessive and re-transmit requests are high; in some situations the connections would hang. In my case this was resolved by ensuring that both client and server utilized a reduced MTU of 1500, avoiding jumbo frames and the resulting fragmentation. I suspect your issue is a network-related problem triggered by the large chunks of data Ghidra passes.
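
A quick way to check, and if necessary clamp, the MTU on Linux (eth0 is again a placeholder interface):

```sh
# Show the current MTU of each interface.
ip link show

# Clamp eth0 to the standard 1500-byte Ethernet MTU to avoid
# jumbo-frame fragmentation (run as root; not persistent across reboots).
ip link set dev eth0 mtu 1500
```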

ghidra1 commented 3 years ago

NAT-related issues would cause a complete failure, not a slowdown. There is no fallback data connection over RMI.

grant-h commented 3 years ago

Both client and server are using an MTU of 1500, as confirmed by Wireshark.

> NAT-related issues would cause a complete failure, not a slowdown.

Okay, that is what I thought. Is there anything I can do to further debug this? Are there flags or environment variables I can use to learn more about the slowdown?

ghidra1 commented 3 years ago

In my case I used Wireshark and noted that fragmentation and re-transmits were occurring. Is there any sign of that?

grant-h commented 3 years ago

We are seeing a lot of TLS "Ignored Unknown Record" entries, though not all of the time. There are also a lot of TLS frames made up of ~530-byte TCP packets. I'm not sure whether this would be considered fragmentation, though.

ghidra1 commented 3 years ago

Are you seeing re-transmission of TCP segments? This can occur if packets are lost in transmission. Have you verified the configured MTU on both sides of the connection, or monitored the TCP window size negotiation? Be sure you are looking at the data connection and not the RMI connection (the default TCP port for the data connection is 13102).

In my case, MTU sizes were configured for jumbo frames, which induced fragmentation, and there was a bug in a router somewhere that was losing those fragments. By lowering the MTU on the route I avoided the fragmentation, and a smaller TCP window size was negotiated. The underlying network issue may never have been resolved, but I was able to work around it. It took a lot of tracking TCP segment IDs in a Wireshark trace to understand that the same TCP segments were getting re-transmitted, culminating in a hung connection.
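
To look for both symptoms in a capture, hedged example filters (shown with tshark; the same display filters work in the Wireshark GUI, and eth0 is a placeholder interface):

```sh
# Retransmitted segments on the Ghidra data-stream port (13102 by default).
tshark -i eth0 -f "tcp port 13102" -Y "tcp.analysis.retransmission"

# Fragmented IP datagrams, a symptom of MTU trouble.
tshark -i eth0 -Y "ip.flags.mf == 1 || ip.frag_offset > 0"
```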

grant-h commented 3 years ago

Just wanted to add some more numbers. On a checkout with 170071 total units, the maximum download speed of the checkout operation is around 100 KB/s, far below my maximum download rate. There is nothing odd happening in Wireshark, and my network is fast for everything except this checkout step.
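
As a rough interpretation (assuming the transfer is paced by request/response round trips rather than by bandwidth): 100 KB/s at a ~100 ms RTT works out to about 10 KB delivered per round trip, the signature of a latency-bound, stop-and-wait style exchange rather than a saturated TCP stream.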

devingDev commented 1 month ago

Hello, I'm not sure if this is the same issue, but I set up a server so we could work together on a project. It seems that every time I do a check-in and my collaborator does an "undo checkout" and then re-opens the ELF (which, I assume, checks it out again and pulls my changes), it takes them so long to finish that it's really unbearable, around 15 minutes. Doing the same on my side also takes too long in my opinion, although only around 20 seconds.

The server is in Germany and I'm also in Europe; they are in Japan, which is probably why it's so much slower for them, but this is still too slow.

So can we do anything to improve this, other than maybe finding a server in between us or similar? Or am I understanding/doing this wrong?

The eboot.bin.elf, a PS Vita ARMv7 binary, is about 20 MB, and the server says the project folder is about 750 MB.

ghidra1 commented 3 weeks ago

One would need to use a network sniffer to really understand the situation. In general, use of the Ghidra Server across high-latency or low-bandwidth networks should be avoided. As mentioned above, the Ghidra Server interface uses Java RMI, which involves many server interactions apart from the bulk data transfer. In addition, reverse DNS queries can introduce issues as well if they are not properly set up. Because of the amount of markup and additional information stored, the database can be significantly larger than the original binary.
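
If reverse DNS is a suspect, a quick sanity check (203.0.113.10 is a placeholder for the address being looked up; run it from both the client and the server):

```sh
# Reverse (PTR) lookup; a slow or timing-out answer here can stall
# every new connection while the lookup waits to fail.
dig -x 203.0.113.10
```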