FaradayRF / Faraday-Software

Faraday node software
https://www.faradayrf.com

proxy.py is not optimized for CPU consumption #138

Closed ghost closed 7 years ago

ghost commented 7 years ago

Summary

When used with an embedded computer like a Raspberry Pi Zero wireless, the process consumes 95% of the CPU.

EDIT @kb1lqd: Created PR #157

Problem Explanation

The CPU load is 95% even though there is no radio traffic. The sleep() calls are generally 1 ms, which may be too short. I changed them to 100 ms, but the process was still consuming 85% of the CPU.

Environment

Ubuntu Workstation on home network, connected to a Raspberry Pi Zero that will be remote near the antenna at altitude.

Software

Latest commit: https://github.com/FaradayRF/Faraday-Software/commit/19cb72217e6b6ee8e04527f8d969dc05a96ea936

Hardware

Raspberry Pi Zero running Debian on ARMv6 with hardware floating point:

Linux 4.4.50+ #970 Mon Feb 20 19:12:50 GMT 2017 armv6l GNU/Linux; Python 2.7.9

Supporting Information

top output (screenshot)

reillyeon commented 7 years ago

It would be nice to switch to waiting for the serial port to become readable instead of polling it for data. Unfortunately PySerial doesn't seem to support this cross-platform.
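
For reference, a minimal sketch of what "waiting for readable" could look like with pyserial on POSIX (pyserial 3.x assumed; port name and baud rate are examples). It relies on fileno(), which is exactly the part that is not cross-platform:

import select
import serial

ser = serial.Serial('/dev/ttyUSB0', 115200, timeout=0)  # example port/baud

while True:
    # Sleep in the kernel until the port is readable; no CPU spinning.
    readable, _, _ = select.select([ser], [], [])
    if readable:
        data = ser.read(ser.in_waiting or 1)
        # ... hand `data` to the proxy's RX logic ...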

kb1lqc commented 7 years ago

Thanks for the report @ObjectToolworks! Yes, we need to address this at some point. I am happy you made this issue ticket so it's on our radar. Currently we are focusing on the computer-based applications and #135 to help catch errors better and improve our speed of development.

I like what @reillyeon suggests; polling is not the best way to go, but it was simple at the time. MVP, as they say!

I found this StackOverflow thread, which could be relevant, though maybe not cross-platform? @ObjectToolworks, this could be implemented in proxy.ini; however, line 340 of proxy.py needs to be updated to allow a "None" type through... Could be a good starting point?
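
A rough sketch of that idea (the section and key names below are hypothetical, not the actual proxy.ini schema): parse the configured serial timeout and pass a real None through to pyserial so read() blocks instead of polling.

import ConfigParser  # Python 2 stdlib, matching the Python 2.7.9 above

config = ConfigParser.SafeConfigParser()
config.read('proxy.ini')

raw = config.get('SERIAL', 'timeout')            # hypothetical key: "None" or "0.001"
timeout = None if raw == 'None' else float(raw)  # None means a fully blocking read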

kb1lqd commented 7 years ago

@ObjectToolworks @reillyeon @kb1lqc @jbemenheiser I've set up my RPi3 with Faraday and have APRS running (my unit is programmed with zeros for its GPS location...)! I've confirmed that it certainly uses a LOT of CPU on the RPi3, as shown in the htop screenshot below.

I now have both a general RPi server up again (after moving) and a setup that closely mimics @ObjectToolworks' concerns.

Note: I'm letting it just run, and I'm seeing a temperature of 72C with everything needed for APRS, using the vcgencmd measure_temp command. This is uncomfortably hot, haha, but in all fairness, just sitting and running a web browser it rests at ~62C in a ~20C room. This means we raise it only 10C or so.

NOTE: Fun fact, you can quickly check for any zombie processes using a serial port in Linux:

Proxy $ fuser /dev/ttyUSB0 
/dev/ttyUSB0:         4902 11316 11490

Then use kill to stop the processes:

sudo kill -9 <...>

The correct way is to cleanly stop the threads and share resources...

(htop screenshot)
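
A sketch of that "clean" alternative (names are illustrative): give each worker a threading.Event so it can exit and release the serial port itself instead of being killed.

import threading

stop_event = threading.Event()

def uart_worker(ser):
    while not stop_event.is_set():
        # ... service the serial port ...
        stop_event.wait(0.1)  # like sleep(0.1), but wakes instantly on stop
    ser.close()  # /dev/ttyUSB0 is freed cleanly; fuser shows nothing

# Shutdown: stop_event.set(); worker_thread.join()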

kb1lqc commented 7 years ago

@kb1lqd and @ObjectToolworks

BRB going to update FaradayRF.com to include turning RPis into "heaters" as a feature 😜

kb1lqd commented 7 years ago

@reillyeon What suggestions do you have for removing the polling functionality, i.e., basically all of the Faraday software's use of time.sleep(xxx) to "relax" the threads? Is there a Python interrupt/message/call of some sort we can use to announce that data is ready?

ghost commented 7 years ago

I was thinking it might entail a firmware upgrade. It needs a getQueueStatus() call, so the thread can wait forever and wake up only when there is data. The same goes for sending data, i.e., wait for buffer space before sending to the radio.

I'm thinking of a blocking read() where the thread just does a wait() until data is ready. But then, in the algorithm, check to see if there are enough bytes available, where getQueueStatus() would return the number of bytes waiting, a la FTDI, etc.

Steve

Here's some old Java code to describe what I mean:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.OutputStream;
import com.ftdi.*;
// Assumed source of CircularByteBuffer (not shown in the original snippet):
import com.Ostermiller.util.CircularByteBuffer;

public class Receive implements Runnable {

private Thread dataReceive;
private FTDevice jd;
private CircularByteBuffer cb;
private OutputStream bout;
private BufferedInputStream bis;

/*
 * USB Constructor
 */
public Receive(FTDevice j, CircularByteBuffer c) {
    this.jd = j;
    this.cb = c;
    this.bout = cb.getOutputStream();

    dataReceive = new Thread(this);
    dataReceive.setName("Receive USB");
    dataReceive.setPriority(Thread.NORM_PRIORITY);
    dataReceive.start();
}

/*
 * Thread to read the data from the USB/Network port and buffer it
 */

public void run() {
    byte[] dataByte;
    int available, val;

    while (true) {
        /*
         * Check for new data
         */

        try {
            while ((available = jd.getQueueStatus()) > 0) {
                dataByte = new byte[available];

                /*
                 * This will block until the USB device's FTDI
                 * setTimeouts setting expires.
                 */
                val = jd.read(dataByte, 0, available);

                /*
                 * Write the data to the buffer; val might be zero if
                 * read() timed out.
                 */

                if (val > 0) {
                    bout.write(dataByte, 0, val);
                }
            }

            Thread.sleep(0, 1);
        } catch (IOException | InterruptedException e2) {
        }
    }
}

}
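
A rough pyserial analogue of the Java pattern above (pyserial 3.x assumed; the out_stream parameter is an illustrative stand-in for the circular buffer): in_waiting plays the role of getQueueStatus(), and a read() with timeout=None gives the blocking wait.

import serial

def receive_worker(device, out_stream):
    ser = serial.Serial(device, 115200, timeout=None)  # timeout=None: block
    while True:
        data = ser.read(1)                # sleeps until data arrives (0% CPU)
        data += ser.read(ser.in_waiting)  # in_waiting ~ getQueueStatus()
        out_stream.write(data)            # e.g. a buffer or queue adapter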

kb1lqd commented 7 years ago

We use get_nowait() queue calls in most places and this may be contributing: http://stackoverflow.com/questions/24764431/efficient-python-raw-input-and-serial-port-polling
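
For comparison, queue.Queue already offers a blocking alternative to get_nowait(): the consumer parks inside get() until an item or a timeout arrives, instead of waking every millisecond (Python 2 module name used here, matching the environment above).

import Queue  # Python 2 stdlib module name

q = Queue.Queue()

try:
    item = q.get(block=True, timeout=1.0)
except Queue.Empty:
    item = None  # nothing arrived within a second; still no busy loop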

reillyeon commented 7 years ago

There are two options: use non-blocking I/O and properly wait for the ports to be ready to send/receive data, or use blocking I/O with many threads that feed into queues. I generally prefer the former because it avoids having to deal with multi-threaded programming, but I haven't found a good cross-platform Python library for doing this with serial ports. The latter is the strategy in the StackOverflow post above and is probably what we should do.

kb1lqd commented 7 years ago

Using Process Hacker, just running proxy on my laptop with a single unit shows an average of 15% CPU:

image

ghost commented 7 years ago

I was reading this page about asynchronous calls:

http://twistedmatrix.com/trac/

ghost commented 7 years ago

oops, I forgot the drill-down link...

http://twistedmatrix.com/documents/8.2.0/core/howto/async.html


reillyeon commented 7 years ago

Awesome! I did not know that Twisted had support for serial ports but apparently it does: http://twistedmatrix.com/documents/current/api/twisted.internet.serialport.SerialPort.html

This is great because there is a library for Twisted that is very similar to Flask but works with its asynchronous I/O model. Put the two together and our whole data path from hardware to network can be done without any nasty blocking or polling.
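
A minimal sketch of what that looks like with Twisted's serial support (per the API page above; device name and baud rate are examples): the reactor calls dataReceived() only when bytes actually arrive, so there is no polling loop at all.

from twisted.internet import reactor
from twisted.internet.protocol import Protocol
from twisted.internet.serialport import SerialPort

class FaradayUart(Protocol):
    def dataReceived(self, data):
        print('received %d bytes' % len(data))  # hand off to proxy logic here

SerialPort(FaradayUart(), '/dev/ttyUSB0', reactor, baudrate=115200)
reactor.run()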

kb1lqc commented 7 years ago

@reillyeon @kb1lqd @ObjectToolworks It appears that Twisted uses PySerial to accomplish this, per this page.

That said, it appears we'd have to move over from Flask to Twisted? I'm not sure this is a great idea, but I'm all ears. It appears to be a lot of work, especially if Twisted uses PySerial anyway. Why not consider Flask + Celery? This discussion is definitely turning over some stones. Thanks!

ghost commented 7 years ago

I agree that you have an investment in Flask, and you're not likely to till that into the soil and start a new crop.

"Celery is an asynchronous task queue."

There you go!

kb1lqd commented 7 years ago

OK so I made ALL the time.sleep(x) calls 100 ms in:

The CPU usage went to <1%

image

Returning only proxy.py back to 1 ms brought the CPU usage to ~5%

I believe the next steps forward here are to:

This is in no particular order. Doing proxy.py first, if possible, may be easier and quick, getting us ~50% of the way there and letting @ObjectToolworks run the program on his RPi Zero at well under 90%!

kb1lqc commented 7 years ago

Did you consider looking at Celery at all?

reillyeon commented 7 years ago

An asynchronous task queue like Celery is not the solution to the problem we have.

Twisted uses PySerial but it also properly implements asynchronous reads and writes on top of it using its own internal asynchronous I/O functions.

kb1lqd commented 7 years ago

Proxy Areas For Improvement

Baseline: Proxy iterates through port numbers 0-255, regardless of whether each is OPEN, to CHECK for new RX items.
Problem: Wasteful, as only ~3 ports are ever open at once.
Should Be: Save a list of currently open ports and only check those. Or have a high-level "1 or more ports have data" flag and only check all ports when it is set.

image

Baseline: If an RX item is available, get only the next item.
Problem: If more than one item is available, you wait until the NEXT cycle to get it.
Should Be: Get ALL available data items and compute through the logic.

image

Baseline: Proxy iterates through port numbers 0-255, regardless of whether each is OPEN, and executes a Try/Except statement.
Problem: Wasteful, as only ~3 ports are ever open at once.
Should Be: Save a list of currently open ports and only check those (might be weird if a port is to be opened automatically by TX'ing data); see the sketch after these items. Or have a high-level "1 or more ports have data" flag and only check all ports when it is set.

image
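
A sketch of the "open port list" idea from the items above (the structure is illustrative, not the actual proxy internals): iterate over the ports that actually have queues instead of blindly scanning range(0, 256).

import Queue

open_ports = {}  # port number -> Queue.Queue, created on first use

def check_rx(open_ports):
    for port, q in open_ports.items():
        while not q.empty():      # safe here: a single consumer drains it
            item = q.get_nowait()
            # ... route `item` for this `port` ...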

kb1lqd commented 7 years ago

Update: Updating proxy to check only ports 0-10 halved the CPU usage to ~6%

kb1lqd commented 7 years ago

So @reillyeon and @kb1lqc should I be looking into Twisted?

I think that, short term, we can limit a few things and get proxy working on computers with a >50% reduction in CPU usage with just a few edits... it might be worth it in the short run to buy us time to move to a better base.

reillyeon commented 7 years ago

Before we start trying to use new and unfamiliar libraries, we can improve things a lot by tightening up the existing code. I don't understand the multiple layers of queues being polled in the proxy, but it seems rather inefficient.

Based on my current understanding I think we should have one thread per serial port reading new messages from the hardware and placing them directly in the correct queues. The Flask endpoints can then read messages from those queues on demand. Sending messages can either block on the hardware in the main thread or be passed off to another set of per serial port threads. We should have no busy polling like we do today.
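
A sketch of that structure (names are illustrative): one blocking reader thread per serial port, feeding a queue that the Flask endpoints drain on demand.

import threading
import Queue
import serial

def port_reader(device, rx_queue):
    ser = serial.Serial(device, 115200, timeout=None)  # blocking reads
    while True:
        frame = ser.read(1)                # park here until data arrives
        frame += ser.read(ser.in_waiting)  # then take whatever queued up
        rx_queue.put(frame)                # a Flask endpoint calls rx_queue.get()

rx_queue = Queue.Queue()
t = threading.Thread(target=port_reader, args=('/dev/ttyUSB0', rx_queue))
t.daemon = True  # don't block interpreter exit
t.start()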

kb1lqd commented 7 years ago

@reillyeon @kb1lqc

I drew out a quick flow chart of the UART WORKER in proxy:

image

kb1lqc commented 7 years ago

@reillyeon and @kb1lqd I stand guilty of writing terribly inefficient code 😞 . It was a good idea at the time haha.

I like the idea of separate threads adding to the queues for each serial port. Currently, if you read the code, this is NOT how it works. It literally iterates through COM port fds in a dictionary. I am sorry 😆 but that too worked, and I moved on.

kb1lqd commented 7 years ago

@kb1lqc The sentence for this crime is one day of writing only in LISP. I hope you like parentheses.

So I believe my tasks are to first:

kb1lqd commented 7 years ago

@reillyeon and @kb1lqc I just committed a basic update to proxy that should start one thread per unit and get data. I made minimal changes to proxy and have NOT checked whether Flask is working correctly:

https://github.com/FaradayRF/Faraday-Software/commit/01c0ff9ad70147c4bcd7d7123a514f0d6bae5752

Data is coming in from both threads and printing a message to screen from the thread: image

EDIT:

Looks like the Flask interface works:

image

EDIT 2:

Note that proxy still fills data into the global queues of getDict:

image

kb1lqc commented 7 years ago

Quickly looking over your code, there appears to be only one thread...?

https://github.com/FaradayRF/Faraday-Software/commit/01c0ff9ad70147c4bcd7d7123a514f0d6bae5752#diff-ca2e71062d38d7881f1e58b072fc46edR473

Where do you spawn multiple threads @kb1lqd?

kb1lqd commented 7 years ago

@kb1lqc I start them right here: https://github.com/FaradayRF/Faraday-Software/commit/01c0ff9ad70147c4bcd7d7123a514f0d6bae5752#diff-ca2e71062d38d7881f1e58b072fc46edR469

ghost commented 7 years ago

Didn't see much change in Linux on the RPi-0 Wireless.

Tried it with 'nice'.

Just to compare, this is just background processes:

This CPU might just be a dog.

73/steve

kb1lqd commented 7 years ago

Steve, I'm not sure what you tried, but if it was my recent commit that created multiple threads, then yeah, I didn't do any of the CPU-saving items yet! I found the knobs we can turn a bit, but mostly set the program up for modifications. I'll let you know when you should expect improvements, hopefully in the next couple of days.

Yeah, the Zero might be a bit of a push anyway, but it should work.


ghost commented 7 years ago

No problem, sorry to provide non-useful feedback. I don't know where the images went :-)


kb1lqd commented 7 years ago

OK, so proxy by default makes literally 256 queues and one-by-one checks each for new items...

image

image

I think this was intended to allow arbitrary data from RF, or else to open ports as needed. Ports are opened simply when the check for them fails because they aren't present. I think a port should only be created when data is .put() into it and it doesn't already exist.

Layer_4_Service.py

image
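
A sketch of that lazy-creation idea (illustrative helpers, not the actual proxy code): create a port's queue only on the first .put(), and let reads of absent ports return None.

import Queue

port_queues = {}  # replaces pre-building a queue for every port 0-255

def put_item(port, item):
    # The queue is created only when data actually targets this port.
    port_queues.setdefault(port, Queue.Queue()).put(item)

def get_item(port):
    q = port_queues.get(port)
    if q is None or q.empty():
        return None  # port never opened, or nothing waiting
    return q.get_nowait()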

kb1lqc commented 7 years ago

@kb1lqd All the queues were made for simplicity. The goal was to get a simple proxy program implemented, and doing this avoided having to figure out the logic for opening new queues mid-program.

kb1lqd commented 7 years ago

I updated the code to print the main dictionary of port queues, while also updating it to open a queue only when data is put into a port whose queue doesn't exist. This actually already happened, and I simply commented out the opening of queues after a failed .empty() check.

Commenting out queue creation when the .empty() check fails, and instead just returning None:

image

Printing the dictionary of queues on each cycle clearly shows the queue being created when the first telemetry packet arrives:

image

With commit: fd4bb290f4a51ccb719f8fef0dd69e898f0ef292

I can now just list all open ports and ONLY check those ports. Ports are opened by RX'ing or TX'ing data.

It looks like a decent CPU usage reduction occurred.

image

kb1lqd commented 7 years ago

@ObjectToolworks Try commit https://github.com/FaradayRF/Faraday-Software/commit/fd4bb290f4a51ccb719f8fef0dd69e898f0ef292 and see if you notice a reduction in CPU usage!

kb1lqd commented 7 years ago

Playing with the receive code, it is apparent that when new data is available, the loop gets one item at a time and waits for the next loop to iterate through the remaining items. This is inefficient: whether the cycle is slowed for CPU reduction OR data is coming in fast, proxy should get ALL available data (or up to a per-cycle limit) to best utilize each cycle.

I slowed proxy down to ~5 seconds per cycle and watched the telemetry pile up.

image

kb1lqd commented 7 years ago

I updated the code to get everything known to be in the queue (see the sketch below). With a 5-second delay, this meant that at most 2 packets were ever in the queue when it was next checked, rather than the constant build-up of queue items shown before:

image
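
A sketch of the drain-per-cycle pattern (names illustrative): grab everything available, optionally capped, rather than one item per loop.

import Queue

def drain(q, max_items=100):
    items = []
    for _ in range(max_items):  # cap so one busy port can't starve the loop
        try:
            items.append(q.get_nowait())
        except Queue.Empty:
            break
    return items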

kb1lqd commented 7 years ago

OK, I updated the Layer 2 and Layer 4 UART proxy code not only to check for new data but also to know the estimated amount of data in the TX/RX queues and get ALL of it in a single cycle. This allows fast "staccato" throughput that is decoupled from the loop timing. Simply put, data is buffered, and on a slower cadence it is completely processed in one quick pass. This helps move proxy updates forward by decoupling them from the need for fast loop cycles.

CPU usage on my computer is now ~3%!

@ObjectToolworks Try b9d290e412245a177b2a2a514b5899cff6536380 on your Rpi!

kb1lqd commented 7 years ago

@ObjectToolworks @kb1lqc @jbemenheiser @reillyeon @jsr38 I am running proxy, telemetry, and APRS with the issue138 branch updates on my Raspberry Pi 3, and it is running at a much lower 14%! The processor temperature is also a much cooler 45C.

image

image

Confirmed the branch is working with both local and RF units: image

kb1lqd commented 7 years ago

@kb1lqc @reillyeon I'm debating just bringing this issue to a close if this is "good enough" for now, especially pending @ObjectToolworks' RPi Zero results (there may be diminishing returns, though). It is already substantially better, and I've added and described updates that make proxy more efficient and faster at obtaining/sending data.

This ticket is a bit too vague to reach an endpoint, so I'd like to limit the scope and move on if possible. This keeps the merge small and reaches a reasonable goal. We can revisit later if needed for throughput, but I'd like to get back to working on data transfer, which will ultimately let us test Faraday/proxy for data throughput.

I can now just let proxy run on my RPi3 without trouble! I'd like to watch for long-term run issues like APRS disconnects or errors.

ghost commented 7 years ago

I tested the new changes and it was reading about 30 to 32%, which is way better than 99%. The time was mostly user time, with system time running about 8.4% and 64% idle. For the purpose of remoting the proxy, this is quite acceptable. After all, the Pi will only be used to run the proxy so the workstation can connect to it.

I did not test the Proxy with an app yet.

Actually, I don't even like the wireless model, as I need power up at the antenna, so I will probably use a different Pi so I can get wired Ethernet with PoE.

Thanks for your work on this!

kb1lqc commented 7 years ago

Glad to see this helping @ObjectToolworks; hopefully you can play around with it and experiment with setting up a base station! Please keep us informed, this is fun stuff!

@kb1lqd Yes, you can close this ticket if you want, but please open a new ticket investigating a move of proxy over to Twisted, and link it to this issue (#138) so this discussion isn't lost and we look at it in the future.

kb1lqd commented 7 years ago

Updates merged in #157