fennb / phirehose

PHP interface to Twitter Streaming API
709 stars 189 forks source link

Script runs fine, then stops anything, high CPU #84

Closed SBrownAU closed 9 years ago

SBrownAU commented 9 years ago

Version: 1.0RC PHP: 5.5.22

Issue: Script will be happily running fine for a period of days, and then, will appear to stop doing anything (but continues to still be "running"). I can tell this because output to log files stop + db activity.

How script is started:

$ php -l /www/script >> script.log 2>&1

ps output

user    11173 48.9  1.8 353568 37212 ?        R    May17 1955:24 php /www/script -l

lsof output

$ lsof -p 1234
php     11173 user    0w   CHR                1,3      0t0     5121 /dev/null
php     11173 user    1w   REG              202,0   509449  3539034 /www/logs/script.log
php     11173 user    2w   REG              202,0   509449  3539034 /www/logs/script.log
php     11173 user    3u  IPv6           57489117      0t0      TCP localhost:48792->localhost:memcache (ESTABLISHED)
php     11173 user    4u  unix 0xffff880077c6b800      0t0 57511594 socket
php     11173 user    5u  IPv4           58305730      0t0      TCP my.server:1234->r-199-59-148-138.twttr.com:https (ESTABLISHED)

strace output

$ strace -p 1234
read(5, 0x2566554, 2737)                = -1 EAGAIN (Resource temporarily unavailable)
(continually...)

Not really sure what the next steps I can take are. Annoying having to periodically kill script and re-start, so must be something I'm missing/bug. From my limited server skills, it looks to be continually reading from the TCP connection to Twitter, but not doing anything else?

DarrenCook commented 9 years ago

Version: 1.0RC PHP: 5.5.22

Regarding high CPU, could it be the same problem (a PHP bug that will be fixed in php 5.6.9) described here: https://github.com/fennb/phirehose/issues/83

However a comment on bug 69402 says they saw it happen on 5.5.23, but not on 5.5.22.

Darren

SBrownAU commented 9 years ago

Yeah, I'd looked at that issue/bug report, does sound similar I agree, but versions don't match up. Although the offending line of code looks like a candidate for the strace output.

fennb commented 9 years ago

Hey @SBrownAU, this is probably the best bug report I've seen, thank you :)

Sadly, I don't have much immediate insight as to the probable cause, other than it's (unfortunately) more likely to be system (ie: PHP/openssl) related than Phirehose.

I have seen similar behaviours caused by PHP being compiled against specific versions of openssl, for the HTTPS capabilities required to connect to the twitter stream.

To prove one way or another, if you can run your script on an entirely different platform/PHP version and not get the same issue, then sadly it's time to dig into PHP version (and related) bugs.

SBrownAU commented 9 years ago

Thanks, I know how hard it is to diagnose issues without useful information up front ;)

Agree'd though, likely a PHP related issue and not the code itself. Closing issue, will re-visit if I find solution/cause. For now, I'll implement a work around to periodically check/kill/restart or similar, thanks.

vishymuku commented 9 years ago

I also think this is to do with the bug in PHP, but if there's a way to tackle this bug, it'll be of great help to all. So am posting my findings using strace.

This is strace -c -p output after attaching the process for a minute.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    1.432918           4    392333    392266 read
  0.00    0.000000           0         1           write
  0.00    0.000000           0        37           poll
  0.00    0.000000           0        11           select
  0.00    0.000000           0         4           sendto
  0.00    0.000000           0         4           recvfrom
  0.00    0.000000           0        20           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    1.432918                392410    392266 total

I think the high CPU usage is because of the huge number of read errors.

I'm not an expert, so pardon me if my question is naive. Is there a way to backoff when this error occurs?

ENV: PHP 5.6.8, Amazon Linux EC2 t2.micro instance. The process is currently running at ~45% cpu usage.

fennb commented 9 years ago

It's a great question - Basically it looks like some part of the code gets stuck in a loop calling read() on the open file handle, but (as above) is constantly returning an error:

-1 EAGAIN (Resource temporarily unavailable)

Depending on where in the code this is happening, it may be possible to detect this at the PHP level and recover.

Based on a very quick google, and reading this: http://stackoverflow.com/questions/14370489/what-can-cause-a-resource-temporarily-unavailable-on-sock-send-command it appears the EAGAIN error is only returned on sockets that are set up for non-blocking reads.

On reviewing the code, I realised there seems to be some confusion as to whether we use blocking or non-blocking reads: https://github.com/fennb/phirehose/blob/master/lib/Phirehose.php#L417 - The socket is set to non-blocking, hence the use of stream_select (http://php.net/manual/en/function.stream-select.php - note warning on CPU usage), however, some of the subsequent fgets() calls seems to presume regular/blocking reads potentially.

Unfortunately, a lot of this code was contributed over time by others (thank you!) so I'm not 100% familiar with all of it, but it's definitely possible there's a bug lurking in there somewhere.

It'd be very handy to see where in the code the issues are happening (ie: what loop it's in during the high-cpu situations) but that can be somewhat difficult to produce.

vishymuku commented 9 years ago

Thanks for the detailed response @fennb. Your inputs and a few links (listed at the end) have helped me find a probable working solution.

I added this line to make the blocking true for the stream:

 stream_set_blocking($this->conn, TRUE);

Now CPU usage is down to <0.4%!

I tested it and I am able to get the tweets into database. But I don't know what the downside of blocking will be. As of now, I'll go ahead with this.

Here are the links which helped me: an old bug(with funny comments at the end): https://bugs.php.net/bug.php?id=40566 from php manual: http://php.net/manual/en/function.stream-set-blocking.php

DarrenCook commented 9 years ago

From memory, the problem with setting blocking on is that you won't get heartbeat() and statusUpdate() called reliably. Perhaps more importantly, if the stream goes quiet there will never be a chance to detect this and call reconnect(). (I'm not sure on that, and don't have the time to investigate further, but I'd be cautious doing a patch unless testing in both lightly loaded and heavily loaded situations; especially if this is just to work around a bug that PHP core has already fixed and is about to release.)

fennb commented 9 years ago

@DarrenCook yep, this sounds right to me (and is the point of non-blocking and stream_select() from memory).

fennb commented 9 years ago

Mind you, if the timeout still works on stream_select(), then it's actually probably OK if it's a blocking read - you'd need to test to be sure one way or another.

It appears there may be several bugs around using stream_select() combined with encrypted streams: http://php.net/manual/en/function.stream-select.php#115400

vishymuku commented 9 years ago

@DarrenCook , @fennb Thanks for the inputs. PHP 5.6.9 is released. To be safe, I'll upgrade to this version and remove blocking. I'll post the findings. Hope everything goes well.