DocNow / diffengine

track changes to the news, where news is anything with an RSS feed

staleness heuristic / performance #9

Open edsu opened 7 years ago

edsu commented 7 years ago

The longer diffengine runs, the more URLs it needs to check, and the longer it takes to make a full pass through them. The assumption I've been working with so far, based on watching news websites, is that the older a page gets the less likely its content is to change.

There is a method on the Entry object that calculates whether an entry is stale or not. It uses what I call a staleness ratio, s: if s is greater than or equal to a given value (currently .2) the entry is deemed stale (and gets checked again). I've thought about making this magic number configurable per feed. Here's how it works:

hotness = current time - entry creation time
staleness = current time - time last checked
s = staleness / hotness

stale? = s >= .2

So if an entry is 3 hours (10800 seconds) old and it was last checked 20 minutes (1200 seconds) ago, the calculation is:

1200 / 10800 = .11 (not stale)

Or if the entry is 3 hours old and it was last checked 1 hour ago:

3600 / 10800 = .33 (stale)
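In code the heuristic looks roughly like this (a simplified sketch of what's described above, not necessarily the exact method on Entry; the names and the .2 threshold just follow the prose):

from datetime import datetime

STALENESS_THRESHOLD = 0.2  # the magic number mentioned above

def is_stale(created, last_checked, now=None):
    # hotness: how old the entry is
    # staleness: how long since we last checked it
    now = now or datetime.utcnow()
    hotness = (now - created).total_seconds()
    staleness = (now - last_checked).total_seconds()
    if hotness <= 0:
        return True  # brand new entry, always check
    return (staleness / hotness) >= STALENESS_THRESHOLD

Plugging in the two examples above gives 1200 / 10800 = .11 (skip) and 3600 / 10800 = .33 (check).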

The idea is that entries get checked less often as they get older. The problem, which I haven't really verified yet, is that I think the ratio can still produce thresholds past which lots of checks suddenly become due, so periodically diffengine will spend a lot of time checking URLs as they cross over that threshold.

I was wondering if it might make sense to take a more probabilistic approach, where URLs are checked more often when they are new and less often as they get older, using some sort of probability sampling. For example, a brand new entry might be checked 80% of the time, while an entry that is, say, a month old is checked only 50% of the time; some kind of gradient like that. Or maybe it should also factor in the total number of entries that need to be checked and the desired time for a complete run?
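For illustration, that probabilistic version might look something like this (purely a sketch; the 80%/50% endpoints and the one-month horizon are just the numbers floated above, and the linear taper is one arbitrary choice of gradient):

import random
from datetime import datetime, timedelta

def should_check(created, now=None, p_new=0.8, p_old=0.5,
                 horizon=timedelta(days=30)):
    # probability of checking tapers linearly from p_new for a brand new
    # entry down to p_old once it is `horizon` old (and stays there)
    now = now or datetime.utcnow()
    t = min((now - created) / horizon, 1.0)
    return random.random() < p_new + (p_old - p_new) * t

The taper could just as easily be tuned by the total number of entries and a target time per run, as suggested above.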

It takes about a second to check an entry, and after running against the Washington Post, the Guardian, and Breitbart for a week I have 1531 URLs to check. With no backing off at all that would be about 25 minutes of runtime per pass, and it would only get worse. New entries would not be monitored closely enough, and the web servers being checked would be unduly burdened with tons of requests.

I suspect this problem may have been solved elsewhere before, so if you have ideas or pointers they would be appreciated!

gregjan commented 7 years ago

I don't have insight into the stats myself. Do these servers respond to HEAD requests accurately? That should be faster than a diff...
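For what it's worth, a conditional HEAD probe with requests would look something like this (just a sketch to test the idea; whether news CMSes send trustworthy ETag/Last-Modified headers is exactly the open question):

import requests

def maybe_changed(url, etag=None, last_modified=None):
    # returns (changed?, new ETag, new Last-Modified)
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.head(url, headers=headers, allow_redirects=True, timeout=10)
    changed = resp.status_code != 304
    return changed, resp.headers.get("ETag"), resp.headers.get("Last-Modified")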

ruebot commented 7 years ago

Some data points from my setup.

CBC - 42 feeds

2017-01-14 10:12:20,055 - root - INFO - shutting down: 0:51:27.529976
2017-01-14 13:32:35,568 - root - INFO - shutting down: 1:02:34.198437
2017-01-14 15:24:19,525 - root - INFO - shutting down: 0:54:17.493559
2017-01-14 17:09:52,855 - root - INFO - shutting down: 0:54:50.911007
2017-01-14 18:11:45,281 - root - INFO - shutting down: 0:56:43.139898
2017-01-14 19:09:45,958 - root - INFO - shutting down: 0:54:44.392014
2017-01-14 20:09:54,507 - root - INFO - shutting down: 0:54:53.252183
2017-01-14 21:11:51,047 - root - INFO - shutting down: 0:56:49.261002
2017-01-14 22:13:16,338 - root - INFO - shutting down: 0:58:15.006552
2017-01-14 23:11:30,542 - root - INFO - shutting down: 0:56:28.920192
2017-01-15 00:12:25,490 - root - INFO - shutting down: 0:57:23.657105
2017-01-15 01:11:45,595 - root - INFO - shutting down: 0:56:43.754114
2017-01-15 02:06:40,166 - root - INFO - shutting down: 0:51:38.215280
2017-01-15 03:05:54,218 - root - INFO - shutting down: 0:50:52.668085
2017-01-15 04:04:57,935 - root - INFO - shutting down: 0:49:56.339932
2017-01-15 05:05:25,381 - root - INFO - shutting down: 0:50:23.064285
2017-01-15 06:06:06,482 - root - INFO - shutting down: 0:51:04.715768
2017-01-15 07:06:25,907 - root - INFO - shutting down: 0:51:24.007759
2017-01-15 08:10:25,069 - root - INFO - shutting down: 0:55:22.783080
2017-01-15 09:15:26,977 - root - INFO - shutting down: 1:00:25.543582

The Globe & Mail - 174 feeds

2017-01-14 09:47:27,868 - root - INFO - shutting down: 0:17:25.983662
2017-01-14 10:17:42,696 - root - INFO - shutting down: 0:17:40.530126
2017-01-14 10:47:57,171 - root - INFO - shutting down: 0:17:55.090977
2017-01-14 11:17:06,091 - root - INFO - shutting down: 0:17:04.567858
2017-01-14 11:47:55,512 - root - INFO - shutting down: 0:17:54.131806
2017-01-14 12:18:14,028 - root - INFO - shutting down: 0:18:12.170724
2017-01-14 12:48:34,178 - root - INFO - shutting down: 0:18:32.800043
2017-01-14 13:17:40,891 - root - INFO - shutting down: 0:17:39.432581
2017-01-14 19:50:17,275 - root - INFO - shutting down: 5:20:15.156602
2017-01-14 23:24:06,302 - root - INFO - shutting down: 2:49:04.247076
2017-01-15 02:32:11,626 - root - INFO - shutting down: 2:57:09.492647
2017-01-15 05:16:41,393 - root - INFO - shutting down: 2:41:39.316417
2017-01-15 08:28:26,709 - root - INFO - shutting down: 2:53:25.176995

La Presse - 272 feeds

2017-01-14 21:05:31,423 - root - INFO - shutting down: 0:00:28.450089
2017-01-14 21:20:20,473 - root - INFO - shutting down: 0:00:19.141869
2017-01-14 23:25:41,218 - root - INFO - shutting down: 0:07:51.681414
2017-01-14 23:52:57,912 - root - INFO - shutting down: 0:08:00.091925
2017-01-15 02:22:17,085 - root - INFO - shutting down: 2:02:15.088546
2017-01-15 04:21:39,277 - root - INFO - shutting down: 1:01:37.524389
2017-01-15 06:21:21,182 - root - INFO - shutting down: 1:01:19.278733
2017-01-15 08:21:24,028 - root - INFO - shutting down: 1:01:21.583565

Toronto Sun - 61 feeds

2017-01-14 12:01:08,299 - root - INFO - shutting down: 0:01:06.367782
2017-01-14 12:04:57,990 - root - INFO - shutting down: 0:01:24.518744
2017-01-14 12:06:05,057 - root - INFO - shutting down: 0:00:30.213326
2017-01-14 14:32:02,813 - root - INFO - shutting down: 1:02:00.581537
2017-01-14 15:51:12,505 - root - INFO - shutting down: 0:41:10.570579
2017-01-14 16:49:51,841 - root - INFO - shutting down: 0:39:49.905569
2017-01-14 17:49:13,600 - root - INFO - shutting down: 0:39:11.447067
2017-01-14 18:50:00,379 - root - INFO - shutting down: 0:39:58.324235
2017-01-14 19:50:15,777 - root - INFO - shutting down: 0:40:14.511174
2017-01-14 20:52:23,072 - root - INFO - shutting down: 0:42:21.243300
2017-01-14 21:52:13,635 - root - INFO - shutting down: 0:42:11.479865
2017-01-14 22:50:59,846 - root - INFO - shutting down: 0:40:58.384031
2017-01-14 23:51:05,598 - root - INFO - shutting down: 0:41:03.421256
2017-01-15 00:51:21,725 - root - INFO - shutting down: 0:41:19.825083
2017-01-15 01:52:39,583 - root - INFO - shutting down: 0:42:37.540521
2017-01-15 02:50:35,141 - root - INFO - shutting down: 0:40:33.583783
2017-01-15 03:52:36,677 - root - INFO - shutting down: 0:42:35.075756
2017-01-15 04:51:29,327 - root - INFO - shutting down: 0:41:26.991347
2017-01-15 05:50:03,298 - root - INFO - shutting down: 0:40:01.526480
2017-01-15 06:52:04,578 - root - INFO - shutting down: 0:42:02.599009
2017-01-15 07:51:25,814 - root - INFO - shutting down: 0:41:23.516856
2017-01-15 08:52:46,569 - root - INFO - shutting down: 0:42:45.185334

Canadaland - 4 feeds

2017-01-14 21:24:59,903 - root - INFO - shutting down: 0:24:57.804095
2017-01-14 21:55:25,131 - root - INFO - shutting down: 0:25:22.970673
2017-01-14 22:24:50,952 - root - INFO - shutting down: 0:24:49.555501
2017-01-14 22:55:09,853 - root - INFO - shutting down: 0:25:08.089897
2017-01-14 23:25:11,017 - root - INFO - shutting down: 0:25:08.898734
2017-01-14 23:54:59,516 - root - INFO - shutting down: 0:24:57.475271
2017-01-15 00:25:12,466 - root - INFO - shutting down: 0:25:10.616999
2017-01-15 00:54:57,652 - root - INFO - shutting down: 0:24:55.844422
2017-01-15 01:25:52,581 - root - INFO - shutting down: 0:25:50.593156
2017-01-15 01:54:51,265 - root - INFO - shutting down: 0:24:49.347676
2017-01-15 02:25:35,087 - root - INFO - shutting down: 0:25:33.488282
2017-01-15 02:55:05,424 - root - INFO - shutting down: 0:25:03.990036
2017-01-15 03:24:45,634 - root - INFO - shutting down: 0:24:43.871369
2017-01-15 03:54:36,158 - root - INFO - shutting down: 0:24:34.183754
2017-01-15 04:24:56,073 - root - INFO - shutting down: 0:24:54.569207
2017-01-15 04:54:58,330 - root - INFO - shutting down: 0:24:56.910666
2017-01-15 05:24:24,081 - root - INFO - shutting down: 0:24:22.412581
2017-01-15 05:54:26,942 - root - INFO - shutting down: 0:24:25.517462
2017-01-15 06:24:55,516 - root - INFO - shutting down: 0:24:53.228180
2017-01-15 06:54:40,924 - root - INFO - shutting down: 0:24:39.055811
2017-01-15 07:25:02,262 - root - INFO - shutting down: 0:25:00.001297
2017-01-15 07:54:44,987 - root - INFO - shutting down: 0:24:43.380605
2017-01-15 08:24:53,357 - root - INFO - shutting down: 0:24:52.005523
2017-01-15 08:55:06,777 - root - INFO - shutting down: 0:25:04.727537
2017-01-15 09:24:25,455 - root - INFO - shutting down: 0:24:23.347182
edsu commented 7 years ago

@ruebot Wow, Toronto Sun has 61 feeds? Did you mean entries? I'm adding the number of entries checked during each run, and the number skipped, to the log output to help.

ruebot commented 7 years ago

@edsu yep. La Presse has 200+, and some 404... which is fun.

edsu commented 7 years ago

If you install v0.0.24 you will get additional information like this:

2017-01-16 13:59:36,353 - root - INFO - shutting down: new=2 checked=4 skipped=10 elapsed=0:00:01.073984
ruebot commented 7 years ago

updating now

ruebot commented 7 years ago

Oh, merge conflicts. I'll update #11.

edsu commented 7 years ago

I had a runtime error in there so you'll need to install a new version. FWIW I'm developing the code separately from the installed version (installed with pip).

ruebot commented 7 years ago

Updated stats:

edsu commented 7 years ago

Cool, it'll be interesting to watch these over time. What is your cron schedule like, and are you using the flock trick to prevent concurrent jobs (until diffengine does its own locking for you)?

ruebot commented 7 years ago

I had:

Now I'm giving this a shot again:

0,30 * * * * /usr/bin/flock -xn /tmp/canadaland.lock -c "/usr/local/bin/diffengine /home/nruest/.canadaland"
#0,30 * * * * /usr/local/bin/diffengine /home/nruest/.diffengine
0,30 * * * * /usr/bin/flock -xn /tmp/globemail.lock -c "/usr/local/bin/diffengine /home/nruest/.globemail"
0,30 * * * * /usr/bin/flock -xn /tmp/torontosun.lock -c "/usr/local/bin/diffengine /home/nruest/.torontosun"
0,30 * * * * /usr/bin/flock -xn /tmp/cbc.lock -c "/usr/local/bin/diffengine /home/nruest/.cbc"
0,30 * * * * /usr/bin/flock -xn /tmp/lapresse.lock -c "/usr/local/bin/diffengine /home/nruest/.lapresse"
edsu commented 7 years ago

@gregjan I guess feed/entry fetching could use HTTP headers to avoid getting the full representation. But I still wouldn't want to do thousands of HTTP HEAD requests every time it runs. So I think some kind of logic for backing off would still be desirable?

Currently I have a time.sleep embedded that prevents diffengine from doing more than 1 request per second. I wanted to do this to be nice to the server...and also not get blocked. So that's the main thing slowing things down at the moment.

If multiple sites are being monitored I guess diffengine could stripe the requests across them, rather than working on one feed at a time...
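A rough sketch of that striping idea (a hypothetical helper, not anything in diffengine today; it assumes entries have already been grouped per feed):

from itertools import chain, zip_longest

def striped(entries_by_feed):
    # interleave entries round-robin across feeds, e.g.
    # {"cbc": [a, b, c], "lapresse": [d, e]} -> a, d, b, e, c
    # so no single site gets a long unbroken burst of requests
    mixed = chain.from_iterable(zip_longest(*entries_by_feed.values()))
    return [entry for entry in mixed if entry is not None]

Combined with the existing 1 request/second sleep, that would spread the load across sites instead of hammering one at a time.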

edsu commented 7 years ago

@ruebot the nice thing about using flock is that you can schedule diffengine at shorter intervals, every 15 minutes or even every 5 minutes...
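For example, the Canadaland line above could be tightened to (same flock pattern, just a shorter cron interval):

*/15 * * * * /usr/bin/flock -xn /tmp/canadaland.lock -c "/usr/local/bin/diffengine /home/nruest/.canadaland"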

ryanfb commented 7 years ago

After thinking about this a bit, I wonder if a few simple changes might help:

  1. Make sure we iterate over the entries list in creation order (this might already be happening, but just looking at the code I'm not sure: https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py#L67). That way we always process the newest entries first, instead of potentially spending a bunch of time on old entries, which, if you're wrapping diffengine in a timeout, might starve new entries (see the sketch after this list).
  2. Increase the default threshold. I could see something in the range of .5 (day-old gets checked every half day, week-old gets checked every few days, month-old gets checked every couple weeks) to 1.0 (very simple to understand the age/checking relationship - we could also set an upper bound on the value of hotness to e.g. a month so that we're never waiting more than a month between checks).
  3. In conjunction with 2, reset the creation time if an entry is modified (or add a new modified time attribute and use that to calculate the hotness value). This would allow us to have a higher threshold and still catch the cases where e.g. an article is getting updated frequently over the course of a week, or an old article gets a flurry of updates. This could potentially result in some thrashing when a site makes a global change that causes all their entries to show up as modified after content normalization.
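To make 1 and 3 concrete, something like this (illustrative only; `modified` is a hypothetical new column that Entry doesn't have today, and the sorting assumes entries are already loaded in memory):

from datetime import datetime

def entries_newest_first(entries):
    # item 1: process the newest entries first, so a wrapping timeout
    # starves old entries rather than new ones
    return sorted(entries, key=lambda e: e.created, reverse=True)

def hotness_seconds(entry, now=None):
    # item 3: anchor hotness on the last observed change (falling back to
    # creation time), so an old article that starts changing heats up again
    now = now or datetime.utcnow()
    anchor = getattr(entry, "modified", None) or entry.created
    return (now - anchor).total_seconds()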