MAPC / metrofuture-api

JSON API for the MAPC Project Database

App Crashes or is slow to load #1

Open allthesignals opened 8 years ago

allthesignals commented 8 years ago

The API seems to crash at random, forcing us to rebuild. Here are the logs before a rebuild:

16:03:18 web.1  | started with pid 8
16:03:21 web.1  | => Booting Puma
16:03:21 web.1  | => Rails 4.1.0 application starting in production on http://0.0.0.0
16:03:21 web.1  | => Run `rails server -h` for more startup options
16:03:21 web.1  | => Notice: server is listening on all interfaces (0.0.0.0). Consider using 127.0.0.1 (--binding option)
16:03:21 web.1  | => Ctrl-C to shutdown server
16:03:22 web.1  | Puma 2.11.3 starting...
16:03:22 web.1  | * Min threads: 0, max threads: 16
16:03:22 web.1  | * Environment: production
16:03:22 web.1  | * Listening on tcp://0.0.0.0
16:45:27 web.1  | terminated by SIGKILL
16:45:27 system | sending SIGTERM to all processes

We should look into performance monitoring options and investigate further...

allthesignals commented 7 years ago

Failure state: when loading Metrofuture in Action, the page takes a long time to load. Sometimes it crashes intermittently, causing the loader to spin continuously and the map to fail to display.

allthesignals commented 7 years ago

Deploying rack-mini-profiler.
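
For reference, wiring it up is roughly the following (a minimal sketch; the gem itself is real, but the ENV flag and the before_action gating are assumptions about how we might expose it in production):

# Gemfile
gem 'rack-mini-profiler'

# app/controllers/application_controller.rb (sketch)
# In production the profiler only renders for requests that have been explicitly
# authorized, so gate it behind a hypothetical ENABLE_PROFILER flag:
before_action do
  Rack::MiniProfiler.authorize_request if ENV['ENABLE_PROFILER'].present?
end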

allthesignals commented 7 years ago

Assessing #4 and #3 using rack-mini-profiler.

mzagaja commented 7 years ago

It appears that the slowness is related to ActiveRecord. If I make an API call with curl 'http://api.projects.metrofuture.org/projects?fields%5Bprojects%5D=title%2Cdescription%2Cstatus%2Cgeography%2Cprimary-department%2Cwebsite%2Cimage&page%5Bnumber%5D=1&page%5Bsize%5D=9' I get the following from Heroku:

2017-01-05T16:15:30.151270+00:00 app[web.1]: Processing by API::V1::ProjectsController#index as */*
2017-01-05T16:15:30.151311+00:00 app[web.1]:   Parameters: {"fields"=>{"projects"=>"title,description,status,geography,primary-department,website,image"}, "page"=>{"number"=>"1", "size"=>"9"}}
2017-01-05T16:15:31.350937+00:00 app[web.1]: CONNECTION RETRY: ActiveRecord::ConnectionAdapters::SQLServerAdapter retry #0.
2017-01-05T16:15:33.149478+00:00 heroku[web.1]: Process running mem=647M(125.4%)
2017-01-05T16:15:33.149513+00:00 heroku[web.1]: Error R14 (Memory quota exceeded)
2017-01-05T16:15:36.565611+00:00 heroku[router]: at=info method=GET path="/projects?fields%5Bprojects%5D=title%2Cdescription%2Cstatus%2Cgeography%2Cprimary-department%2Cwebsite%2Cimage&page%5Bnumber%5D=1&page%5Bsize%5D=9" host=api.projects.metrofuture.org request_id=65bf83ef-846b-441d-ac04-a8f78fd88a97 fwd="104.207.192.42" dyno=web.1 connect=1ms service=6421ms status=200 bytes=10919
2017-01-05T16:15:36.558348+00:00 app[web.1]: Completed 200 OK in 6407ms (Views: 2.4ms | ActiveRecord: 6141.4ms)

So it takes approximately six seconds for ActiveRecord to respond on Heroku.

On a local machine it still seems slow but not as bad: curl 'http://api.lvh.me:3000/projects?fields%5Bprojects%5D=title%2Cdescription%2Cstatus%2Cgeography%2Cprimary-department%2Cwebsite%2Cimage&page%5Bnumber%5D=1&page%5Bsize%5D=9' yields:

Image Load (0.9ms)  EXEC sp_executesql N'SELECT TOP (1) [AnnotationBase].* FROM [AnnotationBase] WHERE [AnnotationBase].[IsDocument] = 1 AND [AnnotationBase].[ObjectTypeCode] = 10056 AND [AnnotationBase].[ObjectId] = @0 ORDER BY ModifiedOn DESC', N'@0 uniqueidentifier', @0 = '2134D493-77B6-E611-B21B-766CC8970D42'  [["ObjectId", "2134D493-77B6-E611-B21B-766CC8970D42"]]
Completed 200 OK in 792ms (Views: 2.7ms | ActiveRecord: 481.5ms)

So @allthesignals was right about the local versus Heroku differential.

Looking at Scout suggests that it is the image query/load which ends up being slow:

(screenshot: screen shot 2017-01-05 at 11 27 36 am)

I'm going to try upping the dyno to a 2x standard dyno to give us more RAM and less swap and see if it benchmarks differently.

mzagaja commented 7 years ago

Upping to a 2x dyno speeds up our response by about 38%, which is quite good but still not as much as we'd like:

2017-01-05T16:37:18.545140+00:00 app[web.1]: Processing by API::V1::ProjectsController#index as */*
2017-01-05T16:37:18.545208+00:00 app[web.1]:   Parameters: {"fields"=>{"projects"=>"title,description,status,geography,primary-department,website,image"}, "page"=>{"number"=>"1", "size"=>"9"}}
2017-01-05T16:37:22.508739+00:00 heroku[router]: at=info method=GET path="/projects?fields%5Bprojects%5D=title%2Cdescription%2Cstatus%2Cgeography%2Cprimary-department%2Cwebsite%2Cimage&page%5Bnumber%5D=1&page%5Bsize%5D=9" host=api.projects.metrofuture.org request_id=babce10e-c5bc-4e59-8b3c-f522f9b56ce8 fwd="104.207.192.42" dyno=web.1 connect=0ms service=3975ms status=200 bytes=10884
2017-01-05T16:37:22.492040+00:00 app[web.1]: Completed 200 OK in 3947ms (Views: 10.0ms | ActiveRecord: 3356.5ms)

A page load at projects.mapc.org shows how slow these images (thankfully lazily loaded) are to serve on a 2x dyno:

2017-01-05T16:43:11.610257+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (0.9ms)
2017-01-05T16:43:11.610418+00:00 app[web.1]: Completed 200 OK in 3020ms (Views: 0.7ms | ActiveRecord: 2955.2ms)
2017-01-05T16:43:11.638809+00:00 app[web.1]:   Rendered text template (0.0ms)
2017-01-05T16:43:11.638978+00:00 app[web.1]: Sent data img-IMG_8067.JPG (0.9ms)
2017-01-05T16:43:11.639248+00:00 app[web.1]: Completed 200 OK in 3041ms (Views: 0.7ms | ActiveRecord: 2994.4ms)

Weirdly a fresh hobby dyno sometimes yields slower image transactions in ActiveRecord:

2017-01-05T16:53:46.946234+00:00 app[web.1]:   Rendered text template (0.1ms)
2017-01-05T16:53:46.947061+00:00 app[web.1]: Sent data img-DanversSquareMixedUse.jpg (2.8ms)
2017-01-05T16:53:46.947316+00:00 app[web.1]: Completed 200 OK in 4408ms (Views: 1.9ms | ActiveRecord: 3888.0ms)
2017-01-05T16:53:46.985442+00:00 app[web.1]:   Rendered text template (0.0ms)
2017-01-05T16:53:46.985712+00:00 app[web.1]: Sent data img-littleton_his_soc.jpg (1.3ms)
2017-01-05T16:53:46.985913+00:00 app[web.1]: Completed 200 OK in 4272ms (Views: 1.2ms | ActiveRecord: 3536.7ms)
2017-01-05T16:53:47.332404+00:00 heroku[router]: at=info method=GET path="/images/img-IMG_8067.JPG?style=small" host=api.projects.metrofuture.org request_id=feaee26b-9075-455b-8971-d4f8a03fb937 fwd="104.207.192.42" dyno=web.1 connect=1ms service=7322ms status=200 bytes=67578
2017-01-05T16:53:47.321831+00:00 app[web.1]:   Rendered text template (0.0ms)
2017-01-05T16:53:47.322096+00:00 app[web.1]: Sent data img-IMG_8067.JPG (0.8ms)
2017-01-05T16:53:47.322404+00:00 app[web.1]: Completed 200 OK in 1588ms (Views: 0.6ms | ActiveRecord: 1467.2ms)
2017-01-05T16:53:48.334932+00:00 heroku[router]: at=info method=GET path="/images/img-Norwood%20Walking%20Map%20V1.jpg?style=small" host=api.projects.metrofuture.org request_id=19f9f101-90af-40c5-882c-7d96e702b863 fwd="104.207.192.42" dyno=web.1 connect=1ms service=8302ms status=200 bytes=75172
2017-01-05T16:53:48.284765+00:00 app[web.1]:   Rendered text template (0.0ms)
2017-01-05T16:53:48.284908+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (0.7ms)
2017-01-05T16:53:48.285095+00:00 app[web.1]: Completed 200 OK in 2063ms (Views: 0.6ms | ActiveRecord: 1929.0ms)
2017-01-05T16:53:59.491584+00:00 app[web.1]: Started GET "/projects?fields%5Bprojects%5D=title%2Cdescription%2Cstatus%2Cgeography%2Cprimary-department%2Cwebsite%2Cimage&page%5Bnumber%5D=2&page%5Bsize%5D=9" for 104.207.192.42 at 2017-01-05 16:53:59 +0000
2017-01-05T16:53:59.495054+00:00 app[web.1]: Processing by API::V1::ProjectsController#index as HTML
2017-01-05T16:53:59.495128+00:00 app[web.1]:   Parameters: {"fields"=>{"projects"=>"title,description,status,geography,primary-department,website,image"}, "page"=>{"number"=>"2", "size"=>"9"}}
2017-01-05T16:54:01.005013+00:00 app[web.1]: Completed 200 OK in 1510ms (Views: 30.7ms | ActiveRecord: 812.1ms)

This could be a function of the fact that the performance dyno had some queries run against it.

2017-01-05T16:59:47.477565+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (6.3ms)
2017-01-05T16:59:47.478897+00:00 app[web.1]: Completed 200 OK in 2066ms (Views: 4.5ms | ActiveRecord: 1939.9ms)

Running it a third time shows serious decay to the ActiveRecord performance:

2017-01-05T17:04:00.896808+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (21.6ms)
2017-01-05T17:04:00.896994+00:00 app[web.1]: Completed 200 OK in 5249ms (Views: 21.4ms | ActiveRecord: 4994.9ms)

A fourth run gives me slightly better but not great performance:

2017-01-05T17:06:45.588371+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (1.0ms)
2017-01-05T17:06:45.588697+00:00 app[web.1]: Completed 200 OK in 3455ms (Views: 0.8ms | ActiveRecord: 3387.2ms)

At that point I start getting R14 memory quota exceeded errors.

Compare to three runs on the performance dyno:

2017-01-05T17:09:00.632656+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (0.8ms)
2017-01-05T17:09:00.632807+00:00 app[web.1]: Completed 200 OK in 1230ms (Views: 0.6ms | ActiveRecord: 1150.1ms)
2017-01-05T17:10:03.189673+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (0.7ms)
2017-01-05T17:10:03.190103+00:00 app[web.1]: Completed 200 OK in 1755ms (Views: 0.5ms | ActiveRecord: 1721.6ms)
2017-01-05T17:10:47.725144+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (0.7ms)
2017-01-05T17:10:47.725146+00:00 app[web.1]: Completed 200 OK in 2860ms (Views: 0.6ms | ActiveRecord: 2815.9ms)

Then a fourth run shows improved performance:

2017-01-05T17:15:57.620712+00:00 app[web.1]: Sent data img-Norwood Walking Map V1.jpg (1.5ms)
2017-01-05T17:15:57.620994+00:00 app[web.1]: Completed 200 OK in 1334ms (Views: 0.9ms | ActiveRecord: 1227.0ms)

So it's more consistent with the better dyno but not perfect. I'm going to investigate the possibility of adding some kind of caching to these images to put this to bed.
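
One option (a rough sketch only: ImagesController, the Image model call, and the 12-hour TTL are placeholders, not our actual code) would be to wrap the image lookup in Rails' low-level cache so repeat requests skip the slow ActiveRecord query entirely:

# app/controllers/images_controller.rb (sketch)
class ImagesController < ApplicationController
  def show
    # Cache the raw bytes keyed by id + style so repeat hits skip the slow
    # AnnotationBase lookup; the 12-hour expiry is an arbitrary placeholder.
    data = Rails.cache.fetch(['image', params[:id], params[:style]], expires_in: 12.hours) do
      Image.find(params[:id]).file_contents # hypothetical model and method
    end
    send_data data, type: 'image/jpeg', disposition: 'inline'
  end
end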

mzagaja commented 7 years ago

If we want to throw more hardware at this, spinning up an AWS EC2 instance might be the more affordable path forward:

Instance    vCPU  ECU       Memory (GiB)  Storage   Price
t2.small    1     Variable  2             EBS Only  $0.023 per Hour
t2.medium   2     Variable  4             EBS Only  $0.047 per Hour

We can get 4GB of RAM (as opposed to 1GB at Heroku) and pay $33.84 a month instead of $50/month.
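
(That monthly figure is just the hourly rate over a month of uptime: $0.047/hour × 720 hours = $33.84.)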

I will look into caching options this afternoon but it may be worth considering putting this on an EC2.

mzagaja commented 7 years ago

At the end of the day yesterday I settled on potentially using Cloudflare as a CDN to solve this issue. I think it can work because much of our data is not actually dynamic but static. Heroku has information on HTTP caching at https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers and the folks at Medium also recommend this approach. If this works it should be the easiest path to resolution and negate the need to resolve the long ActiveRecord times at Heroku.
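
On the Rails side the main change for HTTP caching is emitting Cache-Control headers that Cloudflare will respect. A minimal sketch (the controller name comes from the logs above, but the before_action and the one-hour TTL are placeholders, not a final implementation):

# app/controllers/api/v1/projects_controller.rb (sketch)
class API::V1::ProjectsController < ApplicationController
  # Sends "Cache-Control: public, max-age=3600" so Cloudflare and browsers
  # can cache the JSON response; the TTL is an arbitrary placeholder.
  before_action only: [:index] do
    expires_in 1.hour, public: true
  end
end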

mzagaja commented 7 years ago

I have been learning a lot about caching while trying to address this issue. The first thing I learned is that Rails will not serve HTTP cache headers in development mode, but will serve them in production. This means you cannot rely on development header output to debug cache changes.
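
The flag involved appears to be the per-environment caching setting; the Rails defaults look like this (our actual config may differ):

# config/environments/development.rb (Rails default)
config.action_controller.perform_caching = false

# config/environments/production.rb (Rails default)
config.action_controller.perform_caching = true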

The second thing I learned is that we are using a custom fork of the 0.6.1 version of the JSONAPI-Resources gem. This means we are unable to upgrade to newer versions with integrated caching features or take advantage of bug fixes. At some point the functionality in this custom fork needs to be factored out into our app, since it has not been accepted upstream.
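
For context, a pinned fork in a Gemfile typically looks something like the following; the URL and ref here are placeholders, not our actual fork:

gem 'jsonapi-resources', git: 'https://github.com/example/jsonapi-resources.git', ref: 'abc1234'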

I have begun the setup of Cloudflare and will be able to test it once we have moved over our name servers.

mzagaja commented 7 years ago

Cloudflare has been enabled with the help of Ty and we are waiting for it to propagate for testing. It could take up to 24 hours.

mzagaja commented 7 years ago

Cloudflare appears to have mitigated some of this issue around the images. I am waiting for a response from Cloudflare support about configuring page rules to cache the JSON responses in addition to the image files. If Cloudflare is able to cache these other responses it should further mitigate the speed issues, though users will still encounter slowness when they load things that are not yet in the Cloudflare cache. Once the Cloudflare cache is warmed we should be able to reboot the Heroku dyno, and it should not be under so much pressure on the RAM side. Going forward it may behoove us to consider upping the Heroku instance and Redis caches to avoid memory issues.

The longer term fix here is to spend a day or so standing up Metrofuture API on a server that is physically near the Microsoft SQL database and also that has more room for the Redis cache and Rails instance.

mzagaja commented 7 years ago

Cloudflare responded with the following note:

Hi,

You need to add an 'Edge Cache TTL' to your page rule, so you can define how long we will cache for.

Damian
Support Engineer | Cloudflare

Adding this setting fixed things and caching now works from Cloudflare. However, it is not working inside MAPC because our internal DNS server has not been updated. This hasn't resolved the crashing issue, which we will likely need to fix by standing this up on an instance local to the DB.