JessyBarrette opened this issue 7 months ago
@fostermh @tayden @n-a-t-e Anybody have thoughts or opinions here?
I don't use or pay much attention to Goose, so hopefully one of you is familiar with this.
Otherwise, if that is not the case, I have two ideas I can implement:
@JessyBarrette You are welcome to move the Seaspan Royal code to CapRover or I can find a better fit for it.
According to the AWS monitoring, the CPU was pinned for about an hour, one hour ago. Also, if we look further back, there has been some significant network thrashing going on. Goose was working pretty hard to move some data around.
Here it is in local time. Looks like almost all day on the 13th at 50% CPU, and then ~100% this morning for a bit.
Given that no one really uses the portal on goose, my guess is a large data load on the 13th, followed by a query to ERDDAP this morning that pinned the CPU.
FYI, I’m running some scripts that support the Calvert ocean model. They typically run 4 times a day. Should not be CPU intensive.
Hakai-ctd-tools may run a set of tests on pushes to Hakai-ctd-tools on goose. These usually last less than 10 min.
@fostermh Can you generate similar plots but for last week? I had issues a lot last week.
Here are the last 9 days. Note it's aggregated to 1 hour, I think, so it hides the spikes in usage. The second image is CPU only, for the same time period, aggregated to 1 minute.
(Screenshots: "All Metrics" overview and a "CPU"-only plot)
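As an aside, the aggregation point above matters: averaging over an hour can make a short, fully pinned CPU look almost idle. A minimal illustration with made-up sample values (not real goose metrics):

```python
# Illustrative only: how hourly averaging hides short CPU spikes.
# 60 one-minute samples: mostly idle (5%), with a 5-minute burst at 100%.
minute_samples = [5.0] * 55 + [100.0] * 5

hourly_mean = sum(minute_samples) / len(minute_samples)
minute_max = max(minute_samples)

print(f"1-hour average: {hourly_mean:.1f}%")  # ~12.9% -- looks idle
print(f"1-minute peak:  {minute_max:.1f}%")   # 100.0% -- actually pinned
```

This is why the 1-minute CPU plot is the more useful one for spotting the pinned periods.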
Not really used then, it seems, hmm.
This is pretty typical of most of our servers: low average usage with high bursts.
After investigating the goose server a bit more, @steviewanders and I figured out that the main issue is Goose's RAM, which is saturated at 16 GB. Temporarily stopping ERDDAP dropped RAM usage down to 12 GB, and it came back up to 16 GB once ERDDAP was back on.
@raytula and I discussed briefly, and he suggested bumping the RAM in the short term to mitigate the problem while we sort out our potential alternate options for moving ERDDAP to its own hosting.
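For anyone reproducing the RAM check above, a quick way to confirm saturation on a Linux host like goose is to read `/proc/meminfo` directly (Linux-specific; a sketch, not part of any Hakai tooling):

```python
# Quick check of memory pressure on a Linux host (Linux-specific: /proc/meminfo).
def read_meminfo():
    """Parse /proc/meminfo into a dict of integer values (in kB)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])  # values are reported in kB
    return info

m = read_meminfo()
total_gb = m["MemTotal"] / 1024**2
used_gb = (m["MemTotal"] - m["MemAvailable"]) / 1024**2
print(f"RAM used: {used_gb:.1f} GiB of {total_gb:.1f} GiB")
```

Watching this before and after stopping ERDDAP would show the ~16 GB → ~12 GB drop described above.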
I see that our reserved instances for both hecate.hakai.org and goose.hakai.org expired this August (see below). So, from a cost perspective, now may be a good time to move either or both of these to a different instance type. For example, we make use of burstable instance types (t3) in other scenarios. If staying in the 'm' instance class, I suspect there is a newer replacement for m5.
So, considerations:
Also note that there may be newer/better ways to reduce costs, rather than purchasing specific reserved instances.
Interesting and new to me. I had a quick read through; it seems a Compute Savings Plan would apply to more situations (Fargate and Lambda compute, other regions) and to any future changes. They offer the computations for all the possible settings here.
Given the recent changes involving @n-a-t-e moving compute and storage over for https://oceanconnect.ca/, the last 7 days may be more indicative of current and future AWS usage outside solely EC2 consumption (the above link is for 30 days previous)
Not sure what I was thinking about finding the better one....
You can obviously apply two different plans to do different things.
E.g., Compute for our RDS/Lambda, and an EC2 plan for our remaining region-locked EC2 instances.
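To make the comparison above concrete, the arithmetic is just hourly rate × hours committed. A sketch with hypothetical placeholder rates (not real AWS pricing; check the AWS pricing pages for actual numbers):

```python
# Sketch of comparing on-demand vs. committed (savings plan / reserved) cost.
# The hourly rates below are HYPOTHETICAL placeholders, not real AWS pricing.
HOURS_PER_YEAR = 24 * 365  # 8760

def annual_cost(hourly_rate: float) -> float:
    """Annual cost for one instance running continuously."""
    return hourly_rate * HOURS_PER_YEAR

on_demand = annual_cost(0.192)  # hypothetical on-demand rate for an m5-class instance
committed = annual_cost(0.121)  # hypothetical 1-year savings-plan rate

savings = on_demand - committed
print(f"On-demand: ${on_demand:,.2f}/yr")
print(f"Committed: ${committed:,.2f}/yr")
print(f"Savings:   ${savings:,.2f}/yr ({savings / on_demand:.0%})")
```

The same per-hour comparison works whether the commitment is an EC2 Instance Savings Plan, a Compute Savings Plan, or a reserved instance; the plans differ in what the discounted rate applies to, not in the arithmetic.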
@JessyBarrette and I met and started testing other options for deploying ERDDAP development (and perhaps prod, if successful and desired) to other container platforms. So likely that will go ahead and ERDDAP will move off GOOSE in the short-medium term (and perhaps Seaspan Royal and other harvester jobs relying on EFS), which will hopefully lessen the impact of this issue.
@JessyBarrette or @n-a-t-e Noticing any more Goose instability? I assume this was the ERDDAP RAM issue and hence will be resolved permanently by https://github.com/HakaiInstitute/hakai-datasets/pull/166
I personally haven't, although I haven't spent much time on goose lately. I would be OK to drop the goose ERDDAP now that development.erddap.hakai.app is fully available.
Level
High
Issue Description
The hakai-datasets development branch is deployed on the goose server for demonstration and testing purposes. The same server is also running a number of Hakai services. Among these:
For some reason, this server has recently been failing repeatedly by:
Issue Solution
Perhaps it would be good to track our different servers':
- health
- services running

Or move ERDDAP and other services to more dedicated deployments.
Relevant extra information (log output, etc.)
No response