Hakai Goose server is unstable

JessyBarrette commented 7 months ago

Level

High

Issue Description

hakai-datasets development branch is deployed on the goose server for demonstration and testing purposes.

The same server is also running a number of Hakai services. Among these:

seaspan-royal cron every 30 min
hakai-portal development

For some reason, this server has recently been failing repeatedly, with:

lagging terminal interaction
erddap keeps on crashing
seaspan-royal doesn't seem to be harvesting the data anymore.

Issue Solution

Perhaps it would be good to track our different servers:

health
services running
or move erddap and other services to more dedicated deployments
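As a rough illustration of what tracking server health could look like on a Linux host, the sketch below reads `/proc/meminfo` and flags high memory use. The function names and the 90% threshold are assumptions made for this example, not anything currently deployed on Goose.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo: dict[str, int]) -> float:
    """Percent of RAM in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def check_memory(text: str, threshold_percent: float = 90.0) -> tuple[float, bool]:
    """Return (used percent, True if at or above the hypothetical alert threshold)."""
    used = memory_used_percent(parse_meminfo(text))
    return used, used >= threshold_percent


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")  # only present on Linux
    if meminfo_path.exists():
        used, alert = check_memory(meminfo_path.read_text())
        print(f"memory used: {used:.1f}% alert={alert}")
```

Something like this could run from cron and append to a log, which would at least show whether memory was exhausted at the time of a crash.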

Relevant extra information (log output,etc.)

No response

steviewanders commented 7 months ago

@fostermh @tayden @n-a-t-e Anybody have thoughts or opinions here?

I don't use or pay any attention to Goose, so hopefully one of you is familiar with this.

Otherwise, if that is not the case, I have two ideas I can implement:

Add detailed monitoring (for example: https://github.com/netdata/netdata) to Goose/Hecate to see whether a cause can be identified at the time of the crashes.
If that does not help, move ERDDAP somewhere else (if possible), as it is the only service suffering.

@JessyBarrette You are welcome to move the Seaspan Royal code to CapRover or I can find a better fit for it.

fostermh commented 7 months ago

According to the AWS monitoring, the CPU was pinned for about an hour, one hour ago. Also, if we look further back, there has been some significant network thrashing going on. Goose was working pretty hard to move some data around.

fostermh commented 7 months ago

Here it is in local time. Looks like almost all day on the 13th at 50% CPU and then ~100% this morning for a bit.

fostermh commented 7 months ago

Given that no one really uses the portal on goose, my guess is a large data load on the 13th, followed by a query to erddap this morning that pinned the CPU.

pramod-thupaki commented 7 months ago

FYI, I'm running some scripts that support the Calvert ocean model. They typically run 4 times a day and should not be CPU intensive. (Sent by email in reply to Matthew Foster's comment of Nov 14, 2023, 10:39.)


JessyBarrette commented 7 months ago

Hakai-ctd-tools may run a set of tests on goose whenever changes are pushed to Hakai-ctd-tools. This usually lasts less than 10 min.

@fostermh Can you generate similar plots but for last week? I had a lot of issues last week.

fostermh commented 7 months ago

Here are the last 9 days. Note that it's aggregated to 1 hour, I think, so it hides the spikes in usage. The second image is CPU only for the same time period, aggregated to 1 minute.

[Image: All Metrics]

[Image: CPU]
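To illustrate why a 1-hour aggregation can hide spikes that a 1-minute view shows, here is a small sketch with made-up numbers (not the actual Goose metrics). It averages 60 one-minute CPU readings into a single hourly value and compares that with the hourly maximum; the `aggregate` helper is hypothetical.

```python
from statistics import mean


def aggregate(samples: list[float], window: int) -> list[tuple[float, float]]:
    """Group samples into fixed-size windows and return (mean, max) per window."""
    chunks = (samples[i:i + window] for i in range(0, len(samples), window))
    return [(mean(chunk), max(chunk)) for chunk in chunks]


# Made-up data: one hour of 1-minute CPU readings,
# idle at 5% for 50 minutes with a 10-minute burst at 100%.
cpu_percent = [5.0] * 50 + [100.0] * 10

hourly = aggregate(cpu_percent, window=60)
hourly_mean, hourly_max = hourly[0]
print(f"hourly mean: {hourly_mean:.1f}%  hourly max: {hourly_max:.1f}%")
```

With these invented readings the hourly mean comes out at roughly 21%, even though the CPU was pinned at 100% for ten minutes. Plotting the maximum per period (or using a finer period) keeps such bursts visible.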

JessyBarrette commented 7 months ago

Not really used then, it seems. Hmm.

fostermh commented 7 months ago

This is pretty typical of most of our servers: low average usage with high bursts.

JessyBarrette commented 7 months ago

After investigating the goose server a bit more, @steviewanders and I figured out that the main issue is related to Goose's RAM, which is saturated at 16 GB. Temporarily stopping erddap dropped RAM usage down to 12 GB, and it came back up to 16 GB once erddap was back on.
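For anyone wanting to repeat this kind of check, a minimal sketch is below. It sums resident memory per command name from text in the format produced by `ps -eo rss=,comm=` (RSS in KiB). The sample values are invented for illustration, and the helper names are hypothetical. ERDDAP runs on the JVM, so on a real host it would likely show up under `java`.

```python
from collections import defaultdict


def memory_by_command(ps_output: str) -> dict[str, float]:
    """Sum resident memory (GiB) per command from `ps -eo rss=,comm=` style lines."""
    totals_kib = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank lines or a header row
        totals_kib[parts[1].strip()] += int(parts[0])
    return {cmd: kib / (1024 * 1024) for cmd, kib in totals_kib.items()}


def top_consumers(ps_output: str, n: int = 3) -> list[tuple[str, float]]:
    """Return the n commands using the most resident memory, largest first."""
    usage = memory_by_command(ps_output)
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:n]


# Invented sample, not real output from Goose.
SAMPLE = """\
4194304 java
1048576 python3
 524288 postgres
 524288 python3
"""

for command, gib in top_consumers(SAMPLE):
    print(f"{command:10s} {gib:.1f} GiB")
```

RSS counts shared pages once per process, so the totals are an upper bound rather than an exact figure, but they are usually enough to spot which service is holding most of the RAM.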

steviewanders commented 7 months ago

@raytula and I discussed this briefly, and he suggested bumping the RAM in the short term to mitigate the problem while we sort out our potential alternate options for moving ERDDAP to its own hosting.

raytula commented 7 months ago

I see that our reserved instances for both hecate.hakai.org and goose.hakai.org expired this August (see below). So, from a cost perspective, now may be a good time to move either or both of these to a different instance type. For example, we make use of burstable instance types (t3) in other scenarios. If staying in the 'm' instance class, I suspect there is a newer replacement for m5.

So, considerations:

Change base instance type of hecate and/or goose
Increase RAM on one or both of these existing servers
Purchase new reserved instances to reduce our costs. Note that purchasing many 'smaller' reserved instances provides more flexibility to move things around later

Also note that there may be newer/better ways to reduce costs, rather than purchasing specific reserved instances.

raytula commented 7 months ago

Savings plans seem like a better way to manage costs now.

steviewanders commented 7 months ago

Interesting and new to me. I had a quick read through; it seems the Compute Savings Plan would apply to more situations (Fargate and Lambda compute, other regions) and to any future changes. They offer the computations for all the possible settings here:

https://us-east-1.console.aws.amazon.com/cost-management/home?region=us-east-1#/savings-plans/recommendations?lookbackPeriodInDays=THIRTY_DAYS&paymentOption=ALL_UPFRONT&scope=PAYER&spType=COMPUTE_SP&termInYears=THREE_YEAR&tokens=%5B%5D

Given the recent changes involving @n-a-t-e moving compute and storage over for https://oceanconnect.ca/, the last 7 days may be more indicative of current and future AWS usage outside solely EC2 consumption (the above link is for 30 days previous)

steviewanders commented 7 months ago

Not sure what I was thinking about finding the better one....

You can obviously apply two different plans to do different things.

E.g. Compute for our RDS/Lambda and EC2 for our remaining region-locked EC2 instances.

steviewanders commented 7 months ago

@JessyBarrette and I met and started testing other options for deploying ERDDAP development (and perhaps prod if successful and desired) to other container platforms.

So likely that will go ahead and ERDDAP will move off GOOSE in the short-medium term (and perhaps Seaspan Royal and other harvester jobs relying on EFS) and that will hopefully lessen the impact of this issue.

steviewanders commented 3 months ago

@JessyBarrette or @n-a-t-e Noticing any more Goose instability? I assume this was the ERDDAP RAM issue and hence will be resolved permanently by https://github.com/HakaiInstitute/hakai-datasets/pull/166

JessyBarrette commented 3 months ago

I personally haven't, although I haven't spent much time on goose lately. I would be OK with dropping the goose erddap now that development.erddap.hakai.app is fully available.
