bvizzier-ucsc opened 4 months ago
@hannes-ucsc: "We noticed that ongoing manifest downloads yield many 404s as the index is emptied for reindexing which in turn leads to those downloads to trip the WAF request rate limit."
@hannes-ucsc: "See also #5528 for what could be part of the solution here (disabling the generation of manifests during reindex)."
Assignee to consider next steps.
We'll maintain a maintenance schedule as a JSON object in the shared S3 bucket. The key of the object will be `azul/{deployment_name}/azul.json`.
The format of the object is as follows (by example):
```json
{
    "maintenance": {
        "schedule": {
            "events": [
                {
                    "planned_start": "2024-02-15T18:00:00.000000Z",
                    "planned_duration": 86400,
                    "description": "A new HCA release catalog needs to be indexed",
                    "impacts": [
                        {
                            "kind": "partial_responses",
                            "affected_catalogs": [
                                "dcp35"
                            ]
                        },
                        {
                            "kind": "degraded_performance",
                            "affected_catalogs": [
                                "*"
                            ]
                        }
                    ],
                    "actual_start": "2024-02-15T17:59:12.123000Z",
                    "actual_end": "2024-02-16T22:01:39.023450Z"
                },
                {
                    "planned_start": "2024-02-27T23:45:00.000000Z",
                    "planned_duration": 900,
                    "description": "A security upgrade of the system database requires the system to be offline for a brief period",
                    "impacts": [
                        {
                            "kind": "service_unavailable",
                            "affected_catalogs": [
                                "*"
                            ]
                        }
                    ],
                    "actual_start": "2024-02-28T00:14:33.003456Z"
                },
                {
                    "planned_start": "2024-03-24T18:00:00.000000Z",
                    "planned_duration": 345600,
                    "description": "A reindex of all HCA catalogs is necessary in order to incorporate HCA tissue atlas metadata into the index",
                    "impacts": [
                        {
                            "kind": "partial_responses",
                            "affected_catalogs": [
                                "dcp*"
                            ]
                        },
                        {
                            "kind": "degraded_performance",
                            "affected_catalogs": [
                                "*"
                            ]
                        }
                    ]
                }
            ]
        }
    }
}
```
The contents of `object['maintenance']['schedule']`, henceforth "the schedule", or `schedule`, will be exposed on `/maintenance/schedule`.

The start time of an event is its `actual_start` if set, or its `planned_start` otherwise. The end time of an event is its `actual_end` if set, or its start plus `planned_duration` otherwise. All events in the schedule are sorted by their start time. No two events have the same start time. Each event defines an interval `[e.start, e.end)` and there is no overlap between these intervals.
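Since each interval is half-open and start times are distinct, both invariants (sorted by start, no overlap) can be checked in a single pass: flattening the interval endpoints must yield a non-decreasing sequence. A minimal sketch with hypothetical timestamps:

```python
from datetime import datetime

# Hypothetical (start, end) pairs for three events, derived per the rules above
intervals = [
    (datetime(2024, 2, 15, 18), datetime(2024, 2, 16, 18)),
    (datetime(2024, 2, 27, 23, 45), datetime(2024, 2, 28, 0, 0)),
    (datetime(2024, 3, 24, 18), datetime(2024, 3, 28, 18)),
]

# Flatten the endpoints; any unsorted pair or overlap breaks monotonicity
flat = [t for interval in intervals for t in interval]
assert flat == sorted(flat), 'events overlap or are unsorted'
```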
A pending event is one where `actual_start` is absent. An active event is one where `actual_start` is present but `actual_end` is absent. There can be at most one active event. Note that the current time is not used in these definitions. At minimum, a user interface should render all pending events as planned for the future and the active event as indicating ongoing maintenance. In case of unforeseen circumstances, a pending event could become overdue, i.e. its `planned_start` could lapse before it is activated by the operator. The UI should account for that.
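The start/end derivation can be sketched directly against raw event JSON (a minimal illustration only; the helper names are hypothetical):

```python
from datetime import datetime, timedelta

def parse(ts: str) -> datetime:
    # Timestamps use a trailing 'Z'; translate it for fromisoformat
    return datetime.fromisoformat(ts.replace('Z', '+00:00'))

def start(event: dict) -> datetime:
    # actual_start if set, planned_start otherwise
    actual = event.get('actual_start')
    return parse(actual) if actual else parse(event['planned_start'])

def end(event: dict) -> datetime:
    # actual_end if set, start plus planned_duration otherwise
    actual = event.get('actual_end')
    return parse(actual) if actual else start(event) + timedelta(seconds=event['planned_duration'])

# The second event from the example above: active (started, not yet ended)
event = {
    'planned_start': '2024-02-27T23:45:00.000000Z',
    'planned_duration': 900,
    'actual_start': '2024-02-28T00:14:33.003456Z',
}
assert start(event) == parse('2024-02-28T00:14:33.003456Z')
assert end(event) - start(event) == timedelta(seconds=900)
```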
I've prototyped the in-memory model for this (`src/azul/maintenance.py`):
```python
from datetime import (
    UTC,
    datetime,
    timedelta,
)
from enum import (
    Enum,
    auto,
)
import json
from operator import (
    attrgetter,
)
import sys
from typing import (
    Iterator,
    Sequence,
)

import attrs
from more_itertools import (
    flatten,
    only,
)

from azul import (
    JSON,
    reject,
    require,
)
from azul.collections import (
    adict,
)
from azul.time import (
    format_dcp2_datetime,
    parse_dcp2_datetime,
)


class MaintenanceImpactKind(Enum):
    partial_responses = auto()
    degraded_performance = auto()
    service_unavailable = auto()


@attrs.define
class MaintenanceImpact:
    kind: MaintenanceImpactKind
    affected_catalogs: list[str]

    @classmethod
    def from_json(cls, impact: JSON):
        return cls(kind=MaintenanceImpactKind[impact['kind']],
                   affected_catalogs=impact['affected_catalogs'])

    def to_json(self) -> JSON:
        return dict(kind=self.kind.name,
                    affected_catalogs=self.affected_catalogs)

    def validate(self):
        require(all(isinstance(c, str) and c for c in self.affected_catalogs),
                'Invalid catalog name/pattern')
        require(all({0: True, 1: c[-1] == '*'}.get(c.count('*'), False)
                    for c in self.affected_catalogs),
                'Invalid catalog pattern')


@attrs.define
class MaintenanceEvent:
    planned_start: datetime
    planned_duration: timedelta
    description: str
    impacts: list[MaintenanceImpact]
    actual_start: datetime | None
    actual_end: datetime | None

    @classmethod
    def from_json(cls, event: JSON):
        return cls(planned_start=cls._parse_datetime(event['planned_start']),
                   planned_duration=timedelta(seconds=event['planned_duration']),
                   description=event['description'],
                   impacts=list(map(MaintenanceImpact.from_json, event['impacts'])),
                   actual_start=cls._parse_datetime(event.get('actual_start')),
                   actual_end=cls._parse_datetime(event.get('actual_end')))

    def to_json(self) -> JSON:
        result = adict(planned_start=self._format_datetime(self.planned_start),
                       planned_duration=int(self.planned_duration.total_seconds()),
                       description=self.description,
                       impacts=[i.to_json() for i in self.impacts],
                       actual_start=self._format_datetime(self.actual_start),
                       actual_end=self._format_datetime(self.actual_end))
        return result

    @classmethod
    def _parse_datetime(cls, value: str | None) -> datetime | None:
        return None if value is None else parse_dcp2_datetime(value)

    @classmethod
    def _format_datetime(cls, value: datetime | None) -> str | None:
        return None if value is None else format_dcp2_datetime(value)

    @property
    def start(self):
        return self.actual_start or self.planned_start

    @property
    def end(self):
        return self.actual_end or self.start + self.planned_duration

    def validate(self):
        require(isinstance(self.planned_start, datetime),
                'No planned start', self.planned_start)
        require(self.planned_start.tzinfo == UTC)
        require(isinstance(self.description, str) and self.description,
                'Invalid description', self.description)
        for impact in self.impacts:
            impact.validate()
        reject(self.actual_end is not None and self.actual_start is None,
               'Event has end but no start')
        require(self.start < self.end,
                'Event start is not before end')


@attrs.define
class MaintenanceSchedule:
    events: list[MaintenanceEvent]

    @classmethod
    def from_json(cls, schedule: JSON):
        return cls(events=list(map(MaintenanceEvent.from_json, schedule['events'])))

    def to_json(self) -> JSON:
        return dict(events=[e.to_json() for e in self.events])

    def validate(self):
        for e in self.events:
            e.validate()
        starts = set(e.start for e in self.events)
        require(len(starts) == len(self.events),
                'Start times are not distinct')
        # Since starts are distinct, we'll never need the end as a tie breaker
        intervals = [(e.start, e.end) for e in self.events]
        require(intervals == sorted(intervals),
                'Events are not sorted by start time')
        values = list(flatten(intervals))
        require(values == sorted(values),
                'Events overlap', values)
        reject(len(self._active_events()) > 1,
               'More than one active event')
        require(all(e.actual_start is None for e in self.pending_events()),
                'Active event mixed in with pending ones')

    def pending_events(self) -> list[MaintenanceEvent]:
        """
        Returns a list of pending events in this schedule. The elements in the
        returned list can be modified until another method is invoked on this
        schedule. The modifications will be reflected in ``self.events`` but
        the caller is responsible for ensuring they don't invalidate this
        schedule.
        """
        events = enumerate(self.events)
        for i, e in events:
            if e.actual_start is None:
                return self.events[i:]
        return []

    def active_event(self) -> MaintenanceEvent | None:
        return only(self._active_events())

    def _active_events(self) -> Sequence[MaintenanceEvent]:
        return [
            e
            for e in self.events
            if e.actual_start is not None and e.actual_end is None
        ]

    def _now(self):
        return datetime.now(UTC)

    def add_event(self, event: MaintenanceEvent):
        """
        Add the given event to this schedule unless doing so would invalidate
        this schedule.
        """
        events = self.events
        try:
            self.events = events.copy()
            self.events.append(event)
            self.events.sort(key=attrgetter('start'))
            self.validate()
        except BaseException:
            self.events = events
            raise

    def cancel_event(self, index: int) -> MaintenanceEvent:
        event = self.pending_events()[index]
        self.events.remove(event)
        self.validate()
        return event

    def start_event(self) -> MaintenanceEvent:
        pending = iter(self.pending_events())
        event = next(pending, None)
        reject(event is None, 'No events pending to be started')
        event.actual_start = self._now()
        self._heal(event, pending)
        assert event == self.active_event()
        return event

    def end_event(self) -> MaintenanceEvent:
        event = self.active_event()
        reject(event is None, 'No active event')
        event.actual_end = self._now()
        self._heal(event, iter(self.pending_events()))
        assert self.active_event() is None
        return event

    def _heal(self,
              event: MaintenanceEvent,
              pending: Iterator[MaintenanceEvent]):
        for next_event in pending:
            if next_event.planned_start < event.end:
                next_event.planned_start = event.end
            event = next_event
        self.validate()


def main():
    with open('/Users/hannes/Library/Application Support/JetBrains/PyCharm2024.1/scratches/scratch_22.json') as f:
        schedule = MaintenanceSchedule.from_json(json.load(f)['maintenance']['schedule'])
    schedule.validate()
    print(schedule.active_event())
    print(schedule.end_event())
    # print(schedule.cancel_event(0))
    print(schedule.start_event())
    json.dump(schedule.to_json(), sys.stdout, indent=4)


if __name__ == '__main__':
    main()
```
I'll clean this up and commit it to the feature branch before any other work is added.
There will also be a command line utility (`scripts/manage_maintenance.py`) that reads the JSON from the bucket, deserializes the model from it, validates the model, applies an action to it, serializes the model back to JSON, and finally uploads it back to the bucket, where the service exposes it as described above. The service must also validate the model before returning it.
The command line utility should have roughly the following synopsis:
- `list`: list events in JSON form
  - `--all` (include completed events; the default is to list only active and pending events)
- `add`: schedule an event
  - `--start {iso_datetime}` (any abbreviations allowed by the `datetime` module are OK; implicit (local) or explicit timezones should also be accepted)
  - `--duration {iso_duration}` (or whatever human-readable shorthand is common/popular in Python)
  - `--description {text}`
  - `--partial-responses {catalog_name} [{catalog_name} ...]`
  - `--degraded-performance {catalog_name} [{catalog_name} ...]`
  - `--service-unavailable {catalog_name} [{catalog_name} ...]`
- `cancel`: cancel a pending event
  - `--index {number}`
- `start`: activate a pending event
- `end`: complete the active event
- `adjust`: modify the active event
  - `--duration {iso_duration}` (or whatever human-readable shorthand is common/popular in Python)

Some of those top-level commands correspond to the model methods. For others, model methods need to be implemented.
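An `argparse` skeleton matching that synopsis could look like the following (a sketch; argument parsing details and wiring to the model methods are omitted):

```python
import argparse

def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog='manage_maintenance.py')
    sub = parser.add_subparsers(dest='command', required=True)
    list_ = sub.add_parser('list', help='list events in JSON form')
    list_.add_argument('--all', action='store_true',
                       help='include completed events')
    add = sub.add_parser('add', help='schedule an event')
    add.add_argument('--start', required=True)
    add.add_argument('--duration', required=True)
    add.add_argument('--description', required=True)
    add.add_argument('--partial-responses', nargs='+', metavar='catalog_name')
    add.add_argument('--degraded-performance', nargs='+', metavar='catalog_name')
    add.add_argument('--service-unavailable', nargs='+', metavar='catalog_name')
    cancel = sub.add_parser('cancel', help='cancel a pending event')
    cancel.add_argument('--index', type=int, required=True)
    sub.add_parser('start', help='activate a pending event')
    sub.add_parser('end', help='complete the active event')
    adjust = sub.add_parser('adjust', help='modify the active event')
    adjust.add_argument('--duration')
    return parser

args = make_parser().parse_args(['cancel', '--index', '0'])
assert args.command == 'cancel' and args.index == 0
```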
Promotion PR checklist items should be added accordingly.
There will soon be multiple deployments that share the `prod` ES domain. When a reindex is scheduled for `prod`, all other deployments will be impacted with degraded performance. The CL items should remind the operator of this caveat.
Open questions:
The issue title ends in "maximizing system availability" but we need to define what that means and what measures to take. I don't think announcing maintenance windows is actually going to alleviate user frustration. It's just the first thing people think of when the system is impacted by ongoing, unannounced maintenance. Even after we announce maintenance, users will still be frustrated. If we have the money, I would prefer investing time in A/B deployments. Until then, the approach defined above is probably still the cheaper one overall (no infrastructure cost, some development cost, and moderate operational cost).
Giving notice of 6 days as specified in the description will likely increase the promotion latency by a week. We decide on Tue/Wed what to promote, file the promotion PR on Wed, approve it on Wed/Thu and promote on Thu/Fri. We can schedule maintenance when the PR is approved. Friday promotions are obviously better for our "business" users but tend to disrupt the operator's and system administrator's weekend. We may have to move this around and decide if we want to commit to scheduling maintenance for the weekends, and how to compensate our team members for that type of overtime.
@hannes-ucsc: "Both open questions were discussed in PL. Assignee to consider next steps."
As agreed in PL, the "maximizing system availability" part was moved to another issue.
As agreed in PL, the six-day announcement is aspirational. It would delay every promotion by a week and increase the HCA release latency by that amount of time, which conflicts with another objective given to us by stakeholders in earlier guidance.
The typical promotion timeline will be:
1) Tue: Decide what to promote, file PR
2) Wed: Review and approve PR, use `scripts/manage_maintenance.py` in stable deployments to schedule the maintenance window; Data Browser automatically displays an announcement
3) Fri: Perform promotion, start reindex
4) Mon: Tend to reindex, triage errors
This looks like a reasonable plan to me.
We just had a user write in to the Support Center asking for this. Is this something we can prioritize?
@kayleemathews This is something that is being worked on, however it sometimes gets bumped by higher priority work. This will require work by both the Azul (back-end) team and the Data Browser (front-end) team. Expectation is we might have something "in a few weeks" due to overall AnVIL & HCA priorities.
There are several facets to the perception by end-users that the system is functioning smoothly.
Changes in several areas can help improve the end-user's perception of the system's availability and operation.
1) Letting users know when the system may not be fully functional due to intentional changes to the system (e.g., system maintenance).
2) Improving system operation to minimize the time that the system functionality is unavailable or limited (e.g., re-indexing in the active production database).
Item 1 above is the result of receiving numerous messages from users and grant organizers when they visit the site while the system is undergoing maintenance, and it is not clear to them why it is unavailable. A notification system will help them understand that the degraded operation is expected.
Pre-Event Notification
Provide end-users with a pre-event notification that the system will be undergoing maintenance starting on a particular date and time, and how long it is expected to last.
Ideally, the pre-event notification would start being displayed approximately 6 business days prior to the maintenance. It is understood that this may not always be possible, especially in the case of an urgent hot-fix. In those cases, the notification should be made as soon as possible. In some cases, such as a critical security update, such notification may not be possible.
Some maintenance may not have a predictable outage. In those cases, it is recommended that there be pre-event notification that includes that fact.
During Maintenance Notification
During maintenance, when the system is not fully operational, display a message stating that it is unavailable and when it is expected to be back up.
During an outage, the message should be updated to the one described above.
Potential operator interaction:
The Data Browser will need to be updated to display the notification string.
The "Example text" provided above should be taken as suggestions and not strict requirements.
[Edit: Expanded the scope of the ticket to include system operation improvements and to clarify the need for notification. - BV Feb. 23, 2024.]