bvizzier-ucsc commented 4 months ago

There are several facets to end-users' perception that the system is functioning smoothly.
Changes in several areas can help improve the end-user's perception of the system's availability and operation.

1) Letting users know when the system may not be fully functional due to an intentional change to the system (e.g., system maintenance).

2) Improving system operation to minimize the time that the system functionality is unavailable or limited (e.g., re-indexing in the active production database).

Item 1 above is the result of receiving numerous messages from users and grant organizers when they visit the site while the system is undergoing maintenance, and it is not clear to them why it is unavailable. A notification system will help them understand that the degraded operation is expected.

Pre-Event Notification

Provide end-users with a pre-event notification that the system will be undergoing maintenance starting on a particular date and time and is expected to last for .

Example text: "The will be undergoing maintenance on starting around , and is expected to take . During this time the system may be unavailable."

Ideally, the pre-event notification would start being displayed approximately 6 business days prior to the maintenance. It is understood that this may not always be possible, especially in the case of an urgent hot-fix. In those cases, the notification should be made as soon as possible. In some cases, such as a critical security update, advance notification may not be possible at all.

Some maintenance may not have a predictable outage. In those cases, it is recommended that there be a pre-event notification that states that fact.

Example text: "The will be undergoing maintenance on starting around , and is expected to take . The system is expected to be operational during this maintenance."

During Maintenance Notification

During maintenance, when the system is not fully operational, display a message stating that it is unavailable and when it is expected to be back up.

Example text: "The is undergoing maintenance and is currently unavailable. We expect it to return to full operation on or before ."

During an outage, the message should be updated to the above.

Potential operator interaction:

Command for the operator to populate or update the "pre-event notification."
Command for the operator to switch message to the "during maintenance" notification.
Command to clear the notifications once the maintenance is complete. (This could potentially be automated.)

The Data Browser will need to be updated to display the notification string.

The "Example text" provided above should be taken as suggestions and not strict requirements.

[Edit: Expanded the scope of the ticket to include system operation improvements and to clarify the need for notification. - BV Feb. 23, 2024.]

[ ] Security design review completed; the Resolution of this issue does not …
- [ ] … affect authentication; for example:
- OAuth 2.0 with the application (API or Swagger UI)
- Authentication of developers with Google Cloud APIs
- Authentication of developers with AWS APIs
- Authentication with a GitLab instance in the system
- Password and 2FA authentication with GitHub
- API access token authentication with GitHub
- Authentication with
- [ ] … affect the permissions of internal users like access to
- Cloud resources on AWS and GCP
- GitLab repositories, projects and groups, administration
- an EC2 instance via SSH
- GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
- [ ] … affect the permissions of external users like access to
- TDR snapshots
- [ ] … affect permissions of service or bot accounts
- Cloud resources on AWS and GCP
- [ ] … affect audit logging in the system, like
- adding, removing or changing a log message that represents an auditable event
- changing the routing of log messages through the system
- [ ] … affect monitoring of the system
- [ ] … introduce a new software dependency like
- Python packages on PYPI
- Command-line utilities
- Docker images
- Terraform providers
- [ ] … add an interface that exposes sensitive or confidential data at the security boundary
- [ ] … affect the encryption of data at rest
- [ ] … require persistence of sensitive or confidential data that might require encryption at rest
- [ ] … require unencrypted transmission of data within the security boundary
- [ ] … affect the network security layer; for example by
- modifying, adding or removing firewall rules
- modifying, adding or removing security groups
- changing or adding a port a service, proxy or load balancer listens on
[ ] Documentation on any unchecked boxes is provided in comments below

dsotirho-ucsc commented 4 months ago

@hannes-ucsc: "We noticed that ongoing manifest downloads yield many 404s as the index is emptied for reindexing which in turn leads to those downloads to trip the WAF request rate limit."

dsotirho-ucsc commented 4 months ago

@hannes-ucsc: "See also #5528 for what could be part of the solution here (disabling the generation of manifests during reindex)."

dsotirho-ucsc commented 4 months ago

Assignee to consider next steps.

hannes-ucsc commented 4 months ago

We'll maintain a maintenance schedule as a JSON object in the S3 shared bucket. The key of the object will be azul/{deployment_name}/azul.json.

The format of the object is as follows (by example):

{
    "maintenance": {
        "schedule": {
            "events": [
                {
                    "planned_start": "2024-02-15T18:00:00.000000Z",
                    "planned_duration": 86400,
                    "description": "A new HCA release catalog needs to be indexed",
                    "impacts": [
                        {
                            "kind": "partial_responses",
                            "affected_catalogs": [
                                "dcp35"
                            ]
                        },
                        {
                            "kind": "degraded_performance",
                            "affected_catalogs": [
                                "*"
                            ]
                        }
                    ],
                    "actual_start": "2024-02-15T17:59:12.123000Z",
                    "actual_end": "2024-02-16T22:01:39.023450Z"
                },
                {
                    "planned_start": "2024-02-27T23:45:00.000000Z",
                    "planned_duration": 900,
                    "description": "A security upgrade of the system database requires the system to be offline for a brief period",
                    "impacts": [
                        {
                            "kind": "service_unavailable",
                            "affected_catalogs": [
                                "*"
                            ]
                        }
                    ],
                    "actual_start": "2024-02-28T00:14:33.003456Z"
                },
                {
                    "planned_start": "2024-03-24T18:00:00.000000Z",
                    "planned_duration": 345600,
                    "description": "A reindex of all HCA catalogs is necessary in order to incorporate HCA tissue atlas metadata into the index",
                    "impacts": [
                        {
                            "kind": "partial_responses",
                            "affected_catalogs": [
                                "dcp*"
                            ]
                        },
                        {
                            "kind": "degraded_performance",
                            "affected_catalogs": [
                                "*"
                            ]
                        }
                    ]
                }
            ]
        }
    }
}

The contents of object['maintenance']['schedule'], henceforth "the schedule", or schedule, will be exposed on /maintenance/schedule.

The start time of an event is its actual_start if set, or its planned_start otherwise. The end time of an event is its actual_end if set, or its start plus planned_duration otherwise. All events in the schedule are sorted by their start time. No two events have the same start time. Each event defines an interval [e.start, e.end) and there is no overlap between these intervals.
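For illustration, here is a minimal sketch of those rules applied directly to the raw JSON events, independent of the prototype model. The function names are hypothetical, and it assumes timestamps in the format shown in the example object above.

```python
from datetime import datetime, timedelta


def parse_time(value: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace('Z', '+00:00'))


def event_interval(event: dict) -> tuple[datetime, datetime]:
    """Return the (start, end) pair of one event from the schedule JSON."""
    # Start: actual_start if set, planned_start otherwise
    start = parse_time(event.get('actual_start') or event['planned_start'])
    # End: actual_end if set, start plus planned_duration otherwise
    if 'actual_end' in event:
        end = parse_time(event['actual_end'])
    else:
        end = start + timedelta(seconds=event['planned_duration'])
    return start, end


def is_well_ordered(events: list[dict]) -> bool:
    """
    True if the events are sorted by distinct start times and the half-open
    intervals [start, end) of consecutive events do not overlap.
    """
    intervals = [event_interval(e) for e in events]
    for (start, end), (next_start, _) in zip(intervals, intervals[1:]):
        if not (start < next_start and end <= next_start):
            return False
    return True
```

Because the intervals are half-open, an event may start at exactly the moment the previous one ends.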

A pending event is one where actual_start is absent. An active event is one where actual_start is present but actual_end is absent. There can be at most one active event. Note that the current time is not used in that definition. At minimum, a user interface should render all pending events as planned for the future and the active event to indicate ongoing maintenance. In case of unforeseen circumstances, a pending event could become overdue, i.e. its planned_start could lapse before it is activated by the operator. The UI should account for that.
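A rough sketch of how a client could classify events along those lines. The function name and the returned labels are only illustrative assumptions, not part of the proposed API. Note that the clock is consulted only to flag a pending event as overdue, never to decide whether an event is active.

```python
from datetime import datetime, timezone


def classify(event: dict, now: datetime) -> str:
    """Classify one event from the schedule JSON for display purposes."""
    if 'actual_start' not in event:
        # Pending: the operator has not activated the event yet. If the
        # planned start has already lapsed, the event is overdue.
        planned = datetime.fromisoformat(event['planned_start'].replace('Z', '+00:00'))
        return 'overdue' if planned <= now else 'pending'
    elif 'actual_end' not in event:
        # Active: started but not yet ended, regardless of the current time
        return 'active'
    else:
        return 'completed'
```

A UI might render `pending` events as upcoming maintenance, the `active` event as ongoing maintenance, and `overdue` events with wording that avoids promising a start time that has already passed.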

I've prototyped the in-memory model for this (src/azul/maintenance.py):

from datetime import (
    UTC,
    datetime,
    timedelta,
)
from enum import (
    Enum,
    auto,
)
import json
from operator import (
    attrgetter,
)
import sys
from typing import (
    Iterator,
    Sequence,
)

import attrs
from more_itertools import (
    flatten,
    only,
)

from azul import (
    JSON,
    reject,
    require,
)
from azul.collections import (
    adict,
)
from azul.time import (
    format_dcp2_datetime,
    parse_dcp2_datetime,
)

class MaintenanceImpactKind(Enum):
    partial_responses = auto()
    degraded_performance = auto()
    service_unavailable = auto()

@attrs.define
class MaintenanceImpact:
    kind: MaintenanceImpactKind
    affected_catalogs: list[str]

    @classmethod
    def from_json(cls, impact: JSON):
        return cls(kind=MaintenanceImpactKind[impact['kind']],
                   affected_catalogs=impact['affected_catalogs'])

    def to_json(self) -> JSON:
        return dict(kind=self.kind.name,
                    affected_catalogs=self.affected_catalogs)

    def validate(self):
        require(all(isinstance(c, str) and c for c in self.affected_catalogs),
                'Invalid catalog name/pattern')
        require(all({0: True, 1: c[-1] == '*'}.get(c.count('*'), False)
                    for c in self.affected_catalogs),
                'Invalid catalog pattern')

@attrs.define
class MaintenanceEvent:
    planned_start: datetime
    planned_duration: timedelta
    description: str
    impacts: list[MaintenanceImpact]
    actual_start: datetime | None
    actual_end: datetime | None

    @classmethod
    def from_json(cls, event: JSON):
        return cls(planned_start=cls._parse_datetime(event['planned_start']),
                   planned_duration=timedelta(seconds=event['planned_duration']),
                   description=event['description'],
                   impacts=list(map(MaintenanceImpact.from_json, event['impacts'])),
                   actual_start=cls._parse_datetime(event.get('actual_start')),
                   actual_end=cls._parse_datetime(event.get('actual_end')))

    def to_json(self) -> JSON:
        result = adict(planned_start=self._format_datetime(self.planned_start),
                       planned_duration=int(self.planned_duration.total_seconds()),
                       description=self.description,
                       impacts=[i.to_json() for i in self.impacts],
                       actual_start=self._format_datetime(self.actual_start),
                       actual_end=self._format_datetime(self.actual_end))
        return result

    @classmethod
    def _parse_datetime(cls, value: str | None) -> datetime | None:
        return None if value is None else parse_dcp2_datetime(value)

    @classmethod
    def _format_datetime(cls, value: datetime | None) -> str | None:
        return None if value is None else format_dcp2_datetime(value)

    @property
    def start(self):
        return self.actual_start or self.planned_start

    @property
    def end(self):
        return self.actual_end or self.start + self.planned_duration

    def validate(self):
        require(isinstance(self.planned_start, datetime),
                'No planned start', self.planned_start)
        require(self.planned_start.tzinfo == UTC)
        require(isinstance(self.description, str) and self.description,
                'Invalid description', self.description)
        for impact in self.impacts:
            impact.validate()
        reject(self.actual_end is not None and self.actual_start is None,
               'Event has end but no start')
        require(self.start < self.end,
                'Event start is not before end')

@attrs.define
class MaintenanceSchedule:
    events: list[MaintenanceEvent]

    @classmethod
    def from_json(cls, schedule: JSON):
        return cls(events=list(map(MaintenanceEvent.from_json, schedule['events'])))

    def to_json(self) -> JSON:
        return dict(events=[e.to_json() for e in self.events])

    def validate(self):
        for e in self.events:
            e.validate()
        starts = set(e.start for e in self.events)
        require(len(starts) == len(self.events),
                'Start times are not distinct')
        # Since starts are distinct, we'll never need the end as a tie breaker
        intervals = [(e.start, e.end) for e in self.events]
        require(intervals == sorted(intervals),
                'Events are not sorted by start time')
        values = list(flatten(intervals))
        require(values == sorted(values),
                'Events overlap', values)
        reject(len(self._active_events()) > 1,
               'More than one active event')
        require(all(e.actual_start is None for e in self.pending_events()),
                'Active event mixed in with pending ones')

    def pending_events(self) -> list[MaintenanceEvent]:
        """
        Returns a list of pending events in this schedule. The elements in the
        returned list can be modified until another method is invoked on this schedule. The
        modifications will be reflected in ``self.events`` but the caller is
        responsible for ensuring they don't invalidate this schedule.
        """
        events = enumerate(self.events)
        for i, e in events:
            if e.actual_start is None:
                return self.events[i:]
        return []

    def active_event(self) -> MaintenanceEvent | None:
        return only(self._active_events())

    def _active_events(self) -> Sequence[MaintenanceEvent]:
        return [
            e
            for e in self.events
            if e.actual_start is not None and e.actual_end is None
        ]

    def _now(self):
        return datetime.now(UTC)

    def add_event(self, event: MaintenanceEvent):
        """
        Add the given event to this schedule unless doing so would invalidate
        this schedule.
        """
        events = self.events
        try:
            self.events = events.copy()
            self.events.append(event)
            self.events.sort(key=attrgetter('start'))
            self.validate()
        except BaseException:
            self.events = events
            raise

    def cancel_event(self, index: int) -> MaintenanceEvent:
        event = self.pending_events()[index]
        self.events.remove(event)
        self.validate()
        return event

    def start_event(self) -> MaintenanceEvent:
        pending = iter(self.pending_events())
        event = next(pending, None)
        reject(event is None, 'No events pending to be started')
        event.actual_start = self._now()
        self._heal(event, pending)
        assert event == self.active_event()
        return event

    def end_event(self) -> MaintenanceEvent:
        event = self.active_event()
        reject(event is None, 'No active event')
        event.actual_end = self._now()
        self._heal(event, iter(self.pending_events()))
        assert self.active_event() is None
        return event

    def _heal(self,
              event: MaintenanceEvent,
              pending: Iterator[MaintenanceEvent]):
        for next_event in pending:
            if next_event.planned_start < event.end:
                next_event.planned_start = event.end
            event = next_event
        self.validate()

def main():
    with open('/Users/hannes/Library/Application Support/JetBrains/PyCharm2024.1/scratches/scratch_22.json') as f:
        schedule = MaintenanceSchedule.from_json(json.load(f)['maintenance']['schedule'])
        schedule.validate()
        print(schedule.active_event())
        print(schedule.end_event())
        # print(schedule.cancel_event(0))
        print(schedule.start_event())
        json.dump(schedule.to_json(), sys.stdout, indent=4)

if __name__ == '__main__':
    main()

I'll clean this up and commit it to the feature branch before any other work is added.
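As a companion to the pattern validation in the prototype (at most one `*`, and only in the last position), here is a sketch of how a client could decide which impacts apply to a given catalog. The matching semantics are an assumption: a trailing `*` is treated as a prefix wildcard, which is consistent with the "dcp*" and "*" examples but not spelled out anywhere above. The function names are hypothetical.

```python
def is_valid_pattern(pattern: str) -> bool:
    # Mirrors the prototype's validation: non-empty, and either no '*' at all
    # or exactly one '*' as the last character.
    stars = pattern.count('*')
    return bool(pattern) and (stars == 0 or (stars == 1 and pattern.endswith('*')))


def matches(catalog: str, pattern: str) -> bool:
    # Assumption: a trailing '*' matches any catalog name with that prefix,
    # so a lone '*' matches every catalog.
    if pattern.endswith('*'):
        return catalog.startswith(pattern[:-1])
    return catalog == pattern


def impacts_for(event: dict, catalog: str) -> list[str]:
    """Return the kinds of impact the given event has on the given catalog."""
    return [
        impact['kind']
        for impact in event['impacts']
        if any(matches(catalog, p) for p in impact['affected_catalogs'])
    ]
```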

There will also be a command line utility (scripts/manage_maintenance.py) that reads the JSON from the bucket, deserializes the model from it, validates the model, applies an action to it, serializes the model back to JSON and finally uploads it back to the bucket where the service exposes it as described above. The service must also validate the model before returning it.
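The read-validate-modify-write cycle could be sketched roughly as follows. This is not the actual script: storage access and validation are passed in as callables so the sketch stays independent of S3 and of the model classes, and all names are hypothetical. Validating a second time after the action is an assumption on my part, meant to avoid uploading a schedule that the action has invalidated.

```python
import json
from typing import Callable


def update_schedule(read: Callable[[], bytes],
                    write: Callable[[bytes], None],
                    validate: Callable[[dict], None],
                    action: Callable[[dict], None]) -> dict:
    """
    Read the JSON object, validate the schedule, apply the action, validate
    again and write the object back. Returns the modified schedule.
    """
    # In the real utility, `read` and `write` would presumably wrap the S3
    # calls that fetch and upload the object in the shared bucket.
    document = json.loads(read())
    schedule = document['maintenance']['schedule']
    validate(schedule)  # reject an object that is already invalid
    action(schedule)    # e.g. add, cancel, start or end an event
    validate(schedule)  # assumption: never upload an invalid schedule
    write(json.dumps(document, indent=4).encode())
    return schedule
```

If either validation raises, nothing is written and the object in the bucket stays as it was.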

The command line utility should have roughly the following synopsis:

list: list events in JSON form
- --all (include completed events; the default is to list only active and pending events)
add: schedule an event
- --start {iso_datetime} (any abbreviations allowed by the datetime module are OK; implicit (local) or explicit timezones should also be accepted)
- --duration {iso_duration} (or whatever human-readable shorthand is common/popular in Python)
- --description {text}
- --partial-responses {catalog_name} [{catalog_name} ...]
- --degraded-performance {catalog_name} [{catalog_name} ...]
- --service-unavailable {catalog_name} [{catalog_name} ...]
cancel: cancel a pending event
- --index {number}
start: activate the next pending event
end: complete the active event
adjust: modify the active event
- --duration {iso_duration} (or whatever human-readable shorthand is common/popular in Python)
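One possible way to express that synopsis with argparse is sketched below. It is only a sketch: the duration shorthand (a number followed by d, h, m or s) is a hypothetical stand-in, since the standard library has no ISO 8601 duration parser, and which options are required is also an assumption.

```python
import argparse
from datetime import datetime, timedelta, timezone


def parse_start(value: str) -> datetime:
    # astimezone() interprets a value without an offset as local time
    return datetime.fromisoformat(value).astimezone(timezone.utc)


def parse_duration(value: str) -> timedelta:
    # Hypothetical shorthand, e.g. "4d", "15m" or "900s"
    units = {'d': 'days', 'h': 'hours', 'm': 'minutes', 's': 'seconds'}
    if len(value) < 2 or value[-1] not in units:
        raise argparse.ArgumentTypeError(f'Invalid duration: {value!r}')
    # A ValueError from int() is reported by argparse as a usage error
    return timedelta(**{units[value[-1]]: int(value[:-1])})


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog='manage_maintenance.py')
    commands = parser.add_subparsers(dest='command', required=True)

    list_ = commands.add_parser('list', help='list events in JSON form')
    list_.add_argument('--all', action='store_true',
                       help='include completed events')

    add = commands.add_parser('add', help='schedule an event')
    add.add_argument('--start', type=parse_start, required=True)
    add.add_argument('--duration', type=parse_duration, required=True)
    add.add_argument('--description', required=True)
    # One option per kind of impact, each taking one or more catalog names
    for kind in ('partial-responses', 'degraded-performance', 'service-unavailable'):
        add.add_argument('--' + kind, nargs='+', metavar='CATALOG', default=[])

    cancel = commands.add_parser('cancel', help='cancel a pending event')
    cancel.add_argument('--index', type=int, required=True)

    commands.add_parser('start', help='activate pending event')
    commands.add_parser('end', help='complete the active event')

    adjust = commands.add_parser('adjust', help='modify the active event')
    adjust.add_argument('--duration', type=parse_duration, required=True)
    return parser
```

For example, `make_parser().parse_args(['add', '--start', '2024-03-24T18:00:00+00:00', '--duration', '4d', '--description', 'Reindex', '--partial-responses', 'dcp*'])` yields a namespace whose `duration` is a four-day timedelta and whose `partial_responses` is `['dcp*']`.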

Some of those top-level commands correspond to the model methods. For others, model methods need to be implemented.

Promotion PR checklist items should be added accordingly.

There will soon be multiple deployments that share the prod ES domain. When a reindex is scheduled for prod all other deployments will be impacted with degraded performance. The CL items should remind the operator of this caveat.

Open questions:

The issue title ends in "maximizing system availability", but we need to define what that means and what measures to take. I don't think announcing maintenance windows is actually going to alleviate user frustration. It's just the first thing people think of when the system is impacted by ongoing, unannounced maintenance. Even once we announce maintenance, users will still be frustrated. If we have the money, I would prefer investing time in A/B deployments. Until then, the approach defined above is probably still the cheaper one overall (no infrastructure cost, some development cost, and moderate operational cost).

Giving notice of 6 days as specified in the description will likely increase the promotion latency by a week. We decide on Tue/Wed what to promote, file the promotion PR on Wed, approve it on Wed/Thu and promote on Thu/Fri. We can schedule maintenance when the PR is approved. Friday promotions are obviously better for our "business" users but tend to disrupt the operator's and system administrator's weekend. We may have to move this around and decide if we want to commit to scheduling maintenance for the weekends, and how to compensate our team members for that type of overtime.

dsotirho-ucsc commented 4 months ago

@hannes-ucsc: "Both open questions were discussed in PL. Assignee to consider next steps."

hannes-ucsc commented 4 months ago

As agreed in PL, the "maximizing system availability" part was moved to another issue.

As agreed in PL, the six-day announcement is aspirational. It would delay every promotion by a week and increase the HCA release latency by that amount of time, which conflicts with another objective given to us by stakeholders in earlier guidance.

The typical promotion timeline will be:

1) Tue: Decide what to promote, file PR

2) Wed: Review and approve PR, use scripts/manage_maintenance.py in stable deployments to schedule maintenance window, Data Browser automatically displays an announcement

3) Fri: Perform promotion, start reindex

4) Mon: Tend to reindex, triage errors

bvizzier-ucsc commented 4 months ago

This looks like a reasonable plan to me.

kayleemathews commented 3 months ago

We just had a user write in to the Support Center asking for this. Is this something we can prioritize?

bvizzier-ucsc commented 3 months ago

@kayleemathews This is being worked on; however, it sometimes gets bumped by higher-priority work. It will require work by both the Azul (back-end) team and the Data Browser (front-end) team. The expectation is that we might have something "in a few weeks" due to overall AnVIL & HCA priorities.

DataBiosphere / azul

Announce maintenance in stable deployments #5979
