CogitoNTNU / TutorAI

TutorAI is a RAG system that helps students learn academic subjects by drawing on the curriculum and citing it. The application ingests a textbook in most formats and facilitates efficient learning of the course material.
MIT License

Implement Failover Mechanism for Critical Dependencies to Ensure 99% Uptime #113

Open SverreNystad opened 4 months ago

SverreNystad commented 4 months ago


Description

To meet the Availability quality requirement A1, which states: "System uptime must be 99%, with capabilities to handle critical operations around the clock," we need to address the uptime dependencies of TutorAI on our commercial off-the-shelf (COTS) solutions, specifically OpenAI and MongoDB.

Current Issue

  1. OpenAI:

    • Uptime Guarantee: OpenAI does not provide a Service Level Agreement (SLA) guaranteeing any specific uptime.
    • Track Record: OpenAI does not consistently achieve 99% uptime.
    • Impact: Without a failover mechanism, any downtime from OpenAI directly affects TutorAI's availability.
  2. MongoDB:

    • Uptime Guarantee: MongoDB provides an SLA guaranteeing at least 99% uptime (as per their SLA documentation).
    • Impact: Despite the SLA, downtime would still disrupt major functionalities of TutorAI.
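To make the arithmetic behind the issue concrete, here is a rough sketch of how dependency availabilities combine. The figures are illustrative assumptions, not measured uptimes for OpenAI or MongoDB, and the formulas assume that failures are independent.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def serial_availability(*availabilities: float) -> float:
    """Availability of a system that needs every dependency up at once."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(*availabilities: float) -> float:
    """Availability when any one of several interchangeable providers suffices."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


# Illustrative numbers only (assumed, not measured).
print(allowed_downtime_hours(0.99))        # about 87.6 hours per year
print(serial_availability(0.99, 0.99))     # about 0.9801, below the 99% target
print(redundant_availability(0.99, 0.99))  # about 0.9999
```

Under these assumptions, two dependencies that each reach exactly 99% already pull the combined figure below 99%, which is why a failover path for each dependency matters.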

Proposed Solution

To ensure TutorAI meets its uptime requirement, we must implement a failover mechanism for both OpenAI and MongoDB:

  1. For OpenAI:

    • Develop a failover system that automatically switches API usage to an alternative Large Language Model (LLM) provider, such as Gemini, Claude, Llama, or Grok, during OpenAI outages.
  2. For MongoDB:

    • Implement a fallback solution for critical database operations. This could involve setting up a secondary database system or utilizing a distributed database architecture to minimize downtime impact.
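For the MongoDB item, a minimal sketch of an application-level read fallback could look like the following. `InMemoryStore`, `StoreUnavailable`, and `with_fallback` are hypothetical names used for illustration; they are not part of TutorAI or of any MongoDB driver. Note that MongoDB replica sets already provide automatic failover between nodes, so this only covers the case where the whole primary deployment is unreachable.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class StoreUnavailable(Exception):
    """Hypothetical error a store raises when it cannot be reached."""


class InMemoryStore:
    """Stand-in for a database client; not a real MongoDB API."""

    def __init__(self, name: str, documents: Dict[str, dict], available: bool = True) -> None:
        self.name = name
        self.documents = documents
        self.available = available

    def find(self, doc_id: str) -> dict:
        # Simulate an outage by raising instead of answering.
        if not self.available:
            raise StoreUnavailable(f"{self.name} is unreachable")
        return self.documents[doc_id]


def with_fallback(primary: Callable[[], T], secondary: Callable[[], T]) -> T:
    """Run the primary operation; on unavailability, run the secondary one."""
    try:
        return primary()
    except StoreUnavailable:
        return secondary()


documents = {"ch1": {"title": "Chapter 1", "pages": 12}}
primary_store = InMemoryStore("primary", documents, available=False)
secondary_store = InMemoryStore("secondary", dict(documents))

chapter = with_fallback(
    lambda: primary_store.find("ch1"),
    lambda: secondary_store.find("ch1"),
)
print(chapter["title"])  # Chapter 1
```

This sketch covers reads only. Falling back for writes would also need a plan for syncing the secondary store back to the primary once it recovers.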

Action Items

Conclusion

Implementing these failover mechanisms is crucial to ensuring that TutorAI can achieve the required 99% uptime, thus maintaining reliable operations around the clock despite potential downtime from our COTS dependencies.

SverreNystad commented 4 months ago

The Chain of Responsibility pattern could be useful for handling the failover. Chain of Responsibility is a behavioral design pattern that passes a request along a chain of potential handlers until one of them handles it. Read more here
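A minimal sketch of how that pattern might look for LLM failover, assuming each provider is wrapped in a plain function. `LLMHandler`, `ProviderUnavailable`, and the stand-in provider functions are hypothetical; real code would call the providers' own client libraries and map their errors to `ProviderUnavailable`.

```python
from __future__ import annotations

from typing import Callable, Optional


class ProviderUnavailable(Exception):
    """Hypothetical error signalling that a provider cannot serve the request."""


class LLMHandler:
    """One link in the chain: tries its provider, then defers to the next link."""

    def __init__(self, name: str, complete: Callable[[str], str]) -> None:
        self.name = name
        self._complete = complete
        self._next: Optional[LLMHandler] = None

    def set_next(self, handler: LLMHandler) -> LLMHandler:
        # Returning the next handler allows chained set_next() calls.
        self._next = handler
        return handler

    def handle(self, prompt: str) -> str:
        try:
            return self._complete(prompt)
        except ProviderUnavailable:
            if self._next is None:
                raise  # end of the chain: nobody could handle the request
            return self._next.handle(prompt)


def unavailable_provider(prompt: str) -> str:
    """Stand-in for a provider that is currently down."""
    raise ProviderUnavailable("provider is down")


def echo_provider(prompt: str) -> str:
    """Stand-in for a working provider."""
    return f"[backup-b] {prompt}"


primary = LLMHandler("openai", unavailable_provider)
primary.set_next(LLMHandler("backup-a", unavailable_provider)).set_next(
    LLMHandler("backup-b", echo_provider)
)

answer = primary.handle("What is RAG?")
print(answer)  # [backup-b] What is RAG?
```

The order of the chain encodes provider priority, so adding or reordering providers only changes how the chain is wired, not the calling code.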