Problem Statement:

The way a user interacts with jobs in the UI that are run via local execution mode is different from jobs run via the scheduler and this has created a disjoint user experience. There is also no easy way identify which jobs on the UI are a result of local execution.

In addition, as we have focused more on flushing out features of the Foundations Scheduler and job submission, local execution has been largely ignored.

This design proposes a future path for Local Execution mode for users of Jupyter Notebooks or more broadly, self-hosted infrastructure, to give them features that are at parity with submitted jobs (tracking, archiving, saving artifacts & manipulation via the GUI).

Design:

Important elements of this design are:

Removing all the side-effects of import foundations (job creation/config loading/archiving)
Providing users explicit control on job creation/termination and read in job.config.yaml for local execution jobs
Removing the concepts of execution and submission configurations, providing sensible defaults and giving the user explicit control over configurations at runtime.
Disabling automatic archiving of jobs and giving the user control of enabling archiving and the archive endpoints. This includes bringing back support for GCP buckets.
Move scheduler-specific aspects of the submission config to the server-side to be completed as part of the Atlas server setup.
Create a way for metadata about jobs to be shared with a centralized service for both locally executed and scheduled jobs (See Related Design Discussions.)

1. Simplify the import

The import currently does a few main things:

Load the execution configuration
Set job attributes from environment variables
Communicate with the tracker to queue & immediately start the job
Update the job status at the end of the job in the _at_exit_callback()
Archive the job

Item 1 can be moved to Design Element 3 below. Items 3 & 4 above can be moved out of here and into Design Element 2 below. Item 5 can be moved to Design Element 4 below

For Item 2, the following two environment variables are used to set job attributes: FOUNDATIONS_JOB_ID: This has identical behavior between submitted and locally executed jobs FOUNDATIONS_PROJECT_NAME: This is passed as an environment variable for submitted jobs after a hierarchical check of job.config.yaml, command line arguments and base directory name. We can make locally executed jobs consistent with this by performing a similar check after loading in the job.config.yaml (see Design Element 3) below.

2. Start/Stop jobs manually

User's currently operating in a Jupyter Notebook need to manually start and stop an Atlas job since the environment checks when doing an import foundations do not automatically trigger a job.

The use-case is for people that want to use jupyter notebooks for running Atlas experiments and track jobs/artifacts/etc on some type of managed notebook service.

We could provide users with two SDK functions as well as a context manager that would allow starting & stopping a job. This could look something like:

foundations.start_job()
// Do something here
foundations.stop_job()

with FoundationsJob() as job:
    // Do something here

In either case, we should examine whether the use of Job() in foundations_sdk/src/foundations/job.py makes sense.

Notes on backwards compatibility: We need to decide whether we want to support both this new mechanism of user-initiated jobs as well as automatic jobs created by the local execution of scripts. If a user executes code locally with the above code snippets in place, then two jobs will be initiated (or things might just break). This also applies to jobs launched to the scheduler.

We could either:

Deprecate the use of jobs being triggered automatically when doing an import foundations. This tied into a broader goal of stripping away all the behind the scenes magic the import statement is doing and giving control back to the user. This breaks backward compatibility for people that rely on local execution for tracking and nothing else.
Add in additional environment checks and throw exceptions if a user is trying to execute a script that contains the manual job creation SDK functions and prompt them to remove those before running/submitting their script.

3. What's in a config?

This section will deal with the client side config. For the server-side config, see Design Element 5:

On the client-side:

No FOUNDATIONS_HOME directory
A server.config.yaml in the job directory that contains
- scheduler_url=<hostname>:<port>
- archive_end_point_type=gcp|aws|nfs|localfilesystem
- archive_end_point=<path_to_storage>
A job.config.yaml which contains the following additional parameter in addition to existing ones:
- archive_job:True|False

We could then follow the existing paradigm of loading in both the job.config.yaml and server.config.yaml for both local execution and submitted jobs instead of default.config.yaml for the former and submission.config.yaml for the latter.

The server.config.yaml can be generated as part of the Atlas server setup and distributed to clients instead of the current manual editing we do for the submission and execution configs.

From a user's perspective, tracking and archiving features need to be explicitly enabled in user code if desired. e.g.

foundations.config.archive_job=True
foundations.config.redis_endpoint='<myredisendpoint>:5556

foundations.start_job()
// Train epic model
foundations.stop_job()

We would need additional checks to ensure that a redis_end_point is available if a user is initiating a job.

The caveat here is that the user would need to update the redis endpoint configuration when tracking a job on a remote Atlas Server from code executed locally (serverhost:5556) vs via the scheduler (foundations_tracker:5556).

For users that are running Atlas server locally, we could circumvent this problem by allowing for container DNS resolution and using foundations_tracker:5556

Background info on client-side configs:

Execution config located in default.config.yaml which contains:
- log_level
- archive_end_point
- redis_end_point
Submission config location in scheduler.config.yaml which contains:
- job_deployment_env for specifying the scheduler plugin
- scheduler_url
- Various server side paths for the scheduler to know and propagate the correct volume mounts.
Job configuration located in job.config.yaml which contains various job specific configuration options.

Open questions: Can we re-use the ConfigManager to support this design Is making this backwards compatible a requirement?

4. Archivez

One of the issues that creates a disparate experience for local execution vs scheduled jobs is the availability of job archives. This specific issue can be addressed by leveraging the config changes described in Design Element 3 above to create a user experience where:

The user is in control of whether or not to archive a job using foundations.config.archive_job=True|False
The archive endpoint of the job is specified in the server.config.yaml to which both the archive server and client machine have access.

The archive endpoint is determined during the Atlas server setup which can be either remote or local to the client machine:

For local Atlas servers, the archive location can be specified to a path on the local filesystem that is mounted into the archive server (current situation)
For remote Atlas servers, the archive location has to be a remote location (localfilesystem on server, network share, etc.) and not the users local file system. The client machine needs to have direct access (via NFS mounts or bucket permissions) to this remote location to enable archiving of locally executed jobs.

Thought: This can be improved on by streaming the archive to the scheduler like we do with the inital job payload for scheduled jobs to circumvent the direct access requirements

5. All these Paths!

TL;DR We can largely leave things as is since we've deprecated the submission and execution configs for both the client and worker in Design Element 3.

We could perform a cleanup of the submission config on the server side as follows

container_config_root: Deprecated
job_results_root: Deprecated. Available in server.config.yaml (archive_end_point)
job_store_dir_root: Move to scheduler specific config
scheduler_url: Deprecated. Available in server.config.yaml
working_dir_root: Move to scheduler specific config

We could then merge the two existing scheduler config files described below and include the job_store_dir_root and working_dir_root in there since these are scheduler specific locations.

Background info on server-side configs: Server-side config:

Atlas server config located in atlas.config.yaml which contains:
- Container images to use
- Auth proxy config
- Docker config & socket
- Redis database location
- Tensorboard working directories
Scheduler tracker config in tracker_client_plugins.yaml
Scheduler database config in database.config.yaml
Scheduler worker configurations which mirror the client-side submission and execution configs with the exception of using endpoint hostnames within the Docker network instead of localhost.

Note that all of the server side configs are written to the client's FOUNDATIONS_HOME and are unused. Another side-effect of our configuration setup is that when setting up a team or remote installation, we have to copy over the execution & submission config files from the server-side, edit the redis_end_point and scheduler_url and replace the client's local copy of these files.

Related issues:

131 - This ticket provides some sample code for implementing the SDK functions described above.

73 - This ticket asks for deletion of locally executed jobs in the GUI which is currently not possible.

Related Design Discussions:

164

On the issue about disjoint UX between local and scheduled job, I think the gap is that Atlas is ignorant about the following pieces of data:

where the job is being executed (so that Atlas can surface the job status and any applicable job control features in GUI or CLI, e.g. if there is no scheduler, user will not be able to stop the job or dequeue it)
- currently the GUI and CLI just assumes that the scheduler is where the job is being executed/was executed
if and where the job is archived (so that the UI can point to it, if the view can reach it, and potentially do things like deleting or retrieving the archive for the user)
- currently the GUI and CLI just assumes that the job is archived with the archive server
if and where the artifacts are stored (similar to 2 above)

I suggest that if we were to tackle the issue of discrepant UX, we will need to capture these 3 pieces of data at minimum and make the existing GUI/CLI features be driven by this data. This work would be orthogonal to the above design proposal.

For #131, the manual Start/Stop proposal alone seems to be able to address that problem. I'm not clear on whether the other design elements are needed to address that concern. @shazraz can you share what problems those other design elements are trying to address so we can assess a) whether there are alternative solutions and b) whether those issues are of high priority?

To address issue #73, is the suggestion in this design proposal that users configure their local execution to use other archiving methods (e.g. design element 4 above) instead of the local disk option? If the user does not have access to those other options, what would be the solution?

can you share what problems those other design elements are trying to address so we can assess

@ekhl, my thought was that these design elements would collectively address the changes needed to close the gap between scheduled and execution modes. The overall goal is to work towards something that is conceptually (e.g. user not caring about modes) and practically (e.g. less black box behaviour) simpler while keeping in mind the inter-dependencies between the features.

e.g. the creation of the start, stop methods creates additional things to keep in mind for the user when running scripts outside of notebooks. Similarly, the separation we have between execution and submission configurations conceptually exacerbates this divide between the two modes. Collapsing these into something more like a server.config.yaml and a job.config.yaml may make more sense. I'll flush out this thought in section 3 above.

a) whether there are alternative solutions

I'm hoping we can use this proposal as a starting point to collectively reach a cleaner design.

b) whether those issues are of high priority?

In my opinion, there may be a few high priority issues that come out of this but what I'm really hoping is that by agreeing collectively on a design, we can provide guardrails for future feature development and avoid landing in a similar situation where developing new features has resulted in an unexpected overall design.

if and where the job is archived (so that the UI can point to it, if the view can reach it, and potentially do things like deleting or retrieving the archive for the user)

currently the GUI and CLI just assumes that the job is archived with the archive server

if and where the artifacts are stored (similar to 2 above)

To address issue #73, is the suggestion in this design proposal that users configure their local execution to use other archiving methods (e.g. design element 4 above) instead of the local disk option? If the user does not have access to those other options, what would be the solution?

I'm going to treat artifacts and archives as the same issue in the discussion below since we store them in similar ways (although we could potentially split this up and treat them differently which would be another design to explore). I'm still formulating thoughts on this but here are some constraints I'm thinking about.:

For a job to have archives, a user needs to explicitly turn on archiving for a job regardless of where its executed.
For archives & artifacts to be accessible via the command-line and GUI, both the client and the server need to have access to that storage location.

Things get fuzzy when a user decides to track a job in Atlas but decides to archive the job in a location that the server does not have access to. We can address this in a few ways:

Not allow this. This could look like something like turning on/off archiving at the job level and but configuring the archive endpoint at the server level. Chances are that a user that wants to archive locally is also running Atlas server locally so the problem is contained within the two constraints described above.
Provide a way to retrieve archives from arbitrary locations and still be able to manipulate those jobs identically in the GUI. I'm not sure what a solution to this would look like.
Somehow mark jobs that have been archives in a location not accessible to the server on the UI. I think this increases complexity.

I'm leaning towards 1. above.

I suggest that if we were to tackle the issue of discrepant UX, we will need to capture these 3 pieces of data at minimum and make the existing GUI/CLI features be driven by this data. This work would be orthogonal to the above design proposal. I

One approach to capture this data could be to notify the scheduler (or some other service that the scheduler would also then listen to) of the job spec for all jobs whether executed locally or via a scheduler and then try and make the behavior consistent for remotely executed jobs (i.e. how does this work when the job is started up and run by the scheduler to avoid duplicate notifications?)

I think this issue is at the crux of closing the feature gap actually, I must have forgotten to include this as a design element. I'm also open to the idea of renaming this design and starting a new design discussion around just this item so that this proposal doesn't blow up even more.

I suggest that if we were to tackle the issue of discrepant UX, we will need to capture these 3 pieces of data at minimum and make the existing GUI/CLI features be driven by this data. This work would be orthogonal to the above design proposal.

I think this issue is at the crux of closing the feature gap actually, I must have forgotten to include this as a design element. I'm also open to the idea of renaming this design and starting a new design discussion around just this item so that this proposal doesn't blow up even more.

I agree with this: can you start up a new issue that focuses on this gap of discrepant UX? I think if we can be crisp in these user-facing issues, it'll help inform what the overall design pattern should look like.

In my opinion, there may be a few high priority issues that come out of this but what I'm really hoping is that by agreeing collectively on a design, we can provide guardrails for future feature development and avoid landing in a similar situation where developing new features has resulted in an unexpected overall design.

There are multiple issues that this one design discussion is trying to tackle. Though I think it's helpful to group all of these issues together to paint a consistent user experience, I'm trying to advocate for tackling these issues in independently and in priority order so we have fewer moving parts and can get feedback on whether our design choices are sound.

In my opinion, I think the inability to control the scope of a job (#131) and the gap of discrepant UX (issue not created yet) are highest priority and we can dive into those in more granular detail in the respective issues.