shazraz opened this issue 4 years ago
On the issue about disjoint UX between local and scheduled jobs, I think the gap is that Atlas is ignorant about the following pieces of data:
I suggest that if we were to tackle the issue of discrepant UX, we will need to capture these 3 pieces of data at minimum and make the existing GUI/CLI features be driven by this data. This work would be orthogonal to the above design proposal.
For #131, the manual Start/Stop proposal alone seems to be able to address that problem. I'm not clear on whether the other design elements are needed to address that concern. @shazraz can you share what problems those other design elements are trying to address so we can assess a) whether there are alternative solutions and b) whether those issues are of high priority?
To address issue #73, is the suggestion in this design proposal that users configure their local execution to use other archiving methods (e.g. design element 4 above) instead of the local disk option? If the user does not have access to those other options, what would be the solution?
> can you share what problems those other design elements are trying to address so we can assess
@ekhl, my thought was that these design elements would collectively address the changes needed to close the gap between the scheduled and local execution modes. The overall goal is to work towards something that is conceptually (e.g. the user not caring about modes) and practically (e.g. less black-box behaviour) simpler, while keeping in mind the inter-dependencies between the features.
e.g. the creation of the start/stop methods creates additional things for the user to keep in mind when running scripts outside of notebooks. Similarly, the separation we have between execution and submission configurations conceptually exacerbates this divide between the two modes. Collapsing these into something more like a `server.config.yaml` and a `job.config.yaml` may make more sense. I'll flesh out this thought in section 3 above.
> a) whether there are alternative solutions
I'm hoping we can use this proposal as a starting point to collectively reach a cleaner design.
> b) whether those issues are of high priority?
In my opinion, there may be a few high priority issues that come out of this but what I'm really hoping is that by agreeing collectively on a design, we can provide guardrails for future feature development and avoid landing in a similar situation where developing new features has resulted in an unexpected overall design.
> 2. if and where the job is archived (so that the UI can point to it, if the view can reach it, and potentially do things like deleting or retrieving the archive for the user)
>    - currently the GUI and CLI just assume that the job is archived with the archive server
> 3. if and where the artifacts are stored (similar to 2 above)
> To address issue #73, is the suggestion in this design proposal that users configure their local execution to use other archiving methods (e.g. design element 4 above) instead of the local disk option? If the user does not have access to those other options, what would be the solution?
I'm going to treat artifacts and archives as the same issue in the discussion below since we store them in similar ways (although we could potentially split this up and treat them differently, which would be another design to explore). I'm still formulating thoughts on this, but here are some constraints I'm thinking about:
Things get fuzzy when a user decides to track a job in Atlas but decides to archive the job in a location that the server does not have access to. We can address this in a few ways:
1. Not allow this. This could look like turning archiving on/off at the job level but configuring the archive endpoint at the server level. Chances are that a user that wants to archive locally is also running the Atlas server locally, so the problem is contained within the two constraints described above.
2. Provide a way to retrieve archives from arbitrary locations and still be able to manipulate those jobs identically in the GUI. I'm not sure what a solution to this would look like.
3. Somehow mark jobs that have been archived in a location not accessible to the server on the UI. I think this increases complexity.
I'm leaning towards 1. above.
> I suggest that if we were to tackle the issue of discrepant UX, we will need to capture these 3 pieces of data at minimum and make the existing GUI/CLI features be driven by this data. This work would be orthogonal to the above design proposal.
One approach to capture this data could be to notify the scheduler (or some other service that the scheduler would also then listen to) of the job spec for all jobs, whether executed locally or via a scheduler, and then try to make the behavior consistent for remotely executed jobs (i.e. how does this work when the job is started up and run by the scheduler, so that we avoid duplicate notifications?).
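As a very rough illustration of what capturing this could look like (everything here, the class, method, and field names, is hypothetical and only meant to make the idea concrete):

```python
# Hypothetical sketch only: the class, method, and field names are assumptions meant to
# make the idea concrete. Local and scheduled executions would both emit the same job
# spec; the tracker de-duplicates jobs it already knows about (e.g. jobs the scheduler
# itself started), which avoids duplicate notifications.
class InMemoryTracker:
    def __init__(self):
        self._jobs = {}

    def notify(self, job_spec):
        job_id = job_spec["job_id"]
        if job_id in self._jobs:            # already registered (e.g. by the scheduler)
            self._jobs[job_id].update(job_spec)
        else:
            self._jobs[job_id] = dict(job_spec)


tracker = InMemoryTracker()
tracker.notify({
    "job_id": "local-1234",                 # FOUNDATIONS_JOB_ID
    "project_name": "my-project",           # FOUNDATIONS_PROJECT_NAME
    "execution_mode": "local",              # local vs. scheduler
    "archive_location": None,               # if and where the job is archived
    "artifact_location": "/tmp/artifacts",  # if and where the artifacts are stored
})
```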
I think this issue is at the crux of closing the feature gap actually; I must have forgotten to include this as a design element. I'm also open to the idea of renaming this design and starting a new design discussion around just this item so that this proposal doesn't blow up even more.
> I suggest that if we were to tackle the issue of discrepant UX, we will need to capture these 3 pieces of data at minimum and make the existing GUI/CLI features be driven by this data. This work would be orthogonal to the above design proposal.
> I think this issue is at the crux of closing the feature gap actually; I must have forgotten to include this as a design element. I'm also open to the idea of renaming this design and starting a new design discussion around just this item so that this proposal doesn't blow up even more.
I agree with this: can you start up a new issue that focuses on this gap of discrepant UX? I think if we can be crisp in these user-facing issues, it'll help inform what the overall design pattern should look like.
> In my opinion, there may be a few high priority issues that come out of this but what I'm really hoping is that by agreeing collectively on a design, we can provide guardrails for future feature development and avoid landing in a similar situation where developing new features has resulted in an unexpected overall design.
There are multiple issues that this one design discussion is trying to tackle. Though I think it's helpful to group all of these issues together to paint a consistent user experience, I'm trying to advocate for tackling these issues independently and in priority order so we have fewer moving parts and can get feedback on whether our design choices are sound.
In my opinion, I think the inability to control the scope of a job (#131) and the gap of discrepant UX (issue not created yet) are highest priority and we can dive into those in more granular detail in the respective issues.
> There are multiple issues that this one design discussion is trying to tackle. Though I think it's helpful to group all of these issues together to paint a consistent user experience, I'm trying to advocate for tackling these issues independently and in priority order so we have fewer moving parts and can get feedback on whether our design choices are sound.
Yup, this is my thought as well. I see the design discussions as a way to surface multiple somewhat related issues that will then be captured and prioritized as their own tickets.
Reminder to all that deadline to give any design feedback is EOD today.
Problem Statement:
The way a user interacts with jobs in the UI that are run via local execution mode is different from jobs run via the scheduler, and this has created a disjoint user experience. There is also no easy way to identify which jobs on the UI are a result of local execution.
In addition, as we have focused more on fleshing out features of the Foundations Scheduler and job submission, local execution has been largely ignored.
This design proposes a future path for Local Execution mode for users of Jupyter Notebooks or, more broadly, self-hosted infrastructure, to give them features that are at parity with submitted jobs (tracking, archiving, saving artifacts & manipulation via the GUI).
Design:
Important elements of this design are:

- `import foundations` (job creation/config loading/archiving)
- `job.config.yaml` for local execution jobs

1. Simplify the import
The import currently does a few main things:
- `_at_exit_callback()`

Item 1 can be moved to Design Element 3 below. Items 3 & 4 above can be moved out of here and into Design Element 2 below. Item 5 can be moved to Design Element 4 below.
For Item 2, the following two environment variables are used to set job attributes:

- `FOUNDATIONS_JOB_ID`: This has identical behavior between submitted and locally executed jobs.
- `FOUNDATIONS_PROJECT_NAME`: This is passed as an environment variable for submitted jobs after a hierarchical check of `job.config.yaml`, command line arguments and the base directory name. We can make locally executed jobs consistent with this by performing a similar check after loading in the `job.config.yaml` (see Design Element 3 below); a rough sketch of such a check follows this list.
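A minimal sketch of the hierarchical check described above, assuming PyYAML, a `project_name` key in `job.config.yaml`, and a `--project-name` CLI flag; none of these names are the actual Foundations implementation:

```python
# Minimal sketch of the hierarchical project-name resolution; config key, CLI flag,
# and helper name are assumptions for illustration only.
import argparse
import os
import yaml

def resolve_project_name(job_dir="."):
    # 1. job.config.yaml, if it defines a project name
    config_path = os.path.join(job_dir, "job.config.yaml")
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f) or {}
        if config.get("project_name"):
            return config["project_name"]

    # 2. command line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--project-name", default=None)
    args, _ = parser.parse_known_args()
    if args.project_name:
        return args.project_name

    # 3. fall back to the base directory name
    return os.path.basename(os.path.abspath(job_dir))

os.environ.setdefault("FOUNDATIONS_PROJECT_NAME", resolve_project_name())
```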
2. Start/Stop jobs manually

Users currently operating in a Jupyter Notebook need to manually start and stop an Atlas job, since the environment checks done on `import foundations` do not automatically trigger a job. The use-case is for people that want to use Jupyter notebooks for running Atlas experiments and track jobs/artifacts/etc. on some type of managed notebook service.
We could provide users with two SDK functions as well as a context manager that would allow starting & stopping a job. This could look something like:
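A shape sketch only; the names `start_job`/`stop_job` and their arguments are assumptions (issue #131 referenced below contains actual sample code for these SDK functions):

```python
# Placeholder names for the two proposed SDK functions; treat this as a shape sketch,
# not an existing Foundations API.
import foundations

foundations.start_job(project_name="my-project")
try:
    # ... notebook cells: training code, metric logging, artifact saving ...
    pass
finally:
    foundations.stop_job()
```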
OR
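A context-manager variant; `foundations.create_job` and its signature are likewise assumed names for illustration:

```python
# Placeholder for the context-manager form; the name and signature are assumptions.
import foundations

with foundations.create_job(project_name="my-project"):
    # ... training / experiment code; the job is stopped automatically on exit ...
    pass
```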
In either case, we should examine whether the use of `Job()` in `foundations_sdk/src/foundations/job.py` makes sense.

Notes on backwards compatibility: We need to decide whether we want to support both this new mechanism of user-initiated jobs as well as the automatic jobs created by the local execution of scripts. If a user executes code locally with the above code snippets in place, then two jobs will be initiated (or things might just break). This also applies to jobs launched to the scheduler.
We could either:

- Deprecate the use of jobs being triggered automatically when doing an `import foundations`. This ties into the broader goal of stripping away all the behind-the-scenes magic the `import` statement is doing and giving control back to the user. This breaks backward compatibility for people that rely on local execution for tracking and nothing else.
- Add additional environment checks and throw exceptions if a user is trying to execute a script that contains the manual job-creation SDK functions, and prompt them to remove those before running/submitting their script (a rough sketch of such a guard follows this list).
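A rough sketch of what the guard in the second option could look like; it assumes `FOUNDATIONS_JOB_ID` is set whenever a job was already created for this execution (e.g. by the scheduler), and the exception type and wording are placeholders:

```python
# Rough sketch of the proposed guard; the exception type and the check itself are
# assumptions for illustration, not the actual Foundations behaviour.
import os

class ManualJobConflictError(RuntimeError):
    pass

def start_job(**job_attributes):
    if os.environ.get("FOUNDATIONS_JOB_ID"):
        raise ManualJobConflictError(
            "A job has already been created for this execution; remove the manual "
            "start_job()/stop_job() calls before running or submitting this script."
        )
    # ... otherwise create and register the user-initiated job ...
```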
3. What's in a config?
This section will deal with the client-side config. For the server-side config, see Design Element 5.
On the client-side:

- The existing submission and execution configs in the `FOUNDATIONS_HOME` directory are deprecated.
- A `server.config.yaml` in the job directory that contains:
  - `scheduler_url=<hostname>:<port>`
  - `archive_end_point_type=gcp|aws|nfs|localfilesystem`
  - `archive_end_point=<path_to_storage>`
- A `job.config.yaml` which contains the following parameter in addition to the existing ones:
  - `archive_job:True|False`
We could then follow the existing paradigm of loading in both the `job.config.yaml` and `server.config.yaml` for both local execution and submitted jobs, instead of `default.config.yaml` for the former and `submission.config.yaml` for the latter.
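A minimal sketch of that unified loading paradigm, assuming PyYAML and the keys proposed above; the merge precedence (job config overriding server config) is an assumption, not settled design:

```python
# Minimal sketch: both configs are loaded the same way for local and submitted jobs.
import os
import yaml

def load_config(job_dir="."):
    def read(name):
        path = os.path.join(job_dir, name)
        if not os.path.exists(path):
            return {}
        with open(path) as f:
            return yaml.safe_load(f) or {}

    server_config = read("server.config.yaml")  # scheduler_url, archive_end_point_type, archive_end_point
    job_config = read("job.config.yaml")        # archive_job plus the existing job options
    # same loading path for local execution and submitted jobs
    return {**server_config, **job_config}

config = load_config()
```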
The `server.config.yaml` can be generated as part of the Atlas server setup and distributed to clients instead of the current manual editing we do for the submission and execution configs.

From a user's perspective, tracking and archiving features need to be explicitly enabled in user code if desired, e.g.:
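For illustration only; `archive_job` is the flag proposed in Design Element 4 below, while `track_job` and the `foundations.config` object are assumed names:

```python
# Illustrative only: archive_job mirrors the flag proposed in Design Element 4;
# track_job and the foundations.config object are assumed names.
import foundations

foundations.config.track_job = True     # explicitly opt in to tracking this run as an Atlas job
foundations.config.archive_job = True   # explicitly opt in to archiving (see Design Element 4)
```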
We would need additional checks to ensure that a `redis_end_point` is available if a user is initiating a job. The caveat here is that the user would need to update the redis endpoint configuration when tracking a job on a remote Atlas Server from code executed locally (`serverhost:5556`) vs. via the scheduler (`foundations_tracker:5556`).
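A minimal sketch of such an availability check, assuming the redis-py client and a `redis_end_point` of the form `redis://<host>:<port>` taken from the loaded config:

```python
# Minimal pre-flight check; the URL form and error message are assumptions, not the
# actual Foundations behaviour.
import redis

def ensure_tracker_reachable(redis_end_point):
    try:
        redis.Redis.from_url(redis_end_point, socket_connect_timeout=2).ping()
    except redis.exceptions.ConnectionError as error:
        raise RuntimeError(
            f"Cannot reach the Atlas tracker at {redis_end_point}; "
            "check the redis_end_point in your config before starting a job."
        ) from error

# e.g. redis://serverhost:5556 when executing locally, redis://foundations_tracker:5556 via the scheduler
ensure_tracker_reachable("redis://serverhost:5556")
```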
For users that are running the Atlas server locally, we could circumvent this problem by allowing for container DNS resolution and using `foundations_tracker:5556`.
Background info on client-side configs:

- `default.config.yaml`, which contains:
  - `log_level`
  - `archive_end_point`
  - `redis_end_point`
- `scheduler.config.yaml`, which contains:
  - `job_deployment_env` for specifying the scheduler plugin
  - `scheduler_url`
- `job.config.yaml`, which contains various job-specific configuration options.

Open questions:

- Can we re-use the ConfigManager to support this design?
- Is making this backwards compatible a requirement?
4. Archivez
One of the issues that creates a disparate experience for local execution vs scheduled jobs is the availability of job archives. This specific issue can be addressed by leveraging the config changes described in Design Element 3 above to create a user experience where:
- archiving is controlled per job via `foundations.config.archive_job=True|False`
- the archive endpoint is configured in the `server.config.yaml`, to which both the archive server and the client machine have access.

The archive endpoint is determined during the Atlas server setup, which can be either remote or local to the client machine:
Thought: This can be improved on by streaming the archive to the scheduler, like we do with the initial job payload for scheduled jobs, to circumvent the direct access requirements.
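A rough sketch of the archiving decision described in this section; the function name, config keys, and local-filesystem handling are assumptions, and the real implementation would live behind the SDK:

```python
# Rough sketch only: the per-job flag and server-level endpoint come from the configs
# proposed in Design Element 3; everything else is a placeholder.
import os
import shutil

def archive_job_if_enabled(job_id, job_dir, job_config, server_config):
    if not job_config.get("archive_job", False):
        return None  # user opted out of archiving for this job

    end_point_type = server_config["archive_end_point_type"]  # gcp | aws | nfs | localfilesystem
    end_point = server_config["archive_end_point"]

    if end_point_type == "localfilesystem":
        # both the client and the archive server must be able to reach this path
        return shutil.make_archive(os.path.join(end_point, job_id), "gztar", job_dir)

    # gcp/aws/nfs back-ends would instead stream the archive to the shared endpoint
    raise NotImplementedError(f"archiving to {end_point_type} is not sketched here")
```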
5. All these Paths!
TL;DR: We can largely leave things as-is, since we've deprecated the submission and execution configs for both the client and worker in Design Element 3.
We could perform a cleanup of the submission config on the server side as follows:
We could then merge the two existing scheduler config files described below and include the `job_store_dir_root` and `working_dir_root` in there, since these are scheduler-specific locations.

Background info on server-side configs:
- Atlas server config located in `atlas.config.yaml`, which contains:
- Scheduler tracker config in `tracker_client_plugins.yaml`
- Scheduler database config in `database.config.yaml`
- Scheduler worker configurations, which mirror the client-side submission and execution configs with the exception of using endpoint hostnames within the Docker network instead of localhost.
Note that all of the server-side configs are written to the client's `FOUNDATIONS_HOME` and are unused. Another side-effect of our configuration setup is that when setting up a team or remote installation, we have to copy over the execution & submission config files from the server side, edit the `redis_end_point` and `scheduler_url`, and replace the client's local copy of these files.

Related issues:
#131 - This ticket provides some sample code for implementing the SDK functions described above.
#73 - This ticket asks for deletion of locally executed jobs in the GUI, which is currently not possible.
Related Design Discussions:
#164