flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.62k stars 620 forks source link

[Feature]Automatic documentation in Flyte #531

Open kumare3 opened 4 years ago

kumare3 commented 4 years ago

Motivation: Why do you think this is important? Flyte is a type safe orchestration platform. It understands the flow of data and quickly becomes a one stop shop for most teams pipelining usecases. As the users build a repository of workflows, tasks and launch plans, it is essential to associate documentation with these entities. The documentation would help new team members quickly ramp up on the projects and individual entities. It also lays the foundation for improved discoverability and shareability. E.g. a workflow name or a task name is not enough in describing what the intention is, but associated documentation could provide a detailed description and insight into the algorithms, business usecase etc.

This document proposes a simple and extensible way to support documentation with all Flyte entities, which is captured in Flyte Console and keeps constantly getting updated with versions.

Goal: What should the final outcome look like, ideally?


@workflow(docs=Documentation(
                     short="This pipeline is used to target the right set of drivers for incentives",
                     long_file="/path/to/file",
                     long_format=Documentation.MARKDOWN,
                     source_code=Documentation.source_code_from_config(),
                     icon="http://....",
                     tags=["planning", "campaign"]
                     )
          )
class DriverTargetingWorkflow():
  ....

@workflow(docs=Documentation(
                     short="This pipeline is used to target the right set of drivers for incentives",
                     # Defaults to source_code=Documentation.source_code_from_config(),
                     #Also long defaults to using the docstring in the format rST
                   )
          )
class DriverTargetingWorkflow():
  """
    My RST documentation
  """
  ....

Config:
[docs]
  Repo: github.com/flyteorg/xyz
  project_icon: http://...
  tags: python, java, spark, dnn

Describe alternatives you've considered NA

Flyte component

[Optional] Propose: Link/Inline Specification of the protobuf in FlyteIDL that will be added to all entities - Workflow, Tasks, LaunchPlans, Project

message SourceCode {
    // File where the code is located
    string file = 1;
    // Line number where the task definition, workflow definition, etc starts at
    int line_number = 2;
    // git repository
    string repo = 3;
    // branch of the repository
    string branch = 4;
    // link to the original repository
    string link = 5;
    // language of the code
    string langugae = 6;
}

message Documentation {
    // short description - no more than 256 characters
    string short = 1;
    // Optional information about the source code
    SourceCode info = 3;
    // Optional Tags for easy searching, categorizing etc
    repeated string tags = 4;
}

message LongDocumentation {
    // long description - no more than 4kb
    string long = 1;
    enum DescriptionFormat {
         UNKNOWN = 0;
         MARKDOWN = 1;
         HTML = 2;
         // python default documentation - comments is rst
         RST = 3;
    }
     // format of the long description
    DescriptionFormat long_format = 2;
    // Optional link to an icon for the entity
    string icon_link = 5;
}

Create*Request(
....
   Documentation docs = ...,
...
)

// We will add a special API to create and associate long form documentation
CreateDocs(
  Identifier id = 1;
  LongDocumentation docs = 2;
)

// The Long documentation will be stored in the Blob store and reference will be added to the Metastore

Additional context The UI should be able to show this information. Also the documentation is implicitly versioned with the entities themselves

Is this a blocker for you to adopt Flyte NA

kumare3 commented 4 years ago

cc @wild-endeavor @honnix @kanterov @narape @jeevb @evalsocket @katrogan @EngHabu @vsbus @schottra

wild-endeavor commented 4 years ago

I like the second option - I think picking it up automatically from source code comments is best.

kumare3 commented 4 years ago

I like the second option - I think picking it up automatically from source code comments is best.

I dont think the above are options, I think both should be supported. Python comments are by default or format rST

katrogan commented 4 years ago

+1 for baking documentation into the workflow definitions.

out of curiosity can file paths (long description, optional icon) be remote? if they're not, how does the process for rendering those in the UI work?

also could we provide a means for configuring workflow statuses to their registered domains? e.g. a workflow is considered under development (by default) when registered in a 'development' domain but fully active when registered in 'production'

schottra commented 4 years ago

active vs. deprecated feels like a duplication of the existing metadata field we have for Workflows/Tasks/LaunchPlans?

honnix commented 4 years ago

This looks great and I am all for attaching doc to workflow. Just a minor concern that this could potentially easily make the payload a few times bigger. Will the size be enforced at the client side or server?

yindia commented 4 years ago

+1 for doing this. It will open more possibility like we can redirect user from flyteconsole to the exact line number in the file. It is helpfull in debuging

kumare3 commented 4 years ago

@honnix & @katrogan I can answer both your questions together hopefully.

About | also could we provide a means for configuring workflow statuses to their registered domains? e.g. a workflow is considered under development (by default) when registered in a 'development' domain but fully active when registered in 'production' I agree with @schottra and I removed it from the list

kumare3 commented 4 years ago

@schottra good point, we can derive it from the existing field. Let me remove it from the docs

gonzedge commented 4 years ago

+1 for doing this too.

schottra commented 4 years ago

@kumare3 Any thoughts on how to map values from a version of a workflow to the overall entity? We struggled with this idea when originally adding metadata fields. description and active are both handled as metadata at a level above the individual version, because we treat all versions of a task/workflow as one single logical entity. The UI is also set up like this, with the Workflow and Task details page showing the NamedEntity, not an individual version.

So I'm wondering what we do if user A pushes version 123abc with a description, and user B pushes version 234def with a different description. Which one becomes the description for the workflow which is shown in the UI?

schottra commented 4 years ago

@kumare3 I'm also just noticing that description is another NamedEntityMetadata field that is being duplicated by this. But if I understand correctly, you're intending to allow a separate description per version of a workflow/task?

kumare3 commented 4 years ago

@kumare3 I'm also just noticing that description is another NamedEntityMetadata field that is being duplicated by this. But if I understand correctly, you're intending to allow a separate description per version of a workflow/task?

@schottra, I did see that the NamedEntityMetadata has a description field. This does not seem to be correctly designed.

  1. it does not maintain a history of the documentation (which I think is useful).
  2. We cannot really easily tie updating the documentation with the registration process and docs are updated usually in this way

On the other hand, I do see that you use the description field, but, for example in the Tasks page, you show the interface of the task, which is not really metadata, but evolves with the version. I think we should show the docs in exactly the same way. Is it ordered by time?

schottra commented 4 years ago

@kumare3 I agree that history of documentation and updating via registration are both useful. The only concern is what criteria to use to determine which version should be shown in the Console.

For the Tasks page, we arbitrarily chose the "latest" version, whatever the first result is when hitting the Task endpoint. That doesn't actually hold any semantic value, since the most recently registered version can be any random version from some person's feature branch. In practice, it's probably most likely correct. But I'm wary of just taking the latest version for descriptions with no actual mechanism in place to let users specify which version is the main/production/definitive version. Thoughts?

kumare3 commented 3 years ago

On second, thought, Maybe not everything should be versioned, but a clean flow to have registration and documentation tied would be great - @katrogan

katrogan commented 3 years ago

@kumare3 that's what the NamedEntity stuff already covers, right? +1 we should improve how we expose it to users to edit and update

kumare3 commented 3 years ago

@kumare3 that's what the NamedEntity stuff already covers, right? +1 we should improve how we expose it to users to edit and update

@katrogan yup, but I think the git links, line no etc should be added to every taskspec?

heidimhurst commented 1 year ago

+1, this would be an excellent feature to roll into an upcoming release

kumare3 commented 1 year ago

This is now infact implemented in the code and backend and not Implemented in UI yet

github-actions[bot] commented 10 months ago

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏