apache / dolphinscheduler

Apache DolphinScheduler is a modern data orchestration platform, agile at creating high-performance workflows with low code
https://dolphinscheduler.apache.org/
Apache License 2.0

[Feature][dolphinscheduler-common] Support workflow review #10387

Closed: kaori-seasons closed this issue 3 months ago

kaori-seasons commented 2 years ago

Search before asking

Description

The current resource center supports HDFS / S3 / local storage. Because of the way files are uploaded and read, we only need to add Git as another file storage backend.

When a user uploads a file to the resource center through ResourcesController, implementations of the StorageOperate interface such as HadoopUtils and S3Utils carry out the file operations.

Eclipse provides a Java Git client, org.eclipse.jgit, which can support Git-based file storage. I will compose the API-related storage operation implementation based on the production-environment examples in the jgit-cookbook.

Use case

The following are two simple Git manipulation examples, which will be further expanded in combination with the jgit-cookbook.

git create

import java.io.File;
import java.io.IOException;

import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.storage.file.FileRepositoryBuilder;
import org.eclipse.jgit.util.FileUtils;

// RepositoryProvider and StorageOperateTransformException are project-specific types.
public class RepositoryProviderExistingClientImpl implements RepositoryProvider {

    private String[] clientPath;
    private String tenantCode;

    public RepositoryProviderExistingClientImpl(String tenantCode, String[] clientPath) {
        this.clientPath = clientPath;
        this.tenantCode = tenantCode;
    }

    @Override
    public Repository getRepository() throws Exception {
        try {
            // Build a repository whose work tree is the tenant's local directory.
            File workdir = getFile(tenantCode, clientPath);
            return new FileRepositoryBuilder().setWorkTree(workdir).build();
        } catch (Exception ex) {
            throw new StorageOperateTransformException();
        }
    }

    // Resolves <tenantCode>/<pathComponents...> and creates the directory if it does not exist.
    public File getFile(String tenantCode, String... pathComponents) throws IOException {
        String rootPath = new File(tenantCode).getPath();
        for (String pathComponent : pathComponents) {
            rootPath = rootPath + File.separatorChar + pathComponent;
        }
        File result = new File(rootPath);
        FileUtils.mkdirs(result, true);
        return result;
    }
}

git clone

import java.io.File;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.lib.Repository;

public class RepositoryProviderCloneImpl implements RepositoryProvider {

    private String repoPath;
    private String clientPath;

    public RepositoryProviderCloneImpl(String repoPath, String clientPath) {
        this.repoPath = repoPath;
        this.clientPath = clientPath;
    }

    @Override
    public Repository getRepository() throws Exception {
        // Create the local directory the remote repository will be cloned into.
        File client = new File(clientPath);
        boolean isCreated = client.mkdir();
        if (!isCreated) {
            throw new StorageOperateTransformException("File create failed by using Git!");
        }

        try {
            // Clone the remote repository into the freshly created directory.
            Git result = Git.cloneRepository()
                .setURI(repoPath)
                .setDirectory(client)
                .call();
            return result.getRepository();
        } catch (Exception ex) {
            throw new StorageOperateTransformException(ex.getMessage());
        }
    }
}
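
In addition to obtaining a repository, an upload would also need a commit (and optionally a push) step. The following is a minimal sketch built only on standard JGit calls; the repository path, file name, author, and credentials are illustrative placeholders:

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.transport.UsernamePasswordCredentialsProvider;

public class GitResourceCommitSketch {

    public static void main(String[] args) throws Exception {
        // Open an existing local repository (e.g. one built by RepositoryProviderExistingClientImpl).
        try (Git git = Git.open(new File("/tmp/tenantA/resources"))) {
            // Write the uploaded resource into the work tree (path and content are illustrative).
            Files.write(Paths.get("/tmp/tenantA/resources/udf/demo.sql"), "SELECT 1;".getBytes());

            // Stage and commit the new file.
            git.add().addFilepattern("udf/demo.sql").call();
            git.commit()
                .setMessage("Upload resource udf/demo.sql")
                .setAuthor("tenantA", "tenantA@example.com")
                .call();

            // Optionally push to the remote configured as "origin" (credentials are placeholders).
            git.push()
                .setRemote("origin")
                .setCredentialsProvider(new UsernamePasswordCredentialsProvider("user", "token"))
                .call();
        }
    }
}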

Related issues

Nope

Are you willing to submit a PR?

Code of Conduct

github-actions[bot] commented 2 years ago

Thank you for your feedback. We have received your issue; please wait patiently for a reply.

kaori-seasons commented 2 years ago

We welcome more friends to add scenarios and discuss here. @SbloodyS @xtr1993 could you please discuss it together?

xtr1993 commented 2 years ago

I have implemented the function of using Git to manage code in my project. Here is my business flowchart; hope it helps you: [flowchart image]

kaori-seasons commented 2 years ago

@xtr1993 Thank you very much, I will do some research in the near future.

kaori-seasons commented 2 years ago

After preliminary research, I found that JGit, as a Git client, carries heavy logic, and manually building a repository locally based on Git commands is not very friendly, so I looked for a way to upload and download files based on a REST API.

refer to Rest Interface for git
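
For illustration, here is a minimal sketch of the REST-based approach using the GitHub contents API (PUT /repos/{owner}/{repo}/contents/{path}) and Java's built-in HttpClient; the owner, repo, path, and token below are placeholders, and other Git servers such as GitLab or Gitea expose similar file endpoints:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class GitRestUploadSketch {

    public static void main(String[] args) throws Exception {
        // Placeholders: adjust owner / repo / path / token for the target Git server.
        String owner = "my-org";
        String repo = "resource-center";
        String path = "udf/demo.sql";
        String token = System.getenv("GIT_TOKEN");

        // The contents API expects the file content base64-encoded inside a JSON body.
        String content = Base64.getEncoder()
                .encodeToString("SELECT 1;".getBytes(StandardCharsets.UTF_8));
        String body = "{\"message\":\"Upload " + path + "\",\"content\":\"" + content + "\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.github.com/repos/" + owner + "/" + repo + "/contents/" + path))
                .header("Authorization", "Bearer " + token)
                .header("Accept", "application/vnd.github+json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}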

EricGao888 commented 2 years ago

IMHO, we could separate this issue into two parts:

  1. Add source control ability for resource center.
  2. Enable users to review workflow changes.

For the first one, since the resource center is based on HDFS / S3 / ..., we could add a log file in remote storage, invisible to users, to store the operation log, commit hash, etc., and combine the commit hash with the object name or tag. Using the S3 / HDFS read/write interface to interact with this log file keeps it consistent. In that case, we could enable source control not only for txt / sql / sh files but also for jar / tar files, and avoid exploding the remote Git repo. A minimal sketch of this log-file idea follows.
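
In the sketch below, the RemoteStorage interface is a hypothetical placeholder rather than the actual StorageOperate API; each create / update appends a record tying the commit hash to the object name:

import java.nio.charset.StandardCharsets;
import java.time.Instant;

public class ResourceCommitLogSketch {

    // Hypothetical minimal read/write interface; the real StorageOperate API differs.
    interface RemoteStorage {
        boolean exists(String path) throws Exception;
        byte[] read(String path) throws Exception;
        void write(String path, byte[] data) throws Exception;
    }

    // Hidden log file kept alongside the resources, invisible to users.
    private static final String LOG_PATH = ".resource_commits.log";

    // Appends one "commitHash objectName operation timestamp" record through the same
    // S3 / HDFS read/write interface, keeping the log next to the objects it describes.
    static void recordCommit(RemoteStorage storage, String commitHash,
                             String objectName, String operation) throws Exception {
        String previous = storage.exists(LOG_PATH)
                ? new String(storage.read(LOG_PATH), StandardCharsets.UTF_8)
                : "";
        String line = commitHash + " " + objectName + " " + operation + " " + Instant.now() + "\n";
        storage.write(LOG_PATH, (previous + line).getBytes(StandardCharsets.UTF_8));
    }
}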

For the second one, we could add some kind of mapping function to map workflows into Python DAGs. Users will get different versions of the Python DAGs once they create / update their workflows. Based on that, we could add source control over the Git protocol to enable users to review workflow changes and provide them with a better production experience.

WDYT @complone @xtr1993 @SbloodyS @zhongjiajie @davidzollo

EricGao888 commented 2 years ago

To clarify, the generated Python DAGs mentioned above are only for review purposes; we do not really need to run those DAGs. Therefore, there is no need to change the current code logic, and we may just add this as an assistant feature.
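
A minimal sketch of that mapping, assuming an invented in-memory workflow model and an invented Python-like output syntax (neither pydolphinscheduler nor Airflow), just to show how a reviewable text version could be generated:

import java.util.List;

public class WorkflowToPythonDagSketch {

    // Hypothetical minimal workflow model used only for this sketch.
    record Task(String name, String type, List<String> upstream) {}

    // Renders the workflow as a Python-like script so a Git diff of two versions
    // is readable during review; the generated text is never executed.
    static String render(String workflowName, List<Task> tasks) {
        StringBuilder python = new StringBuilder();
        python.append("# generated for review only, not executed\n");
        python.append("with Workflow(name=\"").append(workflowName).append("\") as wf:\n");
        for (Task task : tasks) {
            python.append("    ").append(task.name())
                  .append(" = Task(name=\"").append(task.name())
                  .append("\", type=\"").append(task.type()).append("\")\n");
        }
        for (Task task : tasks) {
            for (String up : task.upstream()) {
                python.append("    ").append(up).append(" >> ").append(task.name()).append("\n");
            }
        }
        return python.toString();
    }

    public static void main(String[] args) {
        String script = render("demo_workflow", List.of(
                new Task("extract", "SHELL", List.of()),
                new Task("transform", "SQL", List.of("extract")),
                new Task("load", "SQL", List.of("transform"))));
        System.out.println(script);
    }
}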

kaori-seasons commented 2 years ago

@EricGao888 Thank you very much for your reply. For the second point, after the discussion with @davidzollo, most of the demand in the community is for management based on the Git protocol, and the usual scenario is that a new version is produced every time the user modifies the DAG. I will take some time soon to look at how Airflow generates and manages DAGs so that I can design the versioning better.

EricGao888 commented 2 years ago

@complone Hi complone, thanks again for your effort. If you could bring such a feature into DS, it would be fantastic. Instead of understanding the running process of Airflow, I suggest spending some time on the syntax of Airflow DAGs. Actually, we may not need to really run the DAGs generated from workflows; the main purpose is to help users review and give suggestions on changes, and Python DAGs are easier to review than graphs.

kaori-seasons commented 2 years ago

@EricGao888 Thank you for the additional information. During this time, I will read about the DAG data structure in Airflow so that I can discuss it with you later.

EricGao888 commented 2 years ago

@complone FYI, you may also refer to airflow-code-editor to see how workflow-as-code could be integrated with Git. [screenshot]

kaori-seasons commented 2 years ago

Thank you very much for your help; let me take a look first.

davidzollo commented 2 years ago

@zhongjiajie do you have any ideas?

xtr1993 commented 2 years ago

I have implemented this function and have demo code; we can discuss this function together: https://github.com/xtr1993/datacenter-git-client-demo.git

kaori-seasons commented 2 years ago

https://github.com/xtr1993/datacenter-git-client-demo.git

Thank you very much for the Git operation encapsulation logic you provided; I will try to design logic compatible with DolphinScheduler.

ruanwenjun commented 2 years ago

As far as I can see, this issue only talks about adding a new resource center implementation backed by Git.

I didn't see any detailed design for how to manage the resource versions. Do we need to store the resource version in our database?

And this issue doesn't talk about the details of how to store the workflow in Git (the resource center); if we don't store the workflow in Git, how can we review it? cc @davidzollo @caishunfeng

davidzollo commented 2 years ago

Yes, as @EricGao888 said in https://github.com/apache/dolphinscheduler/issues/10387#issuecomment-1166904625, I think splitting this into two issues will be better.