common-workflow-language / cwlviewer

A web application to view and share Common Workflow Language workflows
https://view.commonwl.org/
Apache License 2.0

errors getting raw url as part of RO Bundle for not GitHub / GitLab repos #488

Open mr-c opened 1 year ago

mr-c commented 1 year ago

{'url': 'https://gitlab.bsc.es/lrodrig1/structuralvariants_poc.git', 'branch': '1.0.7', 'path': 'structuralvariants/cwl/subworkflows/bwa_index.cwl'}

2022-12-31 16:30:23,369 ERROR [task-4] org.commonwl.view.researchobject.ROBundleService: Could not pack workflow when creating Research Object: While fetching https://gitlab.bsc.es/lrodrig1/structuralvariants_poc.git, got content-type of 'text/html'. Expected one of ['text/plain', 'application/json', 'text/vnd.yaml', 'text/yaml', 'text/x-yaml', 'application/x-yaml', 'application/octet-stream'].
ERROR Tool definition failed validation:
https://gitlab.bsc.es/lrodrig1/structuralvariants_poc.git:5:17: mapping values are not allowed here

org.commonwl.view.cwl.CWLValidationException: While fetching https://gitlab.bsc.es/lrodrig1/structuralvariants_poc.git, got content-type of 'text/html'. Expected one of ['text/plain', 'application/json', 'text/vnd.yaml', 'text/yaml', 'text/x-yaml', 'application/x-yaml', 'application/octet-stream'].
ERROR Tool definition failed validation:
https://gitlab.bsc.es/lrodrig1/structuralvariants_poc.git:5:17: mapping values are not allowed here

    at org.commonwl.view.cwl.CWLTool.runCwltoolOnWorkflow(CWLTool.java:121)
    at org.commonwl.view.cwl.CWLTool.getPackedVersion(CWLTool.java:60)
    at org.commonwl.view.researchobject.ROBundleService.createBundle(ROBundleService.java:204)
    at org.commonwl.view.researchobject.ROBundleFactory.createWorkflowRO(ROBundleFactory.java:80)
    at org.commonwl.view.researchobject.ROBundleFactory$$FastClassBySpringCGLIB$$c15d1fdc.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763)
    at org.springframework.aop.interceptor.AsyncExecutionInterceptor.lambda$invoke$0(AsyncExecutionInterceptor.java:115)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

This is due to cwlviewer not detecting that https://gitlab.bsc.es is a GitLab-based host:

https://github.com/common-workflow-language/cwlviewer/blob/91a06b908b75247691b9d1b04c42c747a4353baa/src/main/java/org/commonwl/view/git/GitDetails.java#L98

https://github.com/common-workflow-language/cwlviewer/blob/91a06b908b75247691b9d1b04c42c747a4353baa/src/main/java/org/commonwl/view/git/GitDetails.java#L195-L196

A harder example: {'url': 'https://git.wur.nl/unlock/cwl.git', 'branch': 'master', 'path': 'cwl/workflows/workflow_indexbuilder.cwl'} (also a GitLab-based host, but without "gitlab" in the host name)
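For reference, here is a minimal sketch of the raw URL one would expect for these examples if the host were treated as GitLab. It assumes the `/-/raw/<branch>/<path>` layout used by current GitLab versions (older instances may differ). `gitlab_raw_url` is a hypothetical helper, not the code in GitDetails.java.

```python
from urllib.parse import quote


def gitlab_raw_url(url: str, branch: str, path: str) -> str:
    """Build a GitLab-style raw file URL from a clone URL (hypothetical helper)."""
    base = url[:-4] if url.endswith(".git") else url
    base = base.rstrip("/")
    # quote() leaves "/" untouched by default, so nested paths stay intact.
    return f"{base}/-/raw/{quote(branch)}/{quote(path)}"


# The first example from this issue, in the same dict form as above.
example = {
    "url": "https://gitlab.bsc.es/lrodrig1/structuralvariants_poc.git",
    "branch": "1.0.7",
    "path": "structuralvariants/cwl/subworkflows/bwa_index.cwl",
}
print(gitlab_raw_url(**example))
```

The log above shows the `.git` clone URL being fetched instead of a URL of this shape, which is why an HTML page comes back.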

This raw URL is needed to pack the workflow; why aren't we using the local git checkout? https://github.com/common-workflow-language/cwlviewer/blob/91a06b908b75247691b9d1b04c42c747a4353baa/src/main/java/org/commonwl/view/researchobject/ROBundleService.java#L203-L204

mr-c commented 1 year ago

Perhaps GitLab style hosts could be detected by a well known path or API?

kinow commented 1 year ago

Perhaps GitLab style hosts could be detected by a well known path or API?

Oh, that sounds like an interesting problem. Let me check if I can find a way to tell whether a URL is GitHub, BitBucket, or GitLab (I have an idea on how to find it :grimacing: )

kinow commented 1 year ago

Alright, my first idea flopped. I remembered that in Jenkins you could use GitLab, BitBucket, or GitHub. I thought they had already found a way to identify the server for a given URL, but it looks like they only identify the cloud versions (i.e. github.com/*, gitlab.com/*, and bitbucket.org/*).

kinow commented 1 year ago

My second idea was to identify the repository based on refs. Pull requests generate a refs/pull/$ID, and merge requests generate something like refs/merge-requests/$ID. I think in BitBucket it's something else, like refs/pull-requests/$ID.

But if you have pull/merge requests disabled, or if you have no open requests, then I believe the git client won't list anything. I had a look at git show-ref but couldn't find a way to rely on refs to identify the repo.
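A rough sketch of the refs idea, mainly to show its limitation. It assumes the ref names have already been listed (for example with `git ls-remote`) and that the prefixes are the ones mentioned above. `guess_host_from_refs` is a hypothetical name.

```python
from typing import Iterable, Optional

# Ref prefixes as described above; the BitBucket one is the least certain.
REF_PREFIXES = {
    "refs/merge-requests/": "gitlab",
    "refs/pull-requests/": "bitbucket",
    "refs/pull/": "github",
}


def guess_host_from_refs(refs: Iterable[str]) -> Optional[str]:
    """Return a server type if any ref carries a request prefix, else None."""
    for ref in refs:
        for prefix, host_type in REF_PREFIXES.items():
            if ref.startswith(prefix):
                return host_type
    # No pull/merge request refs at all: nothing can be concluded.
    return None
```

A repository with requests disabled, or with none ever opened, falls into the `None` branch, which is the case that makes this approach unreliable.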

Maybe we could query for commits?

In a GitHub repository, the URL will be something like: https://<host>/<org>/<repo>/commits/master. In a GitLab repository that will be https://<host>/<org>/<repo>/-/commits/master. In BitBucket it's https://<host>/<org>/<repo>/commits/branch/master. So in theory a curl -I and a check for status 200 could be used to identify the server type of a given repository URL?

I think the logic would be to first check the host name for GitHub.com, GitLab.com, or BitBucket.org. If that fails, then curl for this commits URL. Finally throw an error as we couldn't identify the server type.
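The logic described here could be sketched roughly as follows. This is only an illustration: `detect_server_type` and `head_ok` are hypothetical names, the commits URL layouts are the ones listed above, and the probe is injectable so the decision logic can be exercised without network access.

```python
import urllib.request
from typing import Callable
from urllib.parse import urlsplit

KNOWN_HOSTS = {
    "github.com": "github",
    "gitlab.com": "gitlab",
    "bitbucket.org": "bitbucket",
}

# Commits-page layouts from the comment above. The most specific layouts are
# tried first, because the GitHub-style path is the most generic one.
COMMITS_TEMPLATES = {
    "gitlab": "{base}/-/commits/{branch}",
    "bitbucket": "{base}/commits/branch/{branch}",
    "github": "{base}/commits/{branch}",
}


def head_ok(url: str) -> bool:
    """Send a HEAD request (like `curl -I`) and report whether it ends in a 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError and timeouts
        return False


def detect_server_type(repo_url: str, branch: str,
                       probe: Callable[[str], bool] = head_ok) -> str:
    base = repo_url[:-4] if repo_url.endswith(".git") else repo_url
    base = base.rstrip("/")
    host = (urlsplit(base).hostname or "").lower()
    # Step 1: well-known cloud hosts need no network round trip.
    if host in KNOWN_HOSTS:
        return KNOWN_HOSTS[host]
    # Step 2: probe the commits URL of each server type.
    for server_type, template in COMMITS_TEMPLATES.items():
        if probe(template.format(base=base, branch=branch)):
            return server_type
    # Step 3: give up.
    raise ValueError(f"Could not identify the server type of {repo_url}")
```

One caveat: `urlopen` follows redirects, so a host that redirects unknown paths to a login page answering 200 would produce a false positive. A real implementation would probably need a stricter check.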

WDYT @mr-c?

mr-c commented 1 year ago

WDYT @mr-c?

Github is only github.com (I know of no other public installations); likewise for bitbucket. Therefore it is just self-hosted GitLab that needs detecting; so maybe try https://hostname/api/v4/projects (which doesn't require a token) and use a valid response as an indicator?
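A small sketch of that check, under the assumption that a "valid response" means HTTP 200 with a JSON list in the body. Both function names are hypothetical, and the HTTP request itself is left out so the decision can be shown in isolation.

```python
import json
from urllib.parse import urlsplit


def projects_api_url(repo_url: str) -> str:
    """Build https://<hostname>/api/v4/projects for the host of repo_url."""
    parts = urlsplit(repo_url)
    return f"{parts.scheme}://{parts.netloc}/api/v4/projects"


def is_valid_projects_response(status: int, body: str) -> bool:
    """Treat a 200 response whose body parses as a JSON list as a GitLab signal."""
    if status != 200:
        return False
    try:
        return isinstance(json.loads(body), list)
    except ValueError:  # json.JSONDecodeError is a ValueError
        return False
```

For the first example in this issue, `projects_api_url` yields `https://gitlab.bsc.es/api/v4/projects`.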

kinow commented 1 year ago

Github is only github.com (I know of no other public installations);

At NIWA we thought about the enterprise option, but it was too expensive at the time (some of the code was for-profit, or mixed). Unis and not for profit had a special price for the enterprise if I recall? I remember we had access to the silver plan as we were a research institution (at that time only silver gave private repos, nowadays everybody has access to it), so that's what we got.

Now I think besides the big FAANG companies, not for profit and some unis might have github enterprise installed, e.g.

But I think we can skip it and implement it later if needed, especially as I am not sure if any of these unis host public repositories (in GitLab I can choose whether my projects are public/private/internal, no idea about github enterprise).

likewise for bitbucket.

A telco I worked with briefly in New Zealand used the BitBucket server (I think other NZ companies used it due to Atlassian being from Aus - the complete Atlassian suite with confluence/jira/bitbucket/bamboo/etc wasn't very expensive some time ago).

Therefore it is just self-hosted GitLab that needs detecting; so maybe try https://hostname/api/v4/projects (which doesn't require a token) and use a valid response as an indicator?

Oh, using the API is a good idea, didn't think about that one. Could be too. Not sure if everybody is on v4. I assume when a v5 is available, v4 will keep working too (? or maybe users can enable/disable older api versions?), so this might be a good idea.

https://mmb.irbbarcelona.org/gitlab/gelpi/CMIP is a public GitLab project, but https://mmb.irbbarcelona.org/gitlab/api/v4/projects redirects to the sign up page. Now, if you try the v3... that works :thinking:

I think they may be using an older version of GitLab? Note, however, that v2 does not work :smile:

My URL /-/commits/master also appears to be V4 only, as that also doesn't work for that irbbarcelona.org repo :smile:

mr-c commented 1 year ago

TIL! I thought all hosted or on-premise GitHub/Bitbucket services were private

Oh, using the API is a good idea, didn't think about that one. Could be too. Not sure if everybody is on v4. I assume when a v5 is available, v4 will keep working too (? or maybe users can enable/disable older api versions?), so this might be a good idea.

https://mmb.irbbarcelona.org/gitlab/gelpi/CMIP is a public GitLab project, but https://mmb.irbbarcelona.org/gitlab/api/v4/projects redirects to the sign up page. Now, if you try the v3... that works :thinking:

I think they may be using an older version of GitLab? Note, however, that v2 does not work :smile:

My URL /-/commits/master also appears to be V4 only, as that also doesn't work for that irbbarcelona.org repo :smile:

There could be enough signal even in "failed" attempts. For example, curl -v https://mmb.irbbarcelona.org/gitlab/api/v4/projects shows that a _gitlab_session cookie is set, even though it redirects.

For GitLab detection, I suggest trying a variety of endpoints, checking for a GitLab cookie, a valid response, or another signal.
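A hedged sketch of what "a variety of endpoints" and "signals" might look like. Everything here is illustrative: the function names are hypothetical, and walking the URL's path prefixes is an assumption added to cover instances served under a sub-path such as /gitlab. The fetch itself is left out. Whatever client performs it should not follow redirects silently, or the cookie set on the redirect response may be missed.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit

API_SUFFIXES = ("/api/v4/projects", "/api/v3/projects")


def candidate_api_urls(repo_url: str) -> List[str]:
    """List API URLs to try, from the host root down through each path prefix."""
    parts = urlsplit(repo_url)
    root = f"{parts.scheme}://{parts.netloc}"
    segments = [s for s in parts.path.split("/") if s]
    prefixes = [""]
    # Skip the last two segments (owner and repository). With GitLab subgroups
    # this yields a few extra, harmless candidates.
    for segment in segments[:-2]:
        prefixes.append(f"{prefixes[-1]}/{segment}")
    return [f"{root}{p}{suffix}" for p in prefixes for suffix in API_SUFFIXES]


def gitlab_signals(status: int, headers: Iterable[Tuple[str, str]]) -> List[str]:
    """Collect hints that a response came from GitLab, even a 'failed' one."""
    signals = []
    is_json = False
    has_cookie = False
    for name, value in headers:
        lowered = name.lower()
        if lowered == "set-cookie" and value.lstrip().startswith("_gitlab_session="):
            has_cookie = True
        elif lowered == "content-type" and value.lower().startswith("application/json"):
            is_json = True
    if has_cookie:
        signals.append("session cookie")
    if status == 200 and is_json:
        signals.append("valid API response")
    return signals
```

Any non-empty result from `gitlab_signals` for any candidate URL would count as a match. Which signals are strong enough on their own is something to tune later.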

kinow commented 1 year ago

For GitLab detection, I suggest trying a variety of endpoints, checking for a GitLab cookie, a valid response, or another signal.

Sounds good to me! We can then iterate and improve based on user feedback.