MaastrichtU-IDS / d2s-argo-workflows

⚠️ DEPRECATED: Argo workflows to transform structured data to a target RDF using Data2Services Docker modules
http://d2s.semanticscience.org/
MIT License

Integrate DQA pipeline with Argo #1

Closed vemonet closed 4 years ago

vemonet commented 4 years ago

Descriptive statistics

We should adapt and reuse the new implementation at https://github.com/MaastrichtU-IDS/d2s-scripts-repository/tree/master/sparql/compute-hcls-stats — I think we should just properly integrate those queries into the workflow.
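For context, the HCLS descriptive statistics are plain SPARQL queries run against the endpoint; a minimal sketch in that spirit (the actual queries in the compute-hcls-stats repo are more complete):

```sparql
# Illustrative HCLS-style statistic: number of triples per named graph
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
```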

Fairsharing

Just a dockerized Python script: https://github.com/MaastrichtU-IDS/fairsharing-metrics. New API in dev: https://github.com/FAIRsharing/FAIRsharing-API

RDFUnit

https://github.com/AKSW/RDFUnit

It validates the full SPARQL endpoint, which is slow; we might need to split the validation by graphs.
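Splitting the validation by graphs could look like the sketch below. The graph URIs and the `-g` flag are assumptions, not from the repo (check the RDFUnit CLI help for the real option name); the script only prints the commands by default:

```shell
# Sketch: run one RDFUnit validation per named graph instead of hitting the
# whole endpoint at once. GRAPHS and the "-g" flag are assumptions.
ENDPOINT="http://sparql.wikipathways.org/sparql"
GRAPHS="http://example.org/graph/a http://example.org/graph/b"
DRY_RUN=${DRY_RUN:-1}  # keep 1 to only print the commands

for g in $GRAPHS; do
  cmd="docker run --rm -v /data/dqa-workspace:/data aksw/rdfunit:latest -d $ENDPOINT -g $g -f /data -o ttl"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd"
  else
    $cmd
  fi
done
```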

ShEx

https://github.com/hsolbrig/PyShEx (https://github.com/iovka/shex-java would be an alternative)

IMHO we should build a layer over PyShEx to validate an exhaustive subset of the KG (which could be extracted using the HCLS statistics).
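For illustration, a tiny ShEx shape of the kind one might validate with PyShEx; the prefixes and shape are hypothetical, not from the repo:

```shex
# Illustrative shape: every pathway node must have a string title
# and an IRI-valued organism
PREFIX ex: <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

ex:PathwayShape {
  ex:title xsd:string ;
  ex:organism IRI
}
```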

vemonet commented 4 years ago

Started here: https://github.com/MaastrichtU-IDS/d2s-argo-workflows/blob/master/dqa-workflow-argo.yaml

vemonet commented 4 years ago

Run the workflow:

argo submit dqa-workflow-argo.yaml -f support/config-dqa-pipeline.yml
vemonet commented 4 years ago

Running exactly the same Docker image with the same parameters works in pure Docker:

docker run --rm -it -v /data/dqa-workspace:/data aksw/rdfunit:latest -d http://sparql.wikipathways.org/sparql -f /data -s "https://www.w3.org/2012/pyRdfa/extract?uri=http://vocabularies.wikipathways.org/wp#" -o ttl

But it gives this error when run via Argo:

[ERROR] No plugin found for prefix 'exec' in the current project and in the plugin groups [org.apache.maven.plugins, org.codehaus.mojo] available from the repositories [local (/root/.m2/
repository), central (https://repo.maven.apache.org/maven2)] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/NoPluginFoundForPrefixException

According to this issue it means that the java exec plugin is missing from the pom.xml: https://stackoverflow.com/questions/34770106/no-plugin-found-for-prefix-exec-in-the-current-project-and-in-the-plug-in-grou
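Following that answer, the fix would be to declare the plugin in the pom.xml; a minimal sketch (the version is just an example, not what RDFUnit uses):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>exec-maven-plugin</artifactId>
      <version>1.6.0</version>
    </plugin>
  </plugins>
</build>
```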

See these 2 pom.xml files, for the whole project and for the validate module:

The question is: why is this plugin not missing when doing a simple docker run, but fails when running through Argo? One likely difference is the working directory: Maven resolves the exec prefix from the pom.xml in the current directory, and Argo sets its own workdir for the container.

Overall, the RDFUnit Docker container (and its pom.xml) does not seem appropriate, so we would need to rewrite how it compiles (the only hard part will be making sure the two poms compile well).

vemonet commented 4 years ago

Issue submitted to the RDFUnit repo: https://github.com/AKSW/RDFUnit/issues/98

The Pod definition I use for the test: https://github.com/MaastrichtU-IDS/d2s-argo-workflows/blob/cd8b1432940595e6ff52b7efaa339f5d653aa609/tests/test-devnull-pod.yaml

Commands to run the test pod and connect to it (from the d2s-argo-workflows repo):

kubectl create -f tests/test-devnull-pod.yaml
kubectl exec -it test-devnull-pod -- /bin/bash

Documented here (for info): https://maastrichtu-ids.github.io/dsri-documentation/docs/openshift-debug

vemonet commented 4 years ago

I fixed RDFUnit to be packaged as a standalone jar in the RDFUnit Docker container. So using it from any path with any workdir set will work (available at https://hub.docker.com/repository/docker/umids/rdfunit )
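A sketch of what such standalone-jar packaging can look like in a multi-stage Dockerfile; the paths, module name, and jar name are assumptions, not the actual umids/rdfunit build:

```dockerfile
# Build stage: package RDFUnit as a single standalone jar (assumes the pom
# produces a shaded/assembly jar; module and jar names are illustrative)
FROM maven:3-jdk-8 AS build
COPY . /rdfunit
WORKDIR /rdfunit
RUN mvn -pl rdfunit-validate -am package -DskipTests

# Run stage: a standalone jar with an absolute path works from any workdir,
# which is what the Argo-set working directory was breaking before
FROM openjdk:8-jre-slim
COPY --from=build /rdfunit/rdfunit-validate/target/rdfunit-validate-standalone.jar /rdfunit.jar
ENTRYPOINT ["java", "-jar", "/rdfunit.jar"]
```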

What has been done:

See commits: