port HCA to use OLS 4

amnonkhen commented 1 year ago

2 HCA components use OLS:

ingest-ui
ingest-validator

options:

1. use public OLS
2. ~upgrade private hosted OLS to use OLS 4~

Option 1 is preferable in terms of maintenance. Amnon and Enrique need to check whether public OLS can be used or what the gap is. See discussion from 10/7 by James, Enrique and Amnon

James' github handle - @udp

Deadline for decommission of public OLS 3 - Oct 2023

amnonkhen commented 1 year ago

I am porting ingest-ui to work with OLS4. There are 2 types of calls:

/select with a term id as the query: example
/select for all terms (*) in an ontology: example

The first one works well. The second one results in an HTTP 500 error. I created ticket EBISPOT/ols4#418 and notified @udp.

gabsie commented 9 months ago

@amnonkhen @Jeena-Rajan This has been bumped up in urgency since the last managed access meeting. We will need to be able to tag HCA projects with DUO codes, which are driven by an ontology. Please provide an estimated completion time; we will probably need this within a few weeks from today. Thanks!

Jeena-Rajan commented 3 months ago

Gabby has mentioned this will be required for schema updates happening within the next 2 weeks.

amnonkhen commented 2 months ago

For ingest-validator, the work would involve:

[x] 1. find OLS api calls used
  1. expand curie code: CurieExpansion.expandCurie()
    - used to find the iri of a term
    - example: /api/search?q=MONDO:0018177&exact=true&groupField=true&queryFields=obo_id
  2. validate code: findChildTerm in GraphRestriction
    - used to check whether a term (q param) is a child of another term (allChildrenOf param) in the given ontologies (ontology param)
    - example: /api/search?q=MONDO:0018177&exact=true&groupField=true&allChildrenOf=http://purl.obolibrary.org/obo/MONDO_0000001&ontology=mondo,efo&queryFields=obo_id
[x] 2. compare the results in OLS3 and OLS4
  - identical inputs yield identical outputs
[x] 3. If there are differences in the required inputs or outputs, change the code accordingly
  - no need according to step 2
[x] 4. Check required terms are available:
  - from ticket ebi-ait/hca-ebi-wrangler-central#844 - gender/visium
    - parent term: NCIT:C17357
    - example terms exist in OLS4:
      - NCIT:C46121
      - NCIT:C180329
[x] 5. change ingest-validator's startup code & configuration to use OLS4 and deploy to dev
  - link to ingest-kube-deployment commit
  - link to ingest-validator commit
  - link to gitlab pipeline
[x] 6. manually test SS2 dataset from the integration tests

This showed a validation error for the biomaterial:

Could not retrieve IRI for HANCESTRO:0004.

I added to the error message the url and it shows as:

Could not retrieve IRI for HANCESTRO:0004. OLS URL: https://www.ebi.ac.uk/ols4/api/search?q=HANCESTRO%3A0004&exact=true&groupField=true&queryFields=obo_id at .human_specific.ethnicity[0].ontology

When I add the OLS response to the log I notice that:

for term HANCESTRO:0004, two documents are returned
for term data:0006, zero documents are returned
- both of these cases result in the Could not retrieve IRI message.

see discussion of these errors in this comment below

[ ] 7. Wranglers to validate that related datasets ingest properly on the dev environment
- @ESapenaVentura & @idazucchi to add a few more datasets here
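
For reference, the two search calls listed in step 1 can be reproduced outside the validator. The following is a minimal Python sketch, not the ingest-validator code: the function names are hypothetical, and the base URL is assumed from the error message quoted above. The query parameters are the ones shown in the examples in step 1.

```python
from urllib.parse import urlencode

# Assumed base URL, copied from the error message quoted in this comment.
OLS4_API = "https://www.ebi.ac.uk/ols4/api"


def curie_search_url(curie, base=OLS4_API):
    """Build the exact obo_id lookup used to expand a CURIE to an IRI."""
    params = {
        "q": curie,
        "exact": "true",
        "groupField": "true",
        "queryFields": "obo_id",
    }
    return f"{base}/search?{urlencode(params)}"


def child_term_search_url(curie, parent_iri, ontologies, base=OLS4_API):
    """Build the lookup that checks whether a term is a child of parent_iri."""
    params = {
        "q": curie,
        "exact": "true",
        "groupField": "true",
        "allChildrenOf": parent_iri,
        "ontology": ",".join(ontologies),
        "queryFields": "obo_id",
    }
    return f"{base}/search?{urlencode(params)}"


if __name__ == "__main__":
    print(curie_search_url("HANCESTRO:0004"))
    print(child_term_search_url(
        "MONDO:0018177",
        "http://purl.obolibrary.org/obo/MONDO_0000001",
        ["mondo", "efo"],
    ))
```

urlencode percent-encodes the colon in the CURIE, so the first URL has the same form as the one in the error message.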

gabsie commented 2 months ago

Hey, @amnonkhen

Thanks for prioritising this. @ESapenaVentura @idazucchi - who can do the testing at the end?

amnonkhen commented 2 months ago

In a meeting between @ESapenaVentura @arschat & @amnonkhen we found the following:

1) data:0006 - This value is missing from OLS4 because the EDAM ontology changed its prefix from data: to EDAM:. The resolution here would be:

- in any active datasets, modify data: values to EDAM:
- in future updates to documents that use data:, validation will fail until the wrangler/contributor fixes the document in the same way
- schema modification to use EDAM: see HumanCellAtlas/metadata-schema#1572

2) HANCESTRO:0004

1. There is an inconsistency between OLS dev and prod. Dev does not return a result from the hancestro ontology, while prod does. This should be communicated to the OLS team. See OLS ticket EBISPOT/ols4#727
2. The returned documents appear identical, while their obo_id fields differ only in case. A workaround for multiple returned documents could be to check whether they all have an identical obo_id (ignoring case) and, if so, treat them as a single document.
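
The obo_id workaround could look roughly like the Python sketch below. It is an illustration under assumptions, not the ingest-validator change: the function name is hypothetical, and each document is assumed to be a dict with obo_id and iri fields. The IRI in the usage example is a placeholder, not the real HANCESTRO IRI.

```python
from typing import Dict, List, Optional


def resolve_iri(docs: List[Dict[str, str]]) -> Optional[str]:
    """Return one IRI when all documents share an obo_id, ignoring case.

    Returns None when no documents are returned or when the documents
    describe different terms; both cases are "could not retrieve IRI".
    """
    if not docs:
        return None
    # Lowercasing collapses documents whose obo_id differs only in case.
    obo_ids = {doc.get("obo_id", "").lower() for doc in docs}
    if len(obo_ids) != 1:
        return None
    return docs[0].get("iri")


if __name__ == "__main__":
    # Placeholder documents; field values are made up for illustration.
    docs = [
        {"obo_id": "HANCESTRO:0004", "iri": "http://example.org/term"},
        {"obo_id": "hancestro:0004", "iri": "http://example.org/term"},
    ]
    print(resolve_iri(docs))
```

The sketch takes the IRI from the first document, which relies on the observation above that the returned documents appear identical apart from the case of obo_id.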

idazucchi commented 2 months ago

As discussed with Enrique, I've uploaded a selection of projects to the dev environment to test the ontology. All of them have errors due to the EDAM change, and a few of them also have errors due to HANCESTRO. I haven't seen any other type of error.

If you need more projects, let me know through Slack please.

Project 1 - b5d05080-5417-4c22-89f8-4a5eff18d1f9
Project 2 - ec30720b-cfe2-424f-9655-a69a132e2883
Project 3 - cab32543-a44e-496e-b889-f1459686e59f
Project 4 - 554e28e7-0b2d-4e69-a117-4e9f45e347b5
Project 5 - a1251330-f57a-4f96-8339-00e77f745e6e
Project 6 - f7bfc34e-1ee2-43eb-bb7f-b3c65bbfaa67
Project 7 - f2600a33-4d20-469a-a5cc-3f9581856e10
Project 8 - c5123642-fb9a-44fb-8ab0-55709f688456
Project 9 - 957d3311-6f02-4167-9f5a-0adfe4441565
Project 10 - 72b36d11-4a2d-49b8-8b54-54e1c7d0e306

amnonkhen commented 2 months ago

After deploying the change to the IRI resolution in ingest-validator that groups documents with identical obo_id together, these projects pass validation. The only errors are due to missing files.

Before this ticket can be finished, the schema PR HumanCellAtlas/metadata-schema#1572 needs to be merged to dev so that the integration test can pass.

amnonkhen commented 3 weeks ago

The changes were deployed to the dev environment successfully, but deployment to prod fails. The gitlab job log shows:

$ helm upgrade --debug -f k8s/apps/$ENVIRONMENT_NAME.yaml $APP_NAME k8s/apps/$APP_NAME --set-string environment=$ENVIRONMENT_NAME,image=$RELEASE_TAG,replicas=$REPLICAS,gitlab_app=$CI_PROJECT_PATH_SLUG,gitlab_env=$CI_ENVIRONMENT_SLUG --wait --install
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /builds/hca/ingest-validator.tmp/KUBECONFIG
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /builds/hca/ingest-validator.tmp/KUBECONFIG
history.go:56: [debug] getting history for release ingest-validator
upgrade.go:123: [debug] preparing upgrade for ingest-validator
upgrade.go:131: [debug] performing update for ingest-validator
upgrade.go:303: [debug] creating upgraded release for ingest-validator
client.go:201: [debug] checking 1 resources for changes
wait.go:53: [debug] beginning wait for 1 resources with timeout of 5m0s
wait.go:244: [debug] Deployment is not ready: staging-environment/ingest-validator. 0 out of 1 expected pods are ready

Then the gitlab job times out.

amnonkhen commented 3 weeks ago

Further inspection of the k8s prod env shows:

❯ kubectl get pods -l app=ingest-validator
NAME                                READY   STATUS              RESTARTS        AGE
ingest-validator-54ff49d58f-f862g   0/1     ContainerCreating   0               48d
ingest-validator-6fd4d775c5-4p88k   1/1     Running             48 (157d ago)   401d

It appears that there is a problem mounting the secret and configmap:

❯ kubectl describe pods -l app=ingest-validator
Name:           ingest-validator-54ff49d58f-f862g
Namespace:      prod-environment
...
Events:
  Type     Reason       Age                     From     Message
  ----     ------       ----                    ----     -------
  Warning  FailedMount  22m (x24140 over 48d)   kubelet  Unable to attach or mount volumes: unmounted volumes=[secret-volume], unattached volumes=[secret-volume kube-api-access-tp2f5]: timed out waiting for the condition
  Warning  FailedMount  3m6s (x34378 over 48d)  kubelet  MountVolume.SetUp failed for volume "secret-volume" : references non-existent secret key: ingest-service-account-auth-info

The secret key ingest-service-account-auth-info is missing.

amnonkhen commented 3 weeks ago

The secret was missing from staging and prod.

[x] install it using:

source ~/dev/ingest-kube-deployment/config/environment_staging
kubectx ingest-eks-staging
make deploy-secrets
kubectl get secrets api-keys -o jsonpath="{.data.ingest-service-account-auth-info}"  
source ~/dev/ingest-kube-deployment/config/environment_prod 
kubectx ingest-eks-prod
make deploy-secrets
kubectl get secrets api-keys -o jsonpath="{.data.ingest-service-account-auth-info}"

[x] deploy to staging and prod
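
The check done with kubectl get secrets above could also be scripted before a deployment. The following Python sketch is an assumption-laden illustration, not part of ingest-kube-deployment: the function name is hypothetical, and it expects the JSON form of the secret (for example from kubectl get secrets api-keys -o json) as input.

```python
import base64
import json


def secret_has_key(secret_json: str, key: str) -> bool:
    """Return True if the Secret JSON has a non-empty value under data[key]."""
    data = json.loads(secret_json).get("data") or {}
    value = data.get(key)
    if not value:
        return False
    # Secret values are base64-encoded; an undecodable value counts as missing.
    try:
        return len(base64.b64decode(value, validate=True)) > 0
    except ValueError:
        return False


if __name__ == "__main__":
    # Made-up secret content for illustration only.
    sample = json.dumps({
        "data": {
            "ingest-service-account-auth-info": base64.b64encode(b"{}").decode()
        }
    })
    print(secret_has_key(sample, "ingest-service-account-auth-info"))
```

Running such a check against staging and prod before the helm upgrade would surface a missing key earlier than the FailedMount events did.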

ebi-ait / dcp-ingest-central

port HCA to use OLS 4 #963