zxenia closed this issue 2 years ago
@zxenia what's the desired format for the CONP database URIs for datasets?
At the moment I can see portal datasets have a URI of the form:
https://portal.conp.ca/dataset?id=projects/PERFORM_Dataset__one_control_subject
How are these generated (e.g. when we have new data come in), and do we want to use this form as the resource ID? Something like just
https://portal.conp.ca/dataset/PERFORM_Dataset__one_control_subject
might be cleaner, but I am not sure if there is a reason not to adopt this.
Adopting these dataset-based resource IDs will also help when updating existing records in Nexus after we change the metadata on our end, as opposed to just deprecating the original version and creating a new record. So it would help to make sure we have an easy way of tying the resource ID / URI unambiguously to a given dataset in the future.
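To make the comparison concrete, here is a small standard-library sketch of the two candidate forms (the dataset id and the query-string URL are the ones quoted above; the path-based form is only the proposal in this comment, not a live portal route):

```python
from urllib.parse import urlencode, quote, urlparse, parse_qs

PORTAL = "https://portal.conp.ca"
dataset_id = "projects/PERFORM_Dataset__one_control_subject"  # example from this thread

# Current form: the dataset id is carried in a query parameter.
# safe="/" keeps the slash in "projects/..." unescaped, matching the portal URL.
query_form = f"{PORTAL}/dataset?" + urlencode({"id": dataset_id}, safe="/", quote_via=quote)

# Proposed cleaner form: the dataset id embedded in the URL path (hypothetical route).
path_form = f"{PORTAL}/dataset/" + dataset_id.removeprefix("projects/")

# The query form round-trips unambiguously back to the dataset id:
parsed = parse_qs(urlparse(query_form).query)
assert parsed["id"] == [dataset_id]

print(query_form)
print(path_form)
```

Either way, the important property for Nexus is that the URL can be derived from (and resolved back to) the dataset id without ambiguity.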
Hi @surchs
I think we will need to stick with these URIs (https://portal.conp.ca/dataset?id=projects/PERFORM_Dataset__one_control_subject)
because that is how the CONP portal route for a single dataset is designed. If we have those URIs in Nexus, then the search results will be immediately resolvable to a dataset's detail page.
Having https://portal.conp.ca/dataset/PERFORM_Dataset__one_control_subject
would probably require changes to the URL path, and if we change it, we would also have to make the old URLs resolve to the new ones.
> How are these generated
At first I thought that a dataset's database id (or path) was generated from its title prefixed with projects/
- the title split into words joined by dashes or underscores.
But I cross-checked and that is not the case; the ids look arbitrary, judging from the examples below:
title: BigBrain dataset - 3D Classified Volumes (derived dataset)
url: /dataset?id=projects/BigBrain_3DClassifiedVolumes
title: Multicenter Single Subject Human MRI Phantom
url: /dataset?id=projects/multicenter-phantom
@cmadjar Do you know how a dataset's id is generated in the database?
@zxenia I think the dataset ID is taken from the submodule name of the dataset, which you can find in the .gitmodules file of the conp-dataset repo (see the example of bigbrain-datalad: that is the name of the git submodule in .gitmodules, but the folder name under projects is BigBrain).
Works for the examples you mentioned as well, BTW.
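The lookup described above can be sketched with the standard library's configparser, assuming a hypothetical .gitmodules excerpt shaped like the bigbrain-datalad example (the real file lists many more submodules, and the exact section names are an assumption here):

```python
import configparser

# Hypothetical excerpt of conp-dataset's .gitmodules, illustrating that the
# submodule name (the dataset ID) and the checkout path can differ.
GITMODULES = """
[submodule "bigbrain-datalad"]
    path = projects/BigBrain
    url = https://github.com/CONP-PCNO/bigbrain-datalad.git
"""

config = configparser.ConfigParser()
config.read_string(GITMODULES)

modules = {}
for section in config.sections():
    # The section header holds the submodule name, e.g. 'submodule "bigbrain-datalad"'.
    name = section.split('"')[1]
    modules[name] = config[section]["path"]

print(modules)
```

This is only a sketch of where the IDs come from; the portal itself presumably reads the same file when registering datasets.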
@cmadjar ok got it now, thank you!
@cmadjar and @zxenia: I had a chat with Adeel from the Nexus team today:
Having the @id be the CONP portal URL is tricky. The main reason here is that Nexus does not allow the re-use of an ID, even when the original resource is deleted / deprecated. This means that if a dataset is added on CONP (and pushed to Nexus), then removed from CONP (with the removal propagated to Nexus as a deprecation), and then added again to CONP (i.e. with the same submodule name), the original web portal URL could no longer be used as the Nexus ID.
There are two options to deal with this:

1. Keep a Nexus-side resource ID (possibly shortened from https://reservoir.global/v1/resources/.../UUID to something like conp.ca/UUID), add a separate attribute that holds the portal URL (e.g. conp_portal_url = https://conp.ca/someurl), and then expose this in our search queries. Given our naming convention, this can be done as part of the Nexus upload workflow by Adeel (probably good). Alternatively, we can add it as part of the JSON-LD conversion step (probably less good).
2. Use a soft delete: records carry a flag (e.g. softDelete=False/0) that we simply filter for in our queries, with deprecated records having it set to True.

I feel like this is something worth discussing in the dev call, so I'd bring this up this week. In the meantime, Adeel will update his ingestion pipeline and we'll take another look at the data that ends up in Nexus. My own tests on Nexus had worked well a couple of weeks ago, so I'd expect this to work without issues. Once we have solved the ID question, Adeel can then just turn the extended SDO data live and have the CONP SPARQL endpoint query them directly.
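A toy sketch of the two options, using plain dicts in place of Nexus resources; the attribute names conp_portal_url and softDelete are the hypothetical names from this thread, not confirmed Nexus fields:

```python
# Option 1: keep an opaque, never-reused Nexus @id, and carry the portal URL
# in a separate attribute that search queries can expose.
record_v1 = {
    "@id": "https://reservoir.global/v1/resources/.../UUID",  # Nexus-generated
    "conp_portal_url": "https://portal.conp.ca/dataset?id=projects/PERFORM_Dataset__one_control_subject",
}

# Option 2: use the portal URL as @id and never hard-deprecate; removals
# only flip a flag, so the ID can be "revived" when the dataset comes back.
records_v2 = [
    {"@id": "https://portal.conp.ca/dataset?id=projects/PERFORM_Dataset__one_control_subject",
     "softDelete": False},
    {"@id": "https://portal.conp.ca/dataset?id=projects/some_removed_dataset",
     "softDelete": True},
]

def visible(records):
    """The filter every query would need under option 2."""
    return [r for r in records if not r["softDelete"]]

print(visible(records_v2))
```

The maintenance cost of option 2 is exactly that `visible`-style filter: every canned query has to remember to apply it, which is part of what made option 1 attractive.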
@surchs Thanks! Yes, it would be good to discuss the ID issue on a call. I think we need to look into the soft-delete option, but if it requires too much maintenance, then adding the attribute in DATS that holds the URL is a good option too.
At the dev call, the decision was made to go with option 1. Once it's done and the data is updated in Nexus, I will need to adjust the example SPARQL queries so that they return a CONP-portal-resolvable URL for each dataset.
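The query change this implies might look roughly like the following; the conp: prefix and predicate IRI are assumptions for illustration, since later comments only refer to the attribute as conp_url:

```python
# Before: the canned query returns the Nexus resource IRI, which resolves
# to the JSON representation rather than the portal page.
OLD_QUERY = """\
PREFIX sdo: <https://schema.org/>
SELECT ?dataset ?name
WHERE {
  ?dataset a sdo:Dataset ;
           sdo:name ?name .
}
"""

# After: additionally select the attribute holding the portal URL
# (hypothetical conp: namespace and predicate name).
NEW_QUERY = """\
PREFIX sdo: <https://schema.org/>
PREFIX conp: <https://portal.conp.ca/vocab/>
SELECT ?dataset ?name ?conp_url
WHERE {
  ?dataset a sdo:Dataset ;
           sdo:name ?name ;
           conp:conp_url ?conp_url .
}
"""

print(NEW_QUERY)
```

Only the projection and one triple pattern change; the rest of each canned query stays as-is.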
Thanks @zxenia. I will let you know once that is done and then we can discuss changing the queries.
Update (28/10/2021):
In the meantime, we can work on #506
Just to update here: I had checked in with Adeel and the Nexus portal remains inaccessible. He'll let me know once that changes and I'll update here.
Thank you for the update :)
Hey, good news. Adeel let me know that Nexus is up again. I'll have a chat with him regarding our updated conp_url attribute and make sure the canned queries are running well.
I met with Adeel, checked the ingested data, and changed the canned queries (see linked PR). This should take care of this issue.
@surchs I guess this could be closed now that the PR is merged, correct?
Yes @cmadjar, this is now also taken care of!
Purpose
The data in Nexus needs to be updated, since the context mapping to schema.org was improved. There is also possible data duplication in Nexus for the PREVENT-AD registered dataset, which needs to be investigated.
Ideally, this update will also include a migration to proper dataset identifiers: from a UUID to a stable identifier from the CONP database. This improvement will allow users to be redirected from Nexus search results to an individual dataset page. Currently, Nexus results contain resolvable dataset ids that redirect to the Nexus JSON representation of the dataset.