zxenia closed this issue 2 years ago
@zxenia what's the desired format for the CONP database URIs for datasets?
At the moment I can see portal datasets have a URI of the form:
https://portal.conp.ca/dataset?id=projects/PERFORM_Dataset__one_control_subject
How are these generated (e.g. when we have new data come in), and do we want to use this form as the resource ID? Something like just
https://portal.conp.ca/dataset/PERFORM_Dataset__one_control_subject
might be cleaner, but I am not sure if there is a reason not to adopt this.
Adopting these dataset-based resource IDs will also help when updating existing records in Nexus after we change the metadata on our end, as opposed to just deprecating the original version and creating a new record. So it would help to make sure we have an easy way of tying the resource ID / URI unambiguously to a given dataset in the future.
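To make the comparison concrete, here is a small standard-library sketch of the two candidate forms (the dataset id and the query-string URL are the ones quoted above; the path-based form is only the proposal in this comment, not a live portal route):

```python
from urllib.parse import urlencode, quote, urlparse, parse_qs

PORTAL = "https://portal.conp.ca"
dataset_id = "projects/PERFORM_Dataset__one_control_subject"  # example from this thread

# Current form: the dataset id is carried in a query parameter.
# safe="/" keeps the slash in "projects/..." unescaped, matching the portal URL.
query_form = f"{PORTAL}/dataset?" + urlencode({"id": dataset_id}, safe="/", quote_via=quote)

# Proposed cleaner form: the dataset id embedded in the URL path (hypothetical route).
path_form = f"{PORTAL}/dataset/" + dataset_id.removeprefix("projects/")

# The query form round-trips unambiguously back to the dataset id:
parsed = parse_qs(urlparse(query_form).query)
assert parsed["id"] == [dataset_id]

print(query_form)
print(path_form)
```

Either way, the important property for Nexus is that the URL can be derived from (and resolved back to) the dataset id without ambiguity.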
Hi @surchs
I think we will need to stick with these URIs (https://portal.conp.ca/dataset?id=projects/PERFORM_Dataset__one_control_subject)
because that is how the CONP portal route for a single dataset is designed. If we have those URIs in Nexus, then the search results will be immediately resolvable to a dataset's detail page.
Having https://portal.conp.ca/dataset/PERFORM_Dataset__one_control_subject
would probably require changes to the URL path, and if we change it, we would also have to make the old URLs resolve to the new ones.
> How are these generated
At first I thought that a dataset's database id (or path) was generated from its title prefixed with projects/
- the title split into words joined by dashes or underscores.
But I cross-checked and that is not the case; the ids look arbitrary, judging from the examples below:
title: BigBrain dataset - 3D Classified Volumes (derived dataset)
url: /dataset?id=projects/BigBrain_3DClassifiedVolumes
title: Multicenter Single Subject Human MRI Phantom
url: /dataset?id=projects/multicenter-phantom
@cmadjar Do you know how a dataset's id is generated in the database?
@zxenia I think the dataset ID is taken from the submodule name of the dataset, which you can find in the .gitmodules file of the conp-dataset repo (see the example of bigbrain-datalad: that is the name of the git submodule in .gitmodules, but the folder name under projects is BigBrain).
Works for the examples you mentioned as well, BTW.
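The lookup described above can be sketched with the standard library's configparser, assuming a hypothetical .gitmodules excerpt shaped like the bigbrain-datalad example (the real file lists many more submodules, and the exact section names are an assumption here):

```python
import configparser

# Hypothetical excerpt of conp-dataset's .gitmodules, illustrating that the
# submodule name (the dataset ID) and the checkout path can differ.
GITMODULES = """
[submodule "bigbrain-datalad"]
    path = projects/BigBrain
    url = https://github.com/CONP-PCNO/bigbrain-datalad.git
"""

config = configparser.ConfigParser()
config.read_string(GITMODULES)

modules = {}
for section in config.sections():
    # The section header holds the submodule name, e.g. 'submodule "bigbrain-datalad"'.
    name = section.split('"')[1]
    modules[name] = config[section]["path"]

print(modules)
```

This is only a sketch of where the IDs come from; the portal itself presumably reads the same file when registering datasets.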
@cmadjar ok got it now, thank you!
@cmadjar and @zxenia: I had a chat with Adeel from the Nexus team today:
Having the @id be the CONP portal URL is tricky. The main reason here is that Nexus does not allow the re-use of an ID, even when the original resource is deleted / deprecated. This means that if a dataset is added on CONP (and pushed to Nexus), then removed from CONP (with the removal propagated to Nexus as a deprecation), and then added again to CONP (i.e. with the same submodule name), the original web portal URL could no longer be used as the Nexus ID.
There are two options to deal with this:

1. Keep a Nexus-side resource ID (possibly shortened from https://reservoir.global/v1/resources/.../UUID to something like conp.ca/UUID), add a separate attribute that holds the portal URL (e.g. conp_portal_url = https://conp.ca/someurl), and then expose this in our search queries. Given our naming convention, this can be done as part of the Nexus upload workflow by Adeel (probably good). Alternatively, we can add it as part of the JSON-LD conversion step (probably less good).
2. Use a soft delete: records carry a flag (e.g. softDelete=False/0) that we simply filter for in our queries, with deprecated records having it set to True.

I feel like this is something worth discussing in the dev call, so I'd bring this up this week. In the meantime, Adeel will update his ingestion pipeline and we'll take another look at the data that ends up in Nexus. My own tests on Nexus had worked well a couple of weeks ago, so I'd expect this to work without issues. Once we have solved the ID question, Adeel can then just turn the extended SDO data live and have the CONP SPARQL endpoint query them directly.
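A toy sketch of the two options, using plain dicts in place of Nexus resources; the attribute names conp_portal_url and softDelete are the hypothetical names from this thread, not confirmed Nexus fields:

```python
# Option 1: keep an opaque, never-reused Nexus @id, and carry the portal URL
# in a separate attribute that search queries can expose.
record_v1 = {
    "@id": "https://reservoir.global/v1/resources/.../UUID",  # Nexus-generated
    "conp_portal_url": "https://portal.conp.ca/dataset?id=projects/PERFORM_Dataset__one_control_subject",
}

# Option 2: use the portal URL as @id and never hard-deprecate; removals
# only flip a flag, so the ID can be "revived" when the dataset comes back.
records_v2 = [
    {"@id": "https://portal.conp.ca/dataset?id=projects/PERFORM_Dataset__one_control_subject",
     "softDelete": False},
    {"@id": "https://portal.conp.ca/dataset?id=projects/some_removed_dataset",
     "softDelete": True},
]

def visible(records):
    """The filter every query would need under option 2."""
    return [r for r in records if not r["softDelete"]]

print(visible(records_v2))
```

The maintenance cost of option 2 is exactly that `visible`-style filter: every canned query has to remember to apply it, which is part of what made option 1 attractive.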
@surchs Thanks! Yes, it would be good to discuss the ID issue on a call. I think we need to look into the soft-delete option, but if it requires too much maintenance, then adding the attribute in DATS that holds the URL is a good option too.
At the dev call, the decision was made to go with option 1. Once it's done and the data is updated in Nexus, I will need to adjust the example SPARQL queries so that they return a CONP-portal-resolvable URL for each dataset.
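The query change this implies might look roughly like the following; the conp: prefix and predicate IRI are assumptions for illustration, since later comments only refer to the attribute as conp_url:

```python
# Before: the canned query returns the Nexus resource IRI, which resolves
# to the JSON representation rather than the portal page.
OLD_QUERY = """\
PREFIX sdo: <https://schema.org/>
SELECT ?dataset ?name
WHERE {
  ?dataset a sdo:Dataset ;
           sdo:name ?name .
}
"""

# After: additionally select the attribute holding the portal URL
# (hypothetical conp: namespace and predicate name).
NEW_QUERY = """\
PREFIX sdo: <https://schema.org/>
PREFIX conp: <https://portal.conp.ca/vocab/>
SELECT ?dataset ?name ?conp_url
WHERE {
  ?dataset a sdo:Dataset ;
           sdo:name ?name ;
           conp:conp_url ?conp_url .
}
"""

print(NEW_QUERY)
```

Only the projection and one triple pattern change; the rest of each canned query stays as-is.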
Thanks @zxenia. I will let you know once that is done and then we can discuss changing the queries.
Update (28/10/2021):
In the meantime, we can work on #506
Just to update here: I had checked in with Adeel and the Nexus portal remains inaccessible. He'll let me know once that changes and I'll update here.
Thank you for the update :)
Hey, good news. Adeel let me know that Nexus is up again. I'll have a chat with him regarding our updated conp_url attribute and make sure the canned queries are running well.
I met with Adeel, checked the ingested data, and changed the canned queries (see linked PR). This should take care of this issue.
@surchs I guess this could be closed now that the PR is merged, correct?
Yes @cmadjar, this is now also taken care of!
Purpose
The data in Nexus needs to be updated, since the context mapping to schema.org was improved. There is also possible data duplication in Nexus for the PREVENT-AD registered dataset, which needs to be investigated.
Ideally, this update will also include a migration to proper dataset identifiers: from a UUID to a stable identifier from the CONP database. This improvement will allow users to be redirected from Nexus search results to an individual dataset page. Currently, Nexus results contain resolvable dataset ids that redirect to the Nexus JSON representation of the dataset.