SciCatProject / scicat-backend-next

SciCat Data Catalogue Backend
https://scicatproject.github.io/documentation/
BSD 3-Clause "New" or "Revised" License

Add field for capturing dataset links #689

Open dylanmcreynolds opened 1 year ago

dylanmcreynolds commented 1 year ago

Background

SciCat allows users to add thumbnails (stored in MongoDB) to each dataset that is ingested. This allows facilities to provide some very rudimentary visualization alongside datasets.

There is also a "Jupyter" button on datasets that can only be configured to take the user to a particular Jupyter(Hub) instance. This cannot be coded to take users to a particular notebook, nor is there a method to associate a notebook with a particular dataset once opened.

We get a lot of requests from users and researchers to provide a variety of web-based tools that relate directly to datasets. These can be pages that display visualizations, pages that provide analysis tools (including AI/ML), and Jupyter notebooks on a particular JupyterHub instance. In all cases, we would want to take the user to the page with the dataset in context.

Proposal

I can imagine a lot of solutions in SciCat. We could start maintaining visualizations on a technique/instrument basis within SciCat, but the wide variety of instruments, techniques and tools used across the SciCat facilities makes this a truly daunting task.

Instead, I propose moving the task of creating links to the dataset ingestors. I propose adding a new field called links to the dataset models (raw and derived). This would be a list of link objects, each containing a URL, a display name and a type, something like:


"links": [
  {"type": "analysis", "display": "Reconstruction Notebook", "link": "http://magrathea.org/jupyter/a/b/c/notebook.ipynb?scicat_dataset=abc123"},
  {"type": "visualization", "display": "Raw Data Visualization", "link": "http://magrathea.org/planet_viz?scicat_dataset=abc123"},
  {"type": "segmentation", "display": "Fjord Segmentation Application", "link": "http://magrathea.org/fjord_seg?scicat_dataset=abc123"}
]
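
For illustration, an entry of this shape could be typed and checked at ingest time roughly as follows. This is a minimal TypeScript sketch, not existing SciCat code: the field names `type`, `display` and `link` come from the example above, while `DatasetLinkEntry`, `isDatasetLinkEntry` and the http(s)-only rule are assumptions made for the example.

```typescript
// Shape of one entry in the proposed `links` list (field names from the example above).
interface DatasetLinkEntry {
  type: string; // free-form category, e.g. "analysis" or "visualization"
  display: string; // label shown to the user
  link: string; // absolute URL that carries the dataset in context
}

// Hypothetical ingest-time check: all three fields are strings and
// `link` starts with http:// or https:// (an assumed rule, not a SciCat one).
function isDatasetLinkEntry(value: unknown): value is DatasetLinkEntry {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  const type = candidate["type"];
  const display = candidate["display"];
  const link = candidate["link"];
  if (typeof type !== "string" || typeof display !== "string" || typeof link !== "string") {
    return false;
  }
  return /^https?:\/\//.test(link);
}
```

An ingestor could run such a check on each entry before sending the dataset, so that malformed links are rejected early rather than stored.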

The frontend could then display these items as href links. We might even think about adding another optional embed field to each link that lets the frontend decide whether to embed the web page in an iframe tag, so that the visualization opens when you view the dataset.

If this were approved by the community, I could put it into the backend now, then follow up with a PR to the front end to display the links.

nitrosx commented 1 year ago

@dylanmcreynolds this would be really useful, thumbs up from me. Let's discuss it at the next meeting.

I would also include an icon to be shown. The equivalent JSON would be:

"links": [
  {
    "type": "analysis",
    "user_name": "Reconstruction Notebook",
    "url": "http://magrathea.org/jupyter/a/b/c/notebook.ipynb?scicat_dataset=<dataset.pid>",
    "icon": "<path_to_the_icon_file>"
  },
 ...
]

On second thought, why do we need the type?

dylanmcreynolds commented 1 year ago

@bpedersen2 @bolmsten @nitrosx @mkywall

We talked about this design in the developers' meeting today, and agreed that it was too simple. Leaving links to get stale in datasets will create future headaches.

The consensus seemed to be to configure links as a data structure that allows rules to be applied to each dataset. So, the new idea is to NOT put links in datasets at all. We debated (with no current consensus) making it a JSON configuration on the server, or possibly another collection in MongoDB. Either way, we would add fields to each link that can be matched against each dataset.

Additionally, we'll now need a templating mechanism so that we can put a reference to the dataset into the link that we create. I propose that we have two fields, datasetSelectors and urlTemplate:

"links": [
  {
    "datasetSelectors": {
      "technique": ["earth observation", "mouse analysis"],
      "datasetType": ["raw"],
      "instrument": ["earth"]
    },
    "display": "Reconstruction Notebook",
    "urlTemplate": "http://magrathea.org/jupyter/a/b/c/notebook.ipynb?scicat_dataset={datasetPID}",
    "icon": "<path_to_the_icon_file>"
  }
]

So, with that in mind, the Dataset's GET endpoint would perform something like the following:

For each link in links
  For each field in datasetSelectors
    if the field matches one of the corresponding items in the dataset
      replace any template fields
      add to the list of links to display
Add list of links to the dataset in a `links` field.
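
The steps above could be sketched in TypeScript roughly as follows. This is only an illustration under several assumptions, not the backend's implementation. A link is included only when every selector matches, since the pseudocode leaves open whether any or all selectors must match. Selector names are looked up directly as dataset fields. `{datasetPID}` is the only placeholder that gets filled in, and the PID is URL-encoded. The names `LinkConfig`, `ResolvedLink`, `fieldMatches`, `renderTemplate` and `resolveLinks` are hypothetical.

```typescript
// One configured link rule; field names follow the example above.
interface LinkConfig {
  datasetSelectors: Record<string, string[]>;
  display: string;
  urlTemplate: string;
  icon?: string;
}

// What the GET endpoint would attach to the dataset in its `links` field.
interface ResolvedLink {
  display: string;
  url: string;
  icon?: string;
}

type Dataset = Record<string, unknown>;

// True when the dataset value (a string or an array of strings)
// contains at least one of the allowed values.
function fieldMatches(value: unknown, allowed: string[]): boolean {
  const values: unknown[] = Array.isArray(value) ? value : [value];
  return values.some((v) => typeof v === "string" && allowed.includes(v));
}

// Replace {name} placeholders with URL-encoded values;
// placeholders without a known value are left untouched.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match: string, name: string) => {
    const value = vars[name];
    return value === undefined ? match : encodeURIComponent(value);
  });
}

// Assumption: a link applies only if ALL of its selectors match the dataset.
function resolveLinks(dataset: Dataset, configs: LinkConfig[]): ResolvedLink[] {
  const pid = String(dataset["pid"] ?? "");
  return configs
    .filter((cfg) =>
      Object.entries(cfg.datasetSelectors).every(([field, allowed]) =>
        fieldMatches(dataset[field], allowed),
      ),
    )
    .map((cfg) => ({
      display: cfg.display,
      url: renderTemplate(cfg.urlTemplate, { datasetPID: pid }),
      icon: cfg.icon,
    }));
}
```

Because the links are computed on each read rather than stored with the dataset, changing a rule takes effect for all matching datasets at once, which addresses the stale-link concern raised in the meeting.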

Questions:

bpedersen2 commented 1 year ago
  • [ ] Are techniques by name or pid? Same question with instruments.

I think configuration by name should be preferred, as it is much more readable.

  • [ ] Should any field in the datasets schema be available for matching in the selectors?

If possible without too many problems, yes. That allows a much more fine-grained configuration. Otherwise, it would be good to have at least datasetlifecycle (and its sub-fields) and a field (which is not yet present) for the file type.

  • [ ] Should the link selector be a server json file configuration or a catalog in the database?

Both should work.

file:

nitrosx commented 1 year ago

Regarding the techniques, currently in datasets, techniques are stored as the following structure:

{
  "pid": "<Technique-ID>",
  "name": "<common-technique-name>"
}

Given that techniques are defined by external ontologies, I think we should maintain the same definitions in links. By the way, this is the ontology that we use for techniques in the photon and neutron community: https://bioportal.bioontology.org/ontologies/PANET
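
One way to reconcile readable names in the configuration with ontology identifiers on the dataset would be to let a selector value match either one. The following TypeScript sketch assumes the `pid`/`name` structure shown above. `Technique` and `techniqueMatches` are hypothetical names, and accepting both forms is an assumption for illustration rather than something that was agreed.

```typescript
// Technique entry as stored on a dataset (structure shown above).
interface Technique {
  pid: string;
  name: string;
}

// True when any technique on the dataset is listed in the selector,
// either by its ontology pid or by its common name.
function techniqueMatches(techniques: Technique[], allowed: string[]): boolean {
  return techniques.some((t) => allowed.includes(t.pid) || allowed.includes(t.name));
}
```

With this, a configuration could list "earth observation" for readability, while a stricter site could list the ontology identifier instead.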

dylanmcreynolds commented 1 year ago

@bpedersen2 well said! The thing that makes me most nervous from your list is requiring a server restart to catch changes. I can imagine that coding external apps and configuring these links are on an independent timeline from server maintenance. But that's fairly far into the future. I say let's keep it simple for now, while we develop and learn.

nitrosx commented 9 months ago

@dylanmcreynolds regarding the configuration of links, we could implement it the same way the functional accounts are currently implemented. If a functional account in the configuration is new, it is created in the database. If it is already present, nothing happens. If any changes need to happen before the next system reboot, we can make them directly in the database. Of course, this flexibility shifts to the instance managers the responsibility of propagating the live changes back to the configuration if needed.
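
The create-if-absent behaviour described here could look roughly like the following TypeScript sketch. It uses an in-memory map as a stand-in for a database collection. The `id` field and the names `StoredLinkConfig`, `LinkStore`, `InMemoryLinkStore` and `seedLinks` are all hypothetical and are not taken from the existing functional-accounts code.

```typescript
// A configured link rule, identified by a hypothetical `id`.
interface StoredLinkConfig {
  id: string;
  display: string;
  urlTemplate: string;
}

// Minimal store abstraction; a real version would wrap a database collection.
interface LinkStore {
  has(id: string): boolean;
  insert(link: StoredLinkConfig): void;
}

// In-memory stand-in used only for this example.
class InMemoryLinkStore implements LinkStore {
  private readonly items = new Map<string, StoredLinkConfig>();

  has(id: string): boolean {
    return this.items.has(id);
  }

  insert(link: StoredLinkConfig): void {
    this.items.set(link.id, link);
  }

  get(id: string): StoredLinkConfig | undefined {
    return this.items.get(id);
  }
}

// At startup: create entries that are new in the configuration and
// leave existing ones alone, so live edits made in the store survive.
function seedLinks(store: LinkStore, fromConfig: StoredLinkConfig[]): string[] {
  const created: string[] = [];
  for (const link of fromConfig) {
    if (!store.has(link.id)) {
      store.insert(link);
      created.push(link.id);
    }
  }
  return created;
}
```

The trade-off is the one noted above: an entry edited live in the store is never overwritten by the file, so the file has to be updated by hand to stay in sync.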