Support creation of unlimited graph types

gothub commented 4 years ago

@mbjones here are proposed changes to the quality engine to support generation/retrieval of any number of assessment graph types for a set of data (portal, member node, all of DataONE).

The quality engine should allow the creation of any number of graphs for a set of metadata. For example, for the metadata associated with a DataONE portal (i.e., the collectionQuery pids), any number of different assessment graphs should be created and made available when this portal is updated. The current list of desired graphs is "monthly", "cumulative", and "check-analysis", but there could be many more.

The current REST endpoints to create and retrieve a graph for a portal are shown here with example curl commands:

curl -X POST 'https://docker-ucsb-4.dataone.org:30443/quality/scores?id=urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22&suite=FAIR-suite-0.3.1'

curl -X GET -H "Accept: image/png"  'https://docker-ucsb-4.dataone.org:30443/quality/scores?id=urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22&suite=FAIR-suite-0.3.1'

Note that the 'id' currently can either be a portal series id (urn:uuid:*) or a node id (e.g. urn:node:CA_OPC).

The quality engine and API should be extended to support these additional parameters for retrieval:

format=
- for example: '&format=eml', which would filter the input data based on any EML format type
graphType=
- for example: '&graphType=check-analysis'
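
A minimal sketch of how a client might build a retrieval URL using the two proposed parameters. The host, id, and suite come from the curl examples above; `format` and `graphType` are only the parameters proposed in this issue and are not assumed to exist in the deployed service. `build_retrieval_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base endpoint taken from the curl examples above.
BASE_URL = "https://docker-ucsb-4.dataone.org:30443/quality/scores"


def build_retrieval_url(identifier, suite, graph_type=None, metadata_format=None):
    """Build a GET URL; the two optional parameters are the proposed additions."""
    params = [("id", identifier), ("suite", suite)]
    if metadata_format is not None:
        params.append(("format", metadata_format))   # proposed, e.g. "eml"
    if graph_type is not None:
        params.append(("graphType", graph_type))     # proposed, e.g. "check-analysis"
    # urlencode percent-encodes reserved characters such as ':' in the id.
    return BASE_URL + "?" + urlencode(params)


url = build_retrieval_url(
    "urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22",
    "FAIR-suite-0.3.1",
    graph_type="check-analysis",
    metadata_format="eml",
)
print(url)
```

Omitting both optional arguments yields a URL with only `id` and `suite`, which matches the current retrieval request.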

The request to generate data and a graph should not include the type of graph to create, as all known graph types and variations should be created and made available for retrieval, based on only the id and suite.

The scripts that generate each graph type could follow a naming convention, so that the quality engine could automatically run them when they are added to the quality engine.
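
One way the naming-convention idea could work is sketched below. The convention itself (`graph_<graphType>.R`) and the function name are assumptions made for illustration; the issue does not specify a pattern, a script language, or a directory layout.

```python
import tempfile
from pathlib import Path

# Hypothetical convention: each generator script is named "graph_<graphType>.R".
SCRIPT_PREFIX = "graph_"
SCRIPT_SUFFIX = ".R"


def discover_graph_scripts(script_dir):
    """Map graph type -> script path for every file that follows the convention."""
    scripts = {}
    for path in sorted(Path(script_dir).glob(f"{SCRIPT_PREFIX}*{SCRIPT_SUFFIX}")):
        graph_type = path.name[len(SCRIPT_PREFIX):-len(SCRIPT_SUFFIX)]
        scripts[graph_type] = path
    return scripts


# Demonstration with throwaway files; a real engine would scan its own script directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("graph_monthly.R", "graph_cumulative.R", "graph_check-analysis.R", "README.md"):
        (Path(tmp) / name).touch()
    found = discover_graph_scripts(tmp)
    graph_types = sorted(found)
    print(graph_types)
```

With this approach, adding a new graph type would only require dropping a correctly named script into the directory; files that do not match the pattern are ignored.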

mbjones commented 4 years ago

Thanks @gothub. This is great. I think we could tweak a few details to improve it. Here are a few questions and comments:

1) Why is there a URI for generating a graph? Wouldn't all graphs be generated whenever needed, typically on first creation of a collection, or on update of a suite, or on a timed schedule via a queue? Seems like another process should control queuing up these graph generation jobs, and not a REST URI. I am also sitting here thinking about whether we should generalize it to correspond to a process/script to be run that might do various analytical tasks, and produce some sort of well-defined output like a graph, but not limited to a graph. As I think about this, the pattern converges on Clowder more and more.
2) The format filter is a bit fuzzy. Wouldn't it be best to call it formatId, and have it be a repeatable list of formats to include (ORed together)? Also, do we really need format at all -- wouldn't it be best to create a collection with the relevant datasets filtered (e.g., by formatId)? Then we wouldn't have to treat format differently at all.
3) Do we want the URL to include "quality", given our discussion of how that word is loaded? Can we come up with a better service collection name? /assessments/? /runs/? /results/? Something else?
4) The graphType looks good, and will probably work. We could consider renaming it to productType, or even incorporating it directly into the resource URI, which would be the more RESTful way to encode this. It also gets rid of the content negotiation, and makes it much easier for clients to request (it's hard, for example, to set the Accept header in a browser URL bar). For example, an alternative URI form could be:

Overall pattern: GET /assessments/{suite}/{identifier}/{productType}
GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/cumulative
GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/monthly
GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/check-analysis
GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/score-csv
GET /assessments/FAIR-suite-0.3.1/urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22/check-csv

That is a more RESTful pattern, where the suite is treated as a collection, the data collection comes second, and the product type is the resource that's available for that collection. It also unifies the graph and csv retrieval to get rid of the difficult Accept header, and opens the door to new product types that are neither graphs nor csv files, like PDFs.
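
A minimal sketch of building and parsing paths in this pattern. The helper names are hypothetical, and percent-encoding of '/' is an assumption added here: identifiers such as DOIs can contain a slash, which would otherwise be read as an extra path segment. The DOI below is borrowed from elsewhere in this thread purely as a sample identifier.

```python
from urllib.parse import quote, unquote

PREFIX = "assessments"  # service collection name suggested in this thread


def product_path(suite, identifier, product_type):
    """Build /assessments/{suite}/{identifier}/{productType}."""
    # Keep ':' readable (it is legal in a path segment) but encode '/' so an
    # identifier such as a DOI cannot be mistaken for extra path segments.
    segments = [quote(part, safe=":") for part in (suite, identifier, product_type)]
    return "/" + "/".join([PREFIX, *segments])


def parse_product_path(path):
    """Split a path built by product_path back into its three components."""
    parts = path.split("/")
    if len(parts) != 5 or parts[0] != "" or parts[1] != PREFIX:
        raise ValueError(f"not an assessment product path: {path!r}")
    suite, identifier, product_type = (unquote(part) for part in parts[2:])
    return suite, identifier, product_type


path = product_path("FAIR-suite-0.3.1", "urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22", "cumulative")
print(path)
doi_path = product_path("FAIR-suite-0.3.1", "doi:10.18739/A2RB6W25S", "check-csv")
print(doi_path)
```

Because the product type is the last path segment, a new product (a PDF, for instance) would need only a new segment value rather than a new Accept header or query parameter.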

I'd like to get feedback from @csjx and others on this as well. Let's discuss.

gothub commented 4 years ago

@mbjones thx for the review - here are some thoughts on the points you raised:

Using a REST endpoint to generate an assessment (graph, data) retains the possibility of having DataONE MNs/CN queue requests (via metacat) when metadata or portal documents are created/updated, which was part of the original design. I've diverged from this graph a bit, as all requests (for generation or retrieval) are routed through metadig-controller. However, currently, the metadig-scheduler container is the only entity sending generation requests. These requests are based on the harvesting tasks, which watch for new/updated metadata and portal documents.
Regarding the format filter, it seemed a bit cumbersome to have to specify every formatId for a desired graph. For example, for EML it would be '&formatId=eml://ecoinformatics.org/eml-2.0.0&formatId=eml://ecoinformatics.org/eml-2.0.1...' vs. '&format=eml' for the entire EML format family. If a requirement is to be able to retrieve different graphs based on filtering of formatId, then there needs to be a way for a client to specify what filter was applied, e.g. "give me all assessments for ISO metadata". Regarding the generation request, the engine could create graphs for all pre-defined filters, i.e. one graph for only EML content, one for ISO, and one with no filters applied, so no filter would need to be specified for the request.
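
To make the "format family" idea concrete, here is a rough sketch in which a short alias such as `eml` stands for every formatId sharing a prefix. The prefix table and function names are assumptions for illustration; the two EML formatIds are the ones quoted above, and the third sample value is included only as a non-EML entry.

```python
# Hypothetical family table: an alias matches any formatId starting with one of its prefixes.
FORMAT_FAMILIES = {
    "eml": ("eml://ecoinformatics.org/eml-",),
}


def in_family(format_id, family):
    """Return True if format_id belongs to the named family."""
    return format_id.startswith(FORMAT_FAMILIES[family])


def filter_by_family(format_ids, family=None):
    """Keep formatIds in the given family; family=None means no filter applied."""
    if family is None:
        return list(format_ids)
    return [f for f in format_ids if in_family(f, family)]


sample = [
    "eml://ecoinformatics.org/eml-2.0.0",
    "eml://ecoinformatics.org/eml-2.0.1",
    "http://www.isotc211.org/2005/gmd",  # sample non-EML formatId
]
eml_only = filter_by_family(sample, "eml")
unfiltered = filter_by_family(sample)
print(eml_only)
```

A client would then send the alias (`format=eml`) instead of repeating every member formatId, and the engine could pre-generate one product per alias plus one unfiltered product.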

mbjones commented 4 years ago

Thanks @gothub let's discuss next week with @csjx

gothub commented 3 years ago

@mbjones @csjx when would be a good time to discuss/enumerate the range of products that need to be generated and retrievable, and how that is represented in the API?

The current potential list of product types:

graphs
- scores as cumulative average
- scores aggregated by month
- check-analysis: all checks for a suite, with each check failure/success percent, grouped by category (i.e. F,A,I,R)
- check-analysis: optional/required checks summarized by category
- other graphs TBD
data files (CSV)
- scores as cumulative average
- scores aggregated by month
- check-analysis:
  - each line contains
    - check_id, check_name, check_type, check_level, status, data_source, pid, obsoletes, obsoleted_by, sequence_id
    - e.g.: "resource.creatorIdentifier.present.1", "Resource Creator Identifier Present", "Findable", "REQUIRED", "FAILURE", "urn:node:ARCTIC", "doi:10.18739/A2RB6W25S", NA, "urn:uuid:8cdb22c6-cb33-4553-93ca-acb6f5d53ee4", "urn:uuid:8cdb22c6-cb33-4553-93ca-acb6f5d53ee4"
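
As an illustration of the check-analysis CSV layout, the sketch below reads lines in the field order listed above into dictionaries. Treating an unquoted `NA` as a missing value is an assumption, as are the function and constant names.

```python
import csv
import io

# Field order as listed in the check-analysis description above.
FIELDS = [
    "check_id", "check_name", "check_type", "check_level", "status",
    "data_source", "pid", "obsoletes", "obsoleted_by", "sequence_id",
]


def parse_check_rows(text):
    """Parse check-analysis CSV lines into dicts keyed by FIELDS."""
    # skipinitialspace tolerates the blank that follows each comma in the example.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for values in reader:
        if not values:
            continue  # skip empty lines
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        cleaned = [value.strip() for value in values]
        # Assumption: the literal NA marks a missing value.
        rows.append({name: (None if value == "NA" else value)
                     for name, value in zip(FIELDS, cleaned)})
    return rows


example = (
    '"resource.creatorIdentifier.present.1", "Resource Creator Identifier Present", '
    '"Findable", "REQUIRED", "FAILURE", "urn:node:ARCTIC", "doi:10.18739/A2RB6W25S", '
    'NA , "urn:uuid:8cdb22c6-cb33-4553-93ca-acb6f5d53ee4", '
    '"urn:uuid:8cdb22c6-cb33-4553-93ca-acb6f5d53ee4"\n'
)
rows = parse_check_rows(example)
print(rows[0]["check_id"], rows[0]["status"])
```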

Each of these products can be generated for or filtered by the following:

- for all of DataONE, or an MN, or a collection (portal)
- graph including one or multiple metadata formats (EML, ISO, DataCite, schema.org)
- for a specified assessment suite
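
Taken together, these dimensions define a grid of products to generate. The sketch below enumerates that grid under stated assumptions: the product type names are the path segments suggested earlier in the thread, the format aliases (`eml`, `iso`) and the `ProductJob` structure are hypothetical, and how "all of DataONE" would be identified is left open.

```python
from itertools import product
from typing import NamedTuple, Optional


class ProductJob(NamedTuple):
    """One product to generate (hypothetical structure)."""
    suite: str
    identifier: str               # portal series id or node id
    format_family: Optional[str]  # None means no format filter
    product_type: str


# Product type names as suggested earlier in the thread.
PRODUCT_TYPES = ("cumulative", "monthly", "check-analysis", "score-csv", "check-csv")
# Hypothetical pre-defined filters: unfiltered, EML only, ISO only.
FORMAT_FAMILIES = (None, "eml", "iso")


def enumerate_jobs(suite, identifiers):
    """List every (identifier, filter, product type) combination for one suite."""
    return [
        ProductJob(suite, identifier, family, product_type)
        for identifier, family, product_type
        in product(identifiers, FORMAT_FAMILIES, PRODUCT_TYPES)
    ]


jobs = enumerate_jobs(
    "FAIR-suite-0.3.1",
    ["urn:node:ARCTIC", "urn:uuid:06a249da-f9c8-4946-8be4-ac80883bbd22"],
)
print(len(jobs))
```

The job count grows multiplicatively with each dimension, which is worth keeping in mind when deciding how many pre-defined filters to generate eagerly.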

NCEAS / metadig-engine

Support creation of unlimited graph types #268