Sage-Bionetworks / sage-monorepo

Where OpenChallenges, Schematic, and other Sage open source apps are built
https://sage-bionetworks.github.io/sage-monorepo/
Apache License 2.0

[Epic] Develop the challenge data ingress workflow for read-only REST API #1144

Open tschaffter opened 1 year ago

tschaffter commented 1 year ago

The goal of this Epic is to design and develop the challenge data ingress workflow that we will use to populate the challenge registry until our UI and REST API support write operations; those write features are not planned to be developed before 2023 Q2.

We need to identify how to perform the following operations:

  1. Collect challenge data from Challenge organizations
  2. Validate the challenge data submitted
  3. Push the challenge data submitted to the challenge registry

Tasks

tschaffter commented 1 year ago

Architecture

Pushing data to the challenge registry

Let's start with how we could push data to the registry.

There are two reasons we won't use the REST API to push data to the database for this epic:

Note: Rong was generating the JSON files from CSV data.

I made sure we could build the DB CLI after moving it to the monorepo, but it has not been maintained and has since been "disconnected" (by renaming its project.json to project.json.off). I will take care of reconnecting the DB CLI to the monorepo to make sure that it can be built and run.

Legacy DB CLI: https://github.com/Sage-Bionetworks/rocc-db-client
DB CLI in the monorepo: apps/challenge-registry/db-cli

The DB CLI was developed to push data to a MongoDB instance. Since then, we have switched to a MariaDB (SQL) instance. The DB CLI must therefore be updated to push data to the MariaDB instance of the challenge registry project.

The legacy DB CLI includes different seeds (JSON data) that we were pushing to the MongoDB instance.

https://github.com/Sage-Bionetworks/rocc-db-client/tree/main/data/seeds
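As a rough illustration of the "JSON seed to MariaDB" step the updated DB CLI would need to perform, here is a minimal Python sketch that reads one of those seed files and inserts its records into a MariaDB instance. The table name, column names, and connection settings are illustrative assumptions, not the actual registry schema.

```python
# Minimal sketch of the JSON-seed-to-MariaDB step the updated DB CLI could
# perform. Table name, column names, and connection settings are illustrative
# assumptions, not the actual challenge registry schema.
import json

import mysql.connector  # assumes the mysql-connector-python package


def push_seed(seed_path: str) -> None:
    """Read a JSON seed file and insert its records into MariaDB."""
    with open(seed_path) as f:
        challenges = json.load(f)

    conn = mysql.connector.connect(
        host="localhost",
        user="maria",
        password="changeme",
        database="challenge_registry",
    )
    cursor = conn.cursor()
    for challenge in challenges:
        cursor.execute(
            "INSERT INTO challenge (slug, name, status) VALUES (%s, %s, %s)",
            (challenge["slug"], challenge["name"], challenge["status"]),
        )
    conn.commit()
    cursor.close()
    conn.close()


if __name__ == "__main__":
    push_seed("data/seeds/challenges.json")
```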

Validating the challenge data

There are two ways of validating the data.

  1. Develop a Schema.org/JSON-LD schema and use it to validate the JSON data.
  2. Rely on the constraints of the MariaDB instance when pushing the data, e.g. pushing a number to a VARCHAR column will result in an error.

Ideally we should use both approaches. Developing and sharing a Schema.org/JSON-LD schema of the data included in the registry is a must down the road. Moreover, the pages of the challenge registry can now expose JSON-LD data. The Schema.org/JSON-LD schema would largely contribute to the JSON-LD metadata that we will ultimately embed in the pages.

We can use the Schema.org/JSON-LD schema in a GitHub workflow to validate the data submitted by a contributor without needing a running instance of the MariaDB (faster, easier to validate).
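Here is a minimal sketch of what that CI validation step could look like, using a plain JSON Schema and the Python jsonschema package. The schema and its fields (slug, name, status) are illustrative assumptions; the actual Schema.org/JSON-LD schema remains to be defined.

```python
# Minimal sketch of a schema-based validation step a GitHub workflow could run
# without a MariaDB instance. The schema below is an illustrative assumption,
# not the actual Schema.org/JSON-LD schema we would develop.
import json
import sys

from jsonschema import Draft7Validator  # assumes the jsonschema package

CHALLENGE_SCHEMA = {
    "type": "object",
    "required": ["slug", "name", "status"],
    "properties": {
        "slug": {"type": "string", "pattern": "^[a-z0-9]+(-[a-z0-9]+)*$"},
        "name": {"type": "string", "minLength": 1},
        "status": {"enum": ["upcoming", "active", "completed"]},
    },
}


def validate_file(path: str) -> bool:
    """Return True if every record in the JSON file passes the schema."""
    with open(path) as f:
        challenges = json.load(f)
    validator = Draft7Validator(CHALLENGE_SCHEMA)
    valid = True
    for i, challenge in enumerate(challenges):
        for error in validator.iter_errors(challenge):
            print(f"{path}[{i}]: {error.message}")
            valid = False
    return valid


if __name__ == "__main__":
    # Validate every file passed on the command line; exit non-zero so the
    # CI job fails when any file is invalid.
    results = [validate_file(path) for path in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```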

Collecting the challenge data

The advantage of JSON data is that we can validate it with a Schema.org/JSON-LD schema. It is also easy to manipulate JSON data programmatically. The question is whether we work only with JSON files or whether we also want to support a spreadsheet format (referred to here as CSV).

The CSV format could be more human-friendly for external contributors - and potentially for us too, since we already have data in CSV format - to input their data. Note that this is the approach the FAIR Data workstream has developed to enable Data Coordinating Centers (DCCs) to push their data. Given the short amount of time that separates us from the release of the private preview, we may not be able to set up the tools developed by the FAIR Data workstream (DCC Validator, DCC Curator, etc.). Yet we could contact Milen to get insights into the CSV-to-JSON conversion if we want to support this feature.
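For reference, here is a minimal sketch of the CSV-to-JSON conversion step. The column names (slug, name, status) are hypothetical; the real layout would come from the spreadsheet template we would design, possibly with input from the FAIR Data workstream.

```python
# Minimal sketch of converting contributor-provided CSV data into the JSON
# seed format used by the DB CLI. Column names are illustrative assumptions.
import csv
import json


def csv_to_json(csv_path: str, json_path: str) -> None:
    """Convert a CSV file of challenges into a JSON seed file."""
    with open(csv_path, newline="") as f:
        challenges = [
            {"slug": row["slug"], "name": row["name"], "status": row["status"]}
            for row in csv.DictReader(f)
        ]
    with open(json_path, "w") as f:
        json.dump(challenges, f, indent=2)


if __name__ == "__main__":
    csv_to_json("challenges.csv", "challenges.json")
```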

Additional consideration:

Given the above considerations, I think that it's important to make it as easy as possible for organizations to submit their challenge data. Using a spreadsheet is likely the best approach when the data collection is done manually.

Here is how we could proceed:

  1. We develop first the JSON to DB workflow.
  2. We meet with Milen to share with him our data schema and get feedback on how we would write the Schema.org/JSON-LD schema (one per table?) and generate a Google Spreadsheet.
  3. (Optional) I believe that such a spreadsheet would already include validation rules. If that is not the case, or if updating the validation rules would be too difficult (e.g. following a change we made to the schema), we could set up a GitHub workflow that regularly gets the CSV data, converts it to JSON format, and then validates it. We could then open a PR where the contributor can check the status of the validation.

Aggregating challenge data

We could push the data from different contributors to the DB instance one by one. However, it would be preferable that challenges that have been previously submitted keep their ID in the database. This is critical if we expose the DB record IDs as suggested in this app route proposal and use them to identify resources in the app routes. This would be less of an issue if we were relying on an immutable natural key (e.g. the slug for a challenge or the login for a user/organization).
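One possible way to keep previously submitted challenges stable across re-imports is to upsert on the immutable slug, sketched below. The table and column names are illustrative assumptions, and the slug column would need a UNIQUE key for this to work.

```python
# Minimal sketch of an idempotent import keyed on the immutable slug, so that
# a challenge submitted again keeps its existing database ID. Table and column
# names are illustrative assumptions; the slug column must have a UNIQUE key.
import mysql.connector  # assumes the mysql-connector-python package


def upsert_challenge(cursor, challenge: dict) -> None:
    """Insert a new challenge or update the existing row matching its slug."""
    cursor.execute(
        """
        INSERT INTO challenge (slug, name, status)
        VALUES (%(slug)s, %(name)s, %(status)s)
        ON DUPLICATE KEY UPDATE name = VALUES(name), status = VALUES(status)
        """,
        challenge,
    )


if __name__ == "__main__":
    conn = mysql.connector.connect(
        host="localhost",
        user="maria",
        password="changeme",
        database="challenge_registry",
    )
    cur = conn.cursor()
    upsert_challenge(
        cur, {"slug": "awesome-challenge", "name": "Awesome Challenge", "status": "active"}
    )
    conn.commit()
    conn.close()
```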

We need to further discuss how we want to aggregate data and how we want to update them. Milen, Bruce and Jay may also provide valuable feedback.

tschaffter commented 1 year ago

@rrchai @vpchung This diagram shows the architecture I have in mind for pulling challenges and bringing them into the OpenChallenges DB. The idea is that candidate challenges would be stored by a data curator service. This curator service would talk to the challenge service and organization service to know which resources are actually in the so-called OpenChallenges DB. To enable full-text search and advanced search features, approved challenges would be stored in Elasticsearch, which would then power a search service.

I will present this architecture in more depth at our next meeting, but comments and questions are already welcome.

tschaffter commented 1 year ago

Update

This architecture diagram captures the entire data processing workflow.

The remaining piece of work is the creation of a search service that leverages Elasticsearch, ideally on its own rather than the SQL + Elasticsearch combination implemented by the challenge and organization services.
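For the sake of discussion, here is a rough sketch of how the search service could index approved challenges into Elasticsearch and run a full-text query. The index name, document fields, and client version (v8) are assumptions, not the actual OpenChallenges implementation.

```python
# Rough sketch of indexing approved challenges into Elasticsearch and querying
# them from a search service. Index name and document fields are illustrative
# assumptions; assumes the elasticsearch Python client v8.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

challenge = {
    "slug": "awesome-challenge",
    "name": "Awesome Challenge",
    "description": "Predict something hard from openly shared data.",
    "status": "active",
}

# Index (or re-index) the approved challenge, keyed by its slug.
es.index(index="challenge", id=challenge["slug"], document=challenge)

# Full-text query across the name and description fields.
results = es.search(
    index="challenge",
    query={"multi_match": {"query": "open data", "fields": ["name", "description"]}},
)
print(results["hits"]["hits"])
```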

tschaffter commented 1 year ago

Update

We are making good progress on architecting how the data will flow from the different data sources to the OC DB.

My next highest priority is to develop the search service where challenge and organization data will meet to power the different queries we have discussed over the past few months.

tschaffter commented 1 year ago

Moving to Sprint 23.03

tschaffter commented 1 year ago

Update 2023-04-03

We have a draft of the workflow in Lucidchart, which needs to be cleaned up. I won't have time to work on this Epic in April, but hopefully I can come back to it at the end of May or in June.

Adding tentatively to May sprint.

tschaffter commented 1 year ago

Return to Backlog