tschaffter opened this issue 1 year ago
Let's start with how we could push data to the registry.
There are two reasons we won't use the REST API to push data to the database for this epic:
Note: Rong was generating the JSON files from CSV data.
I made sure we could build the DB CLI after moving it to the monorepo, but it has not been maintained and has since been "disconnected" (by renaming its `project.json` to `project.json.off`). I will take care of reconnecting the DB CLI to the monorepo to make sure that it can be built and run.
Legacy DB CLI: https://github.com/Sage-Bionetworks/rocc-db-client
DB CLI in the monorepo: `apps/challenge-registry/db-cli`
The DB CLI was developed to push data to a MongoDB instance. Since then we have switched to a MariaDB (SQL) instance. The DB CLI must therefore be updated to push data to the MariaDB instance of the challenge registry project.
The legacy DB CLI includes different seeds (JSON data) that we were pushing to the MongoDB instance.
https://github.com/Sage-Bionetworks/rocc-db-client/tree/main/data/seeds
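To make the discussion concrete, here is a minimal sketch of what the updated DB CLI could do with one of those seeds: read the JSON file and push its records into MariaDB. The seed path, table name, columns, and connection settings are all placeholders, not the actual schema.

```python
"""Minimal sketch: push a legacy JSON seed into MariaDB.

Assumptions (placeholders, not the real schema): the seed is a JSON array of
challenge objects with "name" and "status" fields, and the database has a
`challenge` table with matching columns.
"""
import json

import pymysql  # MariaDB speaks the MySQL wire protocol

with open("data/seeds/challenges.json") as f:  # hypothetical seed file
    challenges = json.load(f)

conn = pymysql.connect(
    host="localhost", user="maria", password="changeme", database="challenge_registry"
)
try:
    with conn.cursor() as cur:
        for challenge in challenges:
            cur.execute(
                "INSERT INTO challenge (name, status) VALUES (%s, %s)",
                (challenge["name"], challenge["status"]),
            )
    conn.commit()
finally:
    conn.close()
```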
There are two ways of validating the data.
Ideally we should use both approaches. Developing and sharing a Schema.org/JSON-LD schema of the data included in the registry is a must down the road. Moreover, the pages of the challenge registry can now expose JSON-LD data. The Schema.org/JSON-LD schema would largely contribute to the JSON-LD metadata that we will ultimately embed in the pages.
We can use the Schema.org/JSON-LD schema in a GitHub workflow to validate the data submitted by a contributor without needing a running instance of the MariaDB (faster and easier to validate).
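As an illustration, a CI step could run a small validation script over the submitted JSON files. The schema below is a plain JSON Schema placeholder (made-up required fields and slug pattern), standing in for the eventual Schema.org/JSON-LD schema.

```python
"""Sketch of a validation step that a GitHub workflow could run.

The schema is a placeholder; the real check would rely on the
Schema.org/JSON-LD schema once it is defined.
"""
import json
import sys

from jsonschema import Draft202012Validator

# Placeholder rules: every challenge must at least have a name and a slug.
SCHEMA = {
    "type": "object",
    "required": ["name", "slug"],
    "properties": {
        "name": {"type": "string"},
        "slug": {"type": "string", "pattern": "^[a-z0-9]+(-[a-z0-9]+)*$"},
    },
}

def validate_file(path: str) -> bool:
    with open(path) as f:
        document = json.load(f)
    errors = list(Draft202012Validator(SCHEMA).iter_errors(document))
    for error in errors:
        print(f"{path}: {error.message}")
    return not errors

if __name__ == "__main__":
    # Usage: python validate.py file1.json file2.json ...
    results = [validate_file(path) for path in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```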
The advantage of JSON data is that we can validate it with a Schema.org/JSON-LD schema. It is also easy to manipulate JSON data programmatically. The question is whether we work only with JSON files or whether we also want to support a spreadsheet format (referred to here as CSV).
The CSV format could be more human-friendly for external contributors - and potentially for us too, since we already have data in CSV format - to input their data. Note that this is the approach that the FAIR Data workstream has developed to enable Data Coordinating Centers (DCCs) to push their data. Given the short amount of time that separates us from the release of the private preview, we may not be able to set up the tools developed by the FAIR Data workstream (DCC Validator, DCC Curator, etc.). Yet we could contact Milen to get insights into the CSV-to-JSON conversion if we want to support this feature.
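If we do support spreadsheet input, the CSV-to-JSON step itself is small; the hard part is agreeing on the template. A rough sketch, with made-up column names:

```python
"""Sketch of a CSV-to-JSON conversion for spreadsheet contributions.

The column names are hypothetical; the real template would be defined with
contributors (or aligned with the FAIR Data workstream tooling).
"""
import csv
import json

def csv_to_json(csv_path: str, json_path: str) -> None:
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))  # one dict per challenge row
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)

csv_to_json("challenges.csv", "challenges.json")
```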
Additional consideration:
Given the above considerations, I think that it's important to make it as easy as possible for organizations to submit their challenge data. Using a spreadsheet is likely the best approach when the data collection is done manually.
Here is how we could proceed:
We could push the data from different contributors to the DB instance one by one. However, it would be preferable that challenges that have been previously submitted keep their ID in the database. This is critical if we expose the DB record IDs as suggested in this app route proposal and use them to identify resources in the app routes. This would be less of an issue if we were relying on an immutable natural key (e.g. slug for a challenge or login for a user/organization).
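One way to make repeated loads stable without depending on DB record IDs is to make the load idempotent on the natural key. A sketch, assuming a `challenge` table with a UNIQUE constraint on `slug` (not the actual schema):

```python
"""Sketch of an idempotent load keyed on a natural key (the challenge slug).

Assumes a `challenge` table with a UNIQUE constraint on `slug`, so re-running
the load updates existing rows instead of creating new records and new IDs.
"""
import pymysql

def upsert_challenge(conn, challenge: dict) -> None:
    sql = (
        "INSERT INTO challenge (slug, name, status) VALUES (%s, %s, %s) "
        "ON DUPLICATE KEY UPDATE name = VALUES(name), status = VALUES(status)"
    )
    with conn.cursor() as cur:
        cur.execute(sql, (challenge["slug"], challenge["name"], challenge["status"]))

conn = pymysql.connect(
    host="localhost", user="maria", password="changeme", database="challenge_registry"
)
upsert_challenge(
    conn, {"slug": "example-challenge", "name": "Example Challenge", "status": "completed"}
)
conn.commit()
conn.close()
```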
We need to further discuss how we want to aggregate data and how we want to update them. Milen, Bruce and Jay may also provide valuable feedback.
@rrchai @vpchung This diagram shows the architecture I have in mind for pulling challenges and bringing them to the OpenChallenges DB. The idea is that candidate challenges would be stored by a data curator service. This curator service would talk to the challenge service and organization service to know which resources are actually in the so-called OpenChallenges DB. To enable full-text search and advanced search features, approved challenges would be stored in Elasticsearch, which would then power a search service.
I will present this architecture in more depth at our next meeting, but comments and questions are already welcome.
This architecture diagram captures the entire data processing workflow.
The remaining piece of work is the creation of a search service that leverages Elasticsearch, ideally instead of the SQL + Elasticsearch combination implemented by the challenge and organization services.
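To give a rough idea of the direction, the core of such a search service could be a full-text query against the Elasticsearch index of approved challenges. The index name and fields below are assumptions, not the actual mapping:

```python
"""Sketch of the full-text query a dedicated search service could run.

Assumes approved challenges are indexed in Elasticsearch under a hypothetical
"challenge" index with "name" and "description" fields.
"""
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_challenges(terms: str, size: int = 10) -> list[dict]:
    response = es.search(
        index="challenge",
        query={"multi_match": {"query": terms, "fields": ["name", "description"]}},
        size=size,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]

print(search_challenges("single-cell RNA-seq"))
```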
We are making good progress on architecting how the data will flow from the different data sources to the OC DB.
My next highest priority is to develop the search service where challenge and organization data will meet to power the different queries we have discussed over the past few months.
Moving to Sprint 23.03
We have a draft of the workflow in Lucidchart, which needs to be cleaned up. I won't have time to work on this Epic in April, but hopefully I can come back to it at the end of May or in June.
Adding tentatively to May sprint.
Return to Backlog
The goal of this Epic is to design and develop the challenge data ingress workflow that we will use to populate the challenge registry until our UI and REST API support write operations. These features should not be developed before 2023 Q2.
We need to identify how to perform the following operations:
Tasks