Open tschaffter opened 1 year ago
@tschaffter created an event model for this.
Added to Sprint 23.03
3/16 update: I reviewed the use cases and am on board with our proposed strategies 🚀
We have a draft of the event data model. This week, I will try to finalize it so that we have a clear picture of how Kaggle data will be processed. Next sprint, this event data model will be applied to another 1-2 data sources to see if it is robust and how it should be adapted to support these new data sources.
I will resume this work in May.
Returning to Backlog
Added to Sprint 23.10. Tentatively keeping Verena and me, but this could also be assigned to Gaia.
Added to Backlog
We have a microservice that can pull Kaggle competitions using the Kaggle API and send them to the Kafka cluster.
The goal of this Story is to devise a workflow (architecture) for:
Collecting Kaggle Competitions
The simplest approach would be to collect all Kaggle competitions at a regular interval, then push them to the Kafka cluster. It is then up to a downstream service to decide what to do with each competition, e.g. filter out the ones that are already in the OpenChallenges DB.
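As a rough sketch of that downstream filtering step (the function and field names below are illustrative assumptions, not an existing API):

```python
def filter_new_competitions(pulled, existing_ids):
    """Keep only competitions whose id is not already in the OpenChallenges DB.

    `pulled` is a list of competition dicts (assumed to carry an "id" key);
    `existing_ids` is the set of Kaggle competition ids already ingested.
    """
    return [comp for comp in pulled if comp["id"] not in existing_ids]

# Example: only the competition not yet in the DB survives the filter.
pulled = [{"id": "titanic"}, {"id": "digit-recognizer"}]
print(filter_new_competitions(pulled, {"titanic"}))
# → [{'id': 'digit-recognizer'}]
```

Keeping this logic out of the Kaggle puller is what lets the puller stay a dumb "pull everything, push everything" service.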
Another approach would be to pull only the Kaggle competitions that have been created or updated in the last N minutes, and then run this task every N minutes. The Kaggle API does not provide a fine-grained option to pull only the challenges we are interested in, so we could pull everything and then filter on the server. For reference, here is an example of a Kaggle competition object:
We could use the property `enabledDate`, which is probably when the competition became publicly visible, to decide what competitions to send further in the workflow. Note that the object does not include a property similar to `updatedAt`, so we would miss challenges whose information has changed if we also want to capture updated challenges. Actually, can we specify to the Kaggle API the fields that we want to collect? Does the Kaggle API return all the fields by default?
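A minimal sketch of the recency filter based on `enabledDate`, assuming the field is an ISO-8601 timestamp (the helper name and the keep-if-missing behavior are assumptions):

```python
from datetime import datetime, timedelta, timezone

def enabled_since(comp: dict, minutes: int, now: datetime) -> bool:
    """Return True if the competition's enabledDate falls in the last N minutes.

    Assumes `enabledDate` is an ISO-8601 timestamp; competitions without it
    are kept so that a downstream service can decide what to do with them.
    """
    raw = comp.get("enabledDate")
    if raw is None:
        return True
    enabled = datetime.fromisoformat(raw)
    return now - enabled <= timedelta(minutes=minutes)

now = datetime(2023, 3, 16, 12, 0, tzinfo=timezone.utc)
recent = {"enabledDate": "2023-03-16T11:55:00+00:00"}
old = {"enabledDate": "2023-03-01T00:00:00+00:00"}
print(enabled_since(recent, 15, now))  # True
print(enabled_since(old, 15, now))     # False
```

Because there is no `updatedAt`, a filter like this only catches newly enabled competitions, which is exactly the gap noted above.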
We likely need to compare received Kaggle competitions with those that we have in the OpenChallenges database. For the sake of keeping the infrastructure modular, let's limit the task of the Kaggle-to-Kafka service to pulling all Kaggle competitions and sending them to the Kafka cluster, then have at least one other service further process the data.
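A minimal sketch of such a Kaggle-to-Kafka service, assuming the `kaggle` and `kafka-python` packages and a local broker; the topic name `kaggle.competitions` and the paging loop are assumptions, not the final design:

```python
import json

def competition_to_message(comp: dict) -> bytes:
    """Serialize a competition dict into a UTF-8 JSON Kafka message."""
    return json.dumps(comp, sort_keys=True, default=str).encode("utf-8")

def publish_competitions(topic: str = "kaggle.competitions") -> None:
    # Lazy imports: both packages are external dependencies of this service.
    from kaggle.api.kaggle_api_extended import KaggleApi  # pip install kaggle
    from kafka import KafkaProducer                       # pip install kafka-python

    api = KaggleApi()
    api.authenticate()  # reads ~/.kaggle/kaggle.json credentials
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    page = 1
    while True:
        batch = api.competitions_list(page=page)
        if not batch:
            break
        for comp in batch:
            # vars() flattens the API model object into a plain dict.
            producer.send(topic, competition_to_message(vars(comp)))
        page += 1
    producer.flush()
```

Note the service does no filtering at all: it pulls every competition and publishes it, leaving dedup and mapping to downstream consumers.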
Processing Kaggle competitions
The mapping from Kaggle schema(s) (archive and/or API) will be identified and documented in #1251.
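The exact field mapping belongs in #1251; purely as an illustration of the shape such a converter could take (every field name on both sides here is an assumption, not the documented schema):

```python
def kaggle_to_openchallenges(comp: dict) -> dict:
    """Map a raw Kaggle competition dict onto a minimal OpenChallenges
    challenge dict. Field names are placeholders pending #1251."""
    return {
        "name": comp["title"],
        # Assumes `ref` ends with the competition slug, e.g. ".../titanic".
        "slug": comp["ref"].rsplit("/", 1)[-1],
        "description": comp.get("description", ""),
        "startDate": comp.get("enabledDate"),
        "endDate": comp.get("deadline"),
        "platform": "kaggle",
    }

raw = {"title": "Titanic", "ref": "competitions/titanic",
       "enabledDate": "2012-09-28", "deadline": "2030-01-01"}
print(kaggle_to_openchallenges(raw)["slug"])  # titanic
```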
One idea suggested in the above ticket is to store the raw Kaggle objects into a document database such as MongoDB or Elasticsearch. That way we can later decide to make use of information from the Kaggle object that we are currently not using.
The Kaggle challenge objects still need to be converted to the OpenChallenges schema in order to be added to the OpenChallenges DB. We have two strategies:
I believe that it is a good idea to enable a human to review new challenge objects before they are added to the DB, at least while we work with a new organization and are figuring out the mapping and whether some fields may have missing information. For a given organization, we could later disable the human review once satisfactory quality has been achieved.
Note that we don't expect challenges to be created every day, so the expected volume should be low enough that it's OK to have someone reviewing them regularly (I could put in place an email notification system at some point).
We should also consider the case where information about a challenge that is already in the OpenChallenges DB has changed. We should be careful when updating a challenge, as it is referenced in other OpenChallenges data schemas. Maybe we can automatically update some fields that have changed (e.g. challenge description, number of participants, start/end date, etc.) but request human review when other fields change (e.g. the challenge name, which may impact the challenge slug?).
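The auto-update-vs-review split could look something like this (the field lists below are an assumption to illustrate the idea, not a decision):

```python
# Fields a service might update automatically; everything else triggers review.
AUTO_FIELDS = {"description", "participantCount", "startDate", "endDate"}

def classify_changes(current: dict, incoming: dict):
    """Split changed fields into auto-applicable updates and ones to review."""
    auto, review = {}, {}
    for field, new_value in incoming.items():
        if current.get(field) == new_value:
            continue  # unchanged, nothing to do
        (auto if field in AUTO_FIELDS else review)[field] = new_value
    return auto, review

auto, review = classify_changes(
    {"name": "Titanic", "participantCount": 100},
    {"name": "Titanic ML", "participantCount": 150},
)
print(auto)    # {'participantCount': 150}
print(review)  # {'name': 'Titanic ML'}
```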
Practically, a service would map Kaggle challenge objects to OpenChallenges challenge objects and store them in a table. The challenges in this table would need to be reviewed, potentially manually updated, and approved before a given challenge is added to the OpenChallenges DB. These operations could be performed using an admin dashboard that is only accessible to OpenChallenges admins. In the future, we may consider bringing this feature to the public app and enabling organizations to curate the information themselves.
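The staging-table lifecycle could be as simple as a small status machine (statuses and transitions below are a sketch, not a spec):

```python
from enum import Enum

class ReviewStatus(Enum):
    PENDING = "pending"    # mapped challenge awaiting admin review
    APPROVED = "approved"  # ready to be inserted into the OpenChallenges DB
    REJECTED = "rejected"  # will not be ingested

# Allowed transitions in the staging table (an assumption, not a decision).
TRANSITIONS = {
    ReviewStatus.PENDING: {ReviewStatus.APPROVED, ReviewStatus.REJECTED},
    ReviewStatus.APPROVED: set(),
    ReviewStatus.REJECTED: set(),
}

def advance(status: ReviewStatus, target: ReviewStatus) -> ReviewStatus:
    """Move a staged challenge to a new status, enforcing the transitions."""
    if target not in TRANSITIONS[status]:
        raise ValueError(f"cannot move from {status.value} to {target.value}")
    return target

print(advance(ReviewStatus.PENDING, ReviewStatus.APPROVED).value)  # approved
```

Disabling human review for a trusted organization would then just mean auto-advancing its challenges from PENDING to APPROVED.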