Sage-Bionetworks / sage-monorepo

Where OpenChallenges, Schematic, and other Sage open source apps are built
https://sage-bionetworks.github.io/sage-monorepo/
Apache License 2.0

[Story] Identify a strategy for pulling Kaggle Competitions and adding them to the OpenChallenges DB #1238

Open tschaffter opened 1 year ago

tschaffter commented 1 year ago

We have a microservice that can pull Kaggle Competitions using the Kaggle API and send them to the Kafka cluster.

The goal of this Story is to devise a workflow (architecture) for:

Collecting Kaggle Competitions

The simplest approach would be to collect all Kaggle competitions at regular intervals, then push them to the Kafka cluster. It is then up to a downstream service to decide what to do with each competition, e.g. filter out the ones that are already in the OpenChallenges DB.

Another approach would be to pull only the Kaggle competitions that have been created or updated in the last N minutes, and then run this task every N minutes. The Kaggle API does not provide fine-grained options to pull only the challenges we are interested in, so we could just pull everything and then filter in the service. For reference, here is an example of a Kaggle competition object:

  {
    "titleNullable": "1st and Future - Player Contact Detection",
    "urlNullable": "https://www.kaggle.com/competitions/nfl-player-contact-detection",
    "descriptionNullable": "Detect Player Contacts from Sensor and Video Data",
    "organizationNameNullable": "The National Football League",
    "organizationRefNullable": null,
    "categoryNullable": "Featured",
    "rewardNullable": "$100,000",
    "userRankNullable": null,
    "maxTeamSizeNullable": 5,
    "evaluationMetricNullable": "Matthews correlation coefficient",
    "id": 40277,
    "ref": "https://www.kaggle.com/competitions/nfl-player-contact-detection",
    "title": "1st and Future - Player Contact Detection",
    "hasTitle": true,
    "url": "https://www.kaggle.com/competitions/nfl-player-contact-detection",
    "hasUrl": true,
    "description": "Detect Player Contacts from Sensor and Video Data",
    "hasDescription": true,
    "organizationName": "The National Football League",
    "hasOrganizationName": true,
    "organizationRef": "",
    "hasOrganizationRef": false,
    "category": "Featured",
    "hasCategory": true,
    "reward": "$100,000",
    "hasReward": true,
    "tags": [
      {
        "nameNullable": "health",
        "descriptionNullable": "Consider the health tag your data science gym. Get in there and work out those data science muscles on health analytics. Analyze heart disease until you sweat. Then recover with with a nice candy production dataset.",
        "fullPathNullable": "subject \u003e health and fitness \u003e health",
        "ref": "health",
        "name": "health",
        "hasName": true,
        "description": "Consider the health tag your data science gym. Get in there and work out those data science muscles on health analytics. Analyze heart disease until you sweat. Then recover with with a nice candy production dataset.",
        "hasDescription": true,
        "fullPath": "subject \u003e health and fitness \u003e health",
        "hasFullPath": true,
        "competitionCount": 5,
        "datasetCount": 7675,
        "scriptCount": 7488,
        "totalCount": 15168
      },
      {
        "nameNullable": "football",
        "descriptionNullable": "Some call it association football, some call it soccer, most call it sport ball. Come analyze the teams and players of the beautiful game.",
        "fullPathNullable": "subject \u003e health and fitness \u003e exercise \u003e sports \u003e football",
        "ref": "football",
        "name": "football",
        "hasName": true,
        "description": "Some call it association football, some call it soccer, most call it sport ball. Come analyze the teams and players of the beautiful game.",
        "hasDescription": true,
        "fullPath": "subject \u003e health and fitness \u003e exercise \u003e sports \u003e football",
        "hasFullPath": true,
        "competitionCount": 9,
        "datasetCount": 2075,
        "scriptCount": 865,
        "totalCount": 2949
      },
      {
        "nameNullable": "video data",
        "descriptionNullable": "",
        "fullPathNullable": "data type \u003e video data",
        "ref": "video data",
        "name": "video data",
        "hasName": true,
        "description": "",
        "hasDescription": true,
        "fullPath": "data type \u003e video data",
        "hasFullPath": true,
        "competitionCount": 8,
        "datasetCount": 255,
        "scriptCount": 123,
        "totalCount": 386
      },
      {
        "nameNullable": "tabular",
        "descriptionNullable": "",
        "fullPathNullable": "data type \u003e tabular",
        "ref": "tabular",
        "name": "tabular",
        "hasName": true,
        "description": "",
        "hasDescription": true,
        "fullPath": "data type \u003e tabular",
        "hasFullPath": true,
        "competitionCount": 2660,
        "datasetCount": 5667,
        "scriptCount": 4799,
        "totalCount": 13126
      }
    ],
    "deadline": "2023-03-01T23:59:00Z",
    "kernelCount": 0,
    "teamCount": 526,
    "userHasEntered": false,
    "userRank": 0,
    "hasUserRank": false,
    "mergerDeadline": "2023-02-22T23:59:00Z",
    "newEntrantDeadline": "2023-02-22T23:59:00Z",
    "enabledDate": "2022-12-05T20:38:03.563Z",
    "maxDailySubmissions": 5,
    "maxTeamSize": 5,
    "hasMaxTeamSize": true,
    "evaluationMetric": "Matthews correlation coefficient",
    "hasEvaluationMetric": true,
    "awardsPoints": true,
    "isKernelsSubmissionsOnly": true,
    "submissionsDisabled": false
  },

We could use the property enabledDate, which is probably when the Competition became publicly visible, to decide which competitions to send further in the workflow.
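As a rough illustration of filtering on enabledDate, here is a minimal Python sketch. It assumes the competitions have already been fetched as a list of dicts shaped like the object above; the function names `parse_kaggle_timestamp` and `recently_enabled` are hypothetical and are not part of the existing microservice.

```python
from datetime import datetime, timedelta, timezone


def parse_kaggle_timestamp(value: str) -> datetime:
    """Parse a timestamp such as "2022-12-05T20:38:03.563Z" into an aware datetime."""
    # Replacing the trailing "Z" keeps fromisoformat() working on Python < 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def recently_enabled(competitions, now, window):
    """Return the competitions whose enabledDate falls within (now - window, now]."""
    cutoff = now - window
    return [
        c for c in competitions
        if cutoff < parse_kaggle_timestamp(c["enabledDate"]) <= now
    ]


# Usage with the values from the example object above.
competitions = [{"id": 40277, "enabledDate": "2022-12-05T20:38:03.563Z"}]
now = datetime(2022, 12, 5, 20, 40, tzinfo=timezone.utc)
print([c["id"] for c in recently_enabled(competitions, now, timedelta(minutes=5))])
# prints [40277]
```

This only catches newly enabled competitions. It says nothing about competitions that were edited after they became visible.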

Note that the object does not include a property similar to updatedAt, so filtering on enabledDate alone would miss challenges that were updated after being enabled, if we also want to capture those.

Actually, can we specify to the Kaggle API the fields that we want to collect? Does the Kaggle API return all the fields by default?

We likely need to compare the received Kaggle competitions with those that we have in the OpenChallenges database. For the sake of keeping the infrastructure modular, let's limit the task of the Kaggle to Kafka service to pulling all Kaggle competitions and sending them to the Kafka cluster, then have at least one other service that further processes the data.
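The comparison step in the downstream service could look something like the sketch below. It assumes that the Kaggle `id` of each imported competition is stored somewhere in OpenChallenges and can be loaded as a set; that storage, and the function name `split_new_and_known`, are assumptions made for illustration.

```python
def split_new_and_known(received, known_ids):
    """Partition received Kaggle competitions by whether their id is already known.

    `known_ids` is a set of Kaggle competition ids already imported
    (hypothetical: how these ids are stored in OpenChallenges is not decided).
    """
    new, known = [], []
    for competition in received:
        if competition["id"] in known_ids:
            known.append(competition)  # candidate for an update check
        else:
            new.append(competition)  # candidate for creation
    return new, known


# Usage: the id below comes from the example object in this issue.
received = [{"id": 40277, "title": "1st and Future - Player Contact Detection"}]
new, known = split_new_and_known(received, known_ids=set())
print(len(new), len(known))
# prints 1 0
```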

Processing Kaggle competitions

The mapping from the Kaggle schema(s) (archive and/or API) to the OpenChallenges schema will be identified and documented in #1251.

One idea suggested in the above ticket is to store the raw Kaggle objects in a document database such as MongoDB or Elasticsearch. That way we can later decide to make use of information from the Kaggle object that we are currently not using.

The Kaggle challenge objects still need to be converted to OpenChallenges schema in order to be added to the OpenChallenges DB. We have two strategies:

I believe that it is a good idea to enable a human to review new challenge objects before they are added to the DB, at least while we work with a new organization and are figuring out the mapping and whether some fields may have missing information. For a given organization, we could later disable the human review once a satisfactory quality has been achieved.

Note that we don't expect challenges to be created every day, so the expected volume should be low enough that it's OK to have someone reviewing them regularly (I could put in place an email notification system at some point).

We should also consider the case where information about a challenge that is already in the OpenChallenges DB has changed. We should be careful when updating a challenge, as this challenge is referenced in other OpenChallenges data schemas. Maybe we can automatically update some fields that have changed (e.g. challenge description, number of participants, start/end date, etc.) but request human review when other fields change (e.g. the challenge name, which may impact the challenge slug?).

Practically, a service would map Kaggle challenge objects to OpenChallenges challenge objects and store them in a table. The challenges in this table would need to be reviewed, potentially manually updated, and approved before a given challenge is added to the OpenChallenges DB. These operations could be performed using an admin dashboard that is only accessible to OpenChallenges admins. In the future, we may consider bringing this feature to the public app and enabling organizations to curate the information themselves.
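One way to express the split between automatic updates and human review is a per-field policy, sketched below. The field names come from the Kaggle object above; which fields belong in which set is only an assumption for illustration, and `teamCount` counts teams, so it is at best a stand-in for the number of participants.

```python
# Hypothetical policy: which changed fields may be applied without review.
AUTO_UPDATE_FIELDS = {"description", "deadline", "teamCount"}
# Hypothetical policy: changes here may affect the challenge slug.
REVIEW_FIELDS = {"title"}


def classify_changes(stored, incoming):
    """Compare two versions of a competition and split the changed fields.

    Returns (auto, review): dicts mapping field name to the incoming value.
    """
    auto, review = {}, {}
    for field in sorted(AUTO_UPDATE_FIELDS | REVIEW_FIELDS):
        if stored.get(field) != incoming.get(field):
            target = review if field in REVIEW_FIELDS else auto
            target[field] = incoming.get(field)
    return auto, review
```

Fields outside both sets are ignored here. A stricter variant could send any unknown changed field to review instead.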
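A minimal sketch of such a staging record is shown below. The actual field mapping is the subject of #1251, so the `StagedChallenge` fields, the `ReviewStatus` values, and the function names here are placeholders chosen for illustration, not the OpenChallenges schema.

```python
from dataclasses import dataclass, replace
from enum import Enum


class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass(frozen=True)
class StagedChallenge:
    """A mapped challenge waiting for review (hypothetical fields)."""
    source_id: int
    name: str
    description: str
    website_url: str
    end_date: str
    status: ReviewStatus = ReviewStatus.PENDING


def stage_kaggle_competition(competition: dict) -> StagedChallenge:
    """Map a Kaggle competition object to a staged record pending review."""
    return StagedChallenge(
        source_id=competition["id"],
        name=competition["title"],
        description=competition["description"],
        website_url=competition["url"],
        end_date=competition["deadline"],
    )


def approve(staged: StagedChallenge) -> StagedChallenge:
    """Return a copy marked as approved, e.g. after an admin reviewed it."""
    return replace(staged, status=ReviewStatus.APPROVED)
```

Only records with status `APPROVED` would then be copied to the OpenChallenges DB. Keeping the record immutable makes each review decision an explicit new value rather than an in-place edit.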

vpchung commented 1 year ago

@tschaffter created an event model for this.

tschaffter commented 1 year ago

Added to Sprint 23.03

vpchung commented 1 year ago

3/16 update: I reviewed the use cases and am on board with our proposed strategies 🚀

tschaffter commented 1 year ago

Weekly update

We have a draft of the event data model. This week, I will try to finalize it so that we have a clear picture of how Kaggle data will be processed. Next sprint, this event data model will be applied to another 1-2 data sources to see if it is robust and how it should be adapted to support these new data sources.

tschaffter commented 1 year ago

Weekly Update

I will resume this work in May.

tschaffter commented 1 year ago

Returning to Backlog

tschaffter commented 1 year ago

Added to Sprint 23.10. Tentatively keeping Verena and me, but this could also be assigned to Gaia.

tschaffter commented 1 year ago

Added to Backlog