Deploy challenge infrastructure

tschaffter commented 4 years ago

Deploy model-to-data configuration on UW site. Start from the configuration used for the EHR DREAM Challenge: Patient Mortality Prediction.

tschaffter commented 4 years ago

Specifications

"Live" leaderboard (updatedd 2-3 times per week)
As the dataset will be growing on a regular basis, we want to be able to re-score all the submissions that have been previously submitted (while still keeping an archive of the past scores).
Participants will submit link to GitHub repositories. In order to process private repo, we need to ask participants to invite a GH Machine User that we will create for this challenge.

thomasyu888 commented 4 years ago

Thanks @tschaffter for the summary. I have a few questions about this.

Should the CWL tools + workflows live here? Will the infrastructure be run in UW environment?
With respect to re-scoring. I think the best bet is for us to "resubmit" all valid models at the time new data comes in, which will add a new entry to queue. I am open to also continuously adding annotations to a submission, but we have run into difficulty re-running internal queue submissions.
Is the motivation for submitting github repositories just to collect their method writeups? I think it is a security issue if we are building participant Dockerfile (even more so than executing participant Docker images).

tschaffter commented 4 years ago

@thomasyu888

Yes, they can. This repo is public. It may also be used for other purpose like hosting the baseline method.
I propose to start initially with one public submission queue that redirects submissions to an internal queue. When a new release of the dataset is available, we would create another internal submission queue and have the public queue pointing to it. We would name the internal submission queues with a reference to the "version number" of the dataset. Each time that we release a new version of the dataset, we add the submissions from the previous queue to the new queue. This way ALL past submissions will be evaluated on ALL the future datasets. As the computational burden will increase with the number of submission and dataset releases, we should define a system where both us and users could "retire" submissions that we think are no longer relevant. Note that across all the submissions of a given team made on the version N of the dataset, it's not necessarily the submission that achieve the best performance on N that will achieve the best performance on N+1. All the internal queues created should stay active and pointing to archived versions of the dataset on the target server running the submissions. This will enable us to reproduce the results of any submission if need for the sake or reproducibility. Tim would like to use this mechanisms to also evaluate new submissions to past version of the dataset. While the extra information could be useful, I would like to make two comments: 1) there is no grantee that a new method will run on an older version of the dataset (format and content of the dataset may change over time) and 2) this would lead to an explosion of the computation resources needed. Once we have identified that we actually need this extra information, we could then try to run a few selected models on all versions of the dataset while still keeping in mind the two comments I made above.
I would like to promiote the use of GitHub repo as the default submission type. There see several advantages but would like to further discuss potential issues. 1. This allows us to get access to the entire codebase of the submission that we know is working as we are using it to build a Docker image and run it (reproducibility). 2) The codebase stays where it has been developed, which once made publicly available, provides access to additional information like GitHub tickets created during the development and past commits, which can provide useful information on the past strategies that a team may have explored and not only their best performing strategy. 3) Once released publicly, the community will be redirected to the GitHub repo of the team, thus shedding more light on the developers of the method (visibility).

thomasyu888 commented 4 years ago

@tschaffter Thanks for your comments.

I will add the CWL files here
Are we planning on running the infrastructure in our AWS or will this run in UW? What you proposed make sense, but I can definitely see this getting out of control really quick with computation. We need to have a strategy to pick which models to run on the new queues, but the architecture can be similar to what I have built recently. I will make a diagram to see if you agree.
I really like the idea of using GitHub repos as submissions, and I agree with all of your points, but I have many comments about this.
- GitHub submissions can replace our "writeup submission". That being said, there is absolutely no guarantee that people won't delete their repo, so we would need to fork their repo. (Would we create a GitHub organization per challenge?) (Edit: Deleting a private repository will delete all of its forks. Deleting a public repository will not delete its forks.)
- Before making GitHub repos as a default for model-to-data challenges, we should also think about the impact of this decision for future challenges and the people running challenges - I would like buy in and opinion from more people. I'm all for promoting standards, but adding another technology isn't always easy. (Case in point: our own internal GitHub provisioning system)
- In terms of a security standpoint, I do not recommend building other people's Dockerfiles. I want to even say I am strongly against this. The main reason for this is because it would require network access to build other people's Dockerfiles and that introduces pretty big security risks. That being said, after some internal discussion, there is one route we can take - using the DockerHub automated build system. Participant submissions would then be built on an external resource and not in our compute environment. That being said, I think it would quite some time to achieve this programmatically? Are we going to have participants use Dockerhub automated builds? Already we have another layer of complexity.
- If we are pushing people to use GitHub, then what is the motivation for synapse projects? Currently, people would still need synapse projects to submit to a challenge. I am aware there are ways around this, but this should be more thoroughly thought through as well.

I'm sure I will have more thoughts about this as I think about it more, but with everything that has been pointed out above:

When do we plan on starting this challenge? (collection of submissions)
Depending on the answer above, I recommend potentially using GitHub repos to collect method descriptions and code as a starting point, so that we have more time to think about the issues above. We can then also see how much push back from participants about putting their method writeup and code in GitHub

vpchung commented 4 years ago

Inserting myself here because I'd like to stay up-to-date with the challenge! :)

Also pretty interested in this new approach to the infra with GH submissions. Sounds promising but I'm a little concern that this may add complexity to an already complex system? Even now, a good number of participants are still having trouble with Docker submissions. I'd be worried that requiring them to learn and use git could add another hurdle, especially for our biology-focused participants... But regardless! I'm very interested in how this plays out.

tschaffter commented 4 years ago

@vpchung It's a great question and a concern we should keep in mind. The use of a technology like GitHub may vary from one community to another. It would be interesting to interrogate the community of participants to the EHR DREAM Challenge on their experience using Docker and GitHub. We have an opportunity to do so as part of the questionnaire that we will send them shortly after posting the final results. My guess is that most of the developers who use Docker also have prior experience with Git and would be able to quickly grasp its basic usage.

Sage has a mission to promote the use of best coding practices. Developing and sharing code using Git is an important one. Git enables reproducibility and increase the visibility of researchers' work. From now on, a short video tutorial should accompany the launch of future challenges to show participants every step of the submission process.

Given the imminent launch of this challenge, we should reuse the infrastructure of the Patient Mortality Challenge as it is (e.g. Docker submissions) and start accepting Git-based submissions after the launch of the challenge. I have two questions:

How complicated is it to configure the submission workflow to take as input a git repo URL, build a docker image using a Dockerfile included in the repo before running the image on data?
What are the security risks of implementing 1.?

thomasyu888 commented 4 years ago

Lets take the discussion of GH submissions offline. I have prepared a GH submission proposal document which we can look through.

That being said I want to talk about the infrastructure.

Accepts docker images as submission
Runs through synthetic dataset (train/infer)
Submits to internal queue
Runs through internal train/infer
validate and scores

New features:

Participants now only have to look at one leaderboard/dashboard instead of two, this will also come in handy if we need to add more machines on UW side to support this challenge)
All the logs can be uploaded into one folder (Since we aren't uploading logs for internal queues, this isn't a priority)

tschaffter commented 4 years ago

Update

See Lucidchart

tschaffter commented 4 years ago

Update

Based on yesterday's discussion with Justin and Sean:

Submission content

1 model
1 write-up

Submission quota

up to 1 successful submission / team / day
up to 10 submissions / team / day (limit spam of submissions that do not complete)

Results

Results scores stored with three decimals in submission metadata
Results table Published once a week
Will attempt to get the Bayes BootLadderBoot (BBLB) ready as soon as possible, but given the low amount of information returned (cropped scores), it should be OK to return systematically the performance of the submissions to the submitter during the first couple of weeks after the launch of the challenge.

thomasyu888 commented 4 years ago

@tschaffter

Submission Quota

The submission quota you are requesting is not possible (with synapse). We can most definitely limit the number of successful submissions to 1, but that means people can only submit one at a time. I would have to write code to check the number of submissions a person has made in a day INVALID .

I recommend going 1 submission a route (people won't be able to submit once they've submitted until their model errors out)

Results

Did you want only the results of that week?

tschaffter commented 4 years ago

@thomasyu888 I've added the section Submission content to my update. Let's discuss this and your questions at 1 pm.

tschaffter commented 4 years ago

@thomasyu888

Format validated by Justin.

{
  "docker": "docker.synapse.org/my-image@sha....",
  "description": "My awesome model does X and Y",
  "ranked_features": [
    "age",
    "gender"
  ],
  "references": [
    "https://github.com/me/my-project",
  ]
}

thomasyu888 commented 4 years ago

@tschaffter, Thanks

Do you have a list of ranked_features so we can validate that value?
I'm just going to validate that description isn't empty, but it can technically be anything
references is going to be a bit trickier, so Im going to opt to not check for the permissions and have this be a manual process for the initial implementation

I confirmed a couple of things. shadigest never goes away even if participants continuously push over tag. That being said, if a participant decides to delete the repository itself, we wouldn't have a copy of it. So would we copy their docker?

tschaffter commented 4 years ago

@thomasyu888

Do you have a list of ranked_features so we can validate that value?

The values are not from a set so we can't validate them.

I'm just going to validate that description isn't empty, but it can technically be anything

Here is a proposed validation of the file format:

No need to check the format of the Docker image. The validation should fail later if we can't pull the image.
Description should not be empty
ranked_features and references properties must be defined but can be empty

So would we copy their docker?

Yes, we want to keep a copy of any Docker image that goes to UW and run on EHR. We are mainly interested in the image that run successfully on the data BUT for security/monitoring/tracing purpose, we want to keep a copy of anything that goes to UW (the Docker image).

Does that make sense?

thomasyu888 commented 4 years ago

@tschaffter

Thanks - I will work on this. One small difference is that I will most likely validate the existence of their docker image + sha digest prior to pulling.

tschaffter commented 4 years ago

@thomasyu888

Are you checking the sha digest to provide more detailed information to the user in case the submission if failing for this reason?

thomasyu888 commented 4 years ago

@tschaffter

I have code that checks if the image + sha-digest exists + if I have permission to view it. So if a participant has a typo or didn't give the correct permissions, the submission will be invalid.

tschaffter commented 4 years ago

I see now why this test is required: the main risk otherwise would be for the user to forget to add the sha-digest and so we would always score the latest version of his image (assuming we have access to the image).

thomasyu888 commented 4 years ago

I would prefer to not go that route, because people could have changed their "latest" image by the time we get to running their docker image. The only way we can be 100% certain about running a specific version of their model is to take the sha-digest (similar to what was done in the DM challenge and all other challenges).

The tricky part will be if participants delete their repo before we run their submission, then it will simply be invalid as it doesn't exist at all

tschaffter commented 4 years ago

I would prefer to not go that route

I was agreeing with you: checking that the sha-digest is specified is required to avoid any ambiguity

The tricky part will be if participants delete their repo before we run their submission, then it will simply be invalid as it doesn't exist at all

This is acceptable. It's up to discussion but I think that the idea system could be to make a copy of all the required resources upon submission to better match what the user may think ("I have sent my submission, it's done, the organizer have everything required to run it"). Let's add this point to future discussion about the challenge platform.

thomasyu888 commented 4 years ago

Ah, sorry for misunderstanding. Thanks. I will list out the steps of the workflow later today to see if you agree.

thomasyu888 commented 4 years ago

@tschaffter :

Synthetic Queue EC2

Here is the workflow for submissions:

get the submission id (main_subid) of main queue submission
Download json submission
Validate the json submission + check docker image exists/correct permissions
archive the docker (push into a private project to save the repo)
run training on synthetic data? (is there training)
run inference on synthetic data
upload prediction file
validate
submit the archived docker to internal queue (internal_subid)

UW Internal Queue EC2

get submission of (internal_subid) on UW infra
run the docker on UW infra
no docker logs are returned on internal data?
update the annotations for (main_subid) so participants only need to look at one table - scores, validation results, runtime

thomasyu888 commented 4 years ago

Initial thoughts about running submissions for new datasets

create tool to submit scored submissions to new datasets
annotate leaderboard with columns per dataset_score.

tschaffter commented 4 years ago

no training at this point

thomasyu888 commented 4 years ago

Waiting on synthetic dataset and baseline method to test the infra

tschaffter commented 4 years ago

Update

Infrastructure diagram (lucidchart)
Tom to work with Tim to deploy the submission pipeline

data2health / covid19-challenge