Closed — tschaffter closed this 4 years ago
Thanks @tschaffter for the summary. I have a few questions about this.
@thomasyu888
Yes, they can. This repo is public. It may also be used for other purposes, like hosting the baseline method.
I propose to start initially with one public submission queue that redirects submissions to an internal queue. When a new release of the dataset is available, we would create another internal submission queue and point the public queue to it. We would name the internal submission queues with a reference to the "version number" of the dataset. Each time we release a new version of the dataset, we add the submissions from the previous queue to the new queue. This way ALL past submissions will be evaluated on ALL future datasets. As the computational burden will increase with the number of submissions and dataset releases, we should define a system where both we and users can "retire" submissions that are no longer relevant. Note that across all the submissions a given team made on version N of the dataset, it is not necessarily the submission that achieves the best performance on N that will achieve the best performance on N+1.

All the internal queues created should stay active and point to archived versions of the dataset on the target server running the submissions. This will enable us to reproduce the results of any submission if needed for the sake of reproducibility.

Tim would like to use this mechanism to also evaluate new submissions on past versions of the dataset. While the extra information could be useful, I would like to make two comments: 1) there is no guarantee that a new method will run on an older version of the dataset (the format and content of the dataset may change over time) and 2) this would lead to an explosion of the computational resources needed. Once we have identified that we actually need this extra information, we could then try to run a few selected models on all versions of the dataset while still keeping in mind the two comments I made above.
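A minimal sketch of the carry-forward policy described above, in plain Python (all class and function names here are hypothetical; a real implementation would act on Synapse evaluation queues):

```python
from dataclasses import dataclass, field

@dataclass
class Submission:
    id: str
    retired: bool = False  # organizers or users may "retire" stale submissions

@dataclass
class InternalQueue:
    dataset_version: str
    submissions: list = field(default_factory=list)

def release_new_dataset(previous: InternalQueue, new_version: str) -> InternalQueue:
    """Create the internal queue for a new dataset release, carrying forward
    every non-retired submission so it is re-evaluated on the new data."""
    carried = [s for s in previous.submissions if not s.retired]
    return InternalQueue(dataset_version=new_version, submissions=carried)
```

Retiring a submission in queue N then prevents it from being copied into queue N+1, which keeps the compute cost from growing without bound.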
I would like to promote the use of GitHub repos as the default submission type. I see several advantages but would like to further discuss potential issues. 1) This gives us access to the entire codebase of a submission that we know is working, since we use it to build a Docker image and run it (reproducibility). 2) The codebase stays where it was developed, which, once made publicly available, provides access to additional information like GitHub tickets created during development and past commits; these can shed useful light on the strategies a team explored, not only their best-performing one. 3) Once released publicly, the community will be redirected to the team's GitHub repo, thus shedding more light on the developers of the method (visibility).
@tschaffter Thanks for your comments.
I will add the CWL files here
Are we planning on running the infrastructure in our AWS account, or will this run at UW? What you proposed makes sense, but I can definitely see this getting out of control really quickly in terms of computation. We need to have a strategy to pick which models to run on the new queues, but the architecture can be similar to what I have built recently. I will make a diagram to see if you agree.
I really like the idea of using GitHub repos as submissions, and I agree with all of your points, but I have many comments about this.
I'm sure I will have more thoughts about this as I think about it more, but with everything that has been pointed out above:
Inserting myself here because I'd like to stay up-to-date with the challenge! :)
Also pretty interested in this new approach to the infra with GH submissions. It sounds promising, but I'm a little concerned that this may add complexity to an already complex system. Even now, a good number of participants are still having trouble with Docker submissions. I'd be worried that requiring them to learn and use git could add another hurdle, especially for our biology-focused participants... But regardless! I'm very interested in how this plays out.
@vpchung It's a great question and a concern we should keep in mind. The use of a technology like GitHub may vary from one community to another. It would be interesting to survey the participants of the EHR DREAM Challenge about their experience using Docker and GitHub. We have an opportunity to do so as part of the questionnaire that we will send them shortly after posting the final results. My guess is that most developers who use Docker also have prior experience with Git and would be able to quickly grasp its basic usage.
Sage has a mission to promote the use of best coding practices. Developing and sharing code using Git is an important one. Git enables reproducibility and increases the visibility of researchers' work. From now on, a short video tutorial should accompany the launch of future challenges to show participants every step of the submission process.
Given the imminent launch of this challenge, we should reuse the infrastructure of the Patient Mortality Challenge as it is (e.g. Docker submissions) and start accepting Git-based submissions after the launch of the challenge. I have two questions:
Let's take the discussion of GH submissions offline. I have prepared a GH submission proposal document which we can look through.
That being said I want to talk about the infrastructure.
New features:
See Lucidchart
Based on yesterday's discussion with Justin and Sean:
@tschaffter
The submission quota you are requesting is not possible with Synapse. We can most definitely limit the number of successful submissions to 1, but that means people can only submit one at a time. I would have to write code to check the number of submissions a person has made in a day and mark any beyond the quota as INVALID.
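For illustration, the daily-quota check could look something like this (a sketch only; the function name and the VALIDATED/INVALID status strings are assumptions, not the actual Synapse workflow statuses):

```python
from datetime import datetime

def check_daily_quota(past_submissions, user_id, now, limit=1):
    """Count the user's submissions on the same calendar day as `now` and
    return the status the new submission should receive.
    `past_submissions` is a list of (user_id, datetime) pairs."""
    count = sum(1 for uid, ts in past_submissions
                if uid == user_id and ts.date() == now.date())
    return "VALIDATED" if count < limit else "INVALID"
```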
@thomasyu888 I've added the section Submission content
to my update. Let's discuss this and your questions at 1 pm.
@thomasyu888
Format validated by Justin.
{
  "docker": "docker.synapse.org/my-image@sha....",
  "description": "My awesome model does X and Y",
  "ranked_features": [
    "age",
    "gender"
  ],
  "references": [
    "https://github.com/me/my-project"
  ]
}
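A sketch of a validator for this file format, combining the checks discussed in this thread (non-empty description, image pinned by a full sha256 digest, and ranked_features/references defined but possibly empty); the function name and error messages are my own:

```python
import json
import re

# Full 64-hex-character digest required, e.g. image@sha256:<digest>
DIGEST_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")

def validate_submission(raw: str) -> list:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    try:
        sub = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    if not DIGEST_RE.match(sub.get("docker", "")):
        errors.append("docker must reference an image pinned by a sha256 digest")
    if not str(sub.get("description", "")).strip():
        errors.append("description must not be empty")
    for key in ("ranked_features", "references"):
        # the values are free-form, so only presence and type are checked
        if key not in sub or not isinstance(sub[key], list):
            errors.append(f"{key} must be defined as a list (it may be empty)")
    return errors
```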
@tschaffter, Thanks
- `description` isn't empty, but it can technically be anything
- `references` is going to be a bit trickier, so I'm going to opt not to check the permissions and have this be a manual process for the initial implementation

I confirmed a couple of things: the sha digest never goes away even if participants continuously push over a tag. That being said, if a participant decides to delete the repository itself, we wouldn't have a copy of it. So would we copy their docker?
@thomasyu888
Do you have a list of ranked_features so we can validate that value?
The values are not from a set so we can't validate them.
I'm just going to validate that description isn't empty, but it can technically be anything
Here is a proposed validation of the file format: the `ranked_features` and `references` properties must be defined but can be empty.

> So would we copy their docker?
Yes, we want to keep a copy of any Docker image that goes to UW and runs on EHR data. We are mainly interested in the images that run successfully on the data, BUT for security/monitoring/tracing purposes, we want to keep a copy of anything that goes to UW (the Docker image).
Does that make sense?
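One way to keep that copy, sketched here with hypothetical repository and submission names, is to pull the exact submitted digest and re-tag it into an organizer-owned archive repository, so the copy survives even if the participant later deletes their repo:

```python
import subprocess

def archive_commands(src_image: str, archive_repo: str, submission_id: str):
    """Build the docker commands that copy a submitted image (pinned by
    digest) into an organizer-owned archive repository."""
    dst = f"{archive_repo}:submission-{submission_id}"
    return [
        ["docker", "pull", src_image],
        ["docker", "tag", src_image, dst],
        ["docker", "push", dst],
    ]

def archive_image(src_image: str, archive_repo: str, submission_id: str) -> None:
    """Execute the archival steps (requires docker and push permissions)."""
    for cmd in archive_commands(src_image, archive_repo, submission_id):
        subprocess.run(cmd, check=True)
```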
@tschaffter
Thanks - I will work on this. One small difference is that I will most likely validate the existence of their docker image + sha digest prior to pulling.
@thomasyu888
Are you checking the sha digest to provide more detailed information to the user in case the submission is failing for this reason?
@tschaffter
I have code that checks if the image + sha-digest exists + if I have permission to view it. So if a participant has a typo or didn't give the correct permissions, the submission will be invalid.
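For illustration, such an existence/permission check can be done against the Docker Registry HTTP API v2 without pulling the image (a 200 on a HEAD of the manifest endpoint means the image + digest exists and is visible). The registry host and bearer-token handling here are simplified assumptions, not the actual validation code:

```python
import re
import urllib.error
import urllib.request

DIGEST_RE = re.compile(r"^(?P<name>[^@\s]+)@(?P<digest>sha256:[0-9a-f]{64})$")

def manifest_url(registry: str, image: str) -> str:
    """Build the Registry v2 manifest URL for an image pinned by digest."""
    m = DIGEST_RE.match(image)
    if not m:
        raise ValueError("image must be pinned by a sha256 digest")
    name = m.group("name")
    # strip the registry host prefix if present, e.g. "docker.synapse.org/"
    if name.startswith(registry + "/"):
        name = name[len(registry) + 1:]
    return f"https://{registry}/v2/{name}/manifests/{m.group('digest')}"

def manifest_exists(registry: str, image: str, token: str) -> bool:
    """HEAD the manifest: True if the image + digest exists and is viewable."""
    req = urllib.request.Request(
        manifest_url(registry, image),
        method="HEAD",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.docker.distribution.manifest.v2+json",
        },
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```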
I see now why this test is required: the main risk otherwise would be for the user to forget to add the sha digest, in which case we would always score the latest version of their image (assuming we have access to the image).
I would prefer to not go that route, because people could have changed their "latest" image by the time we get to running their docker image. The only way we can be 100% certain about running a specific version of their model is to take the sha-digest (similar to what was done in the DM challenge and all other challenges).
The tricky part will be if participants delete their repo before we run their submission, then it will simply be invalid as it doesn't exist at all
> I would prefer to not go that route
I was agreeing with you: checking that the sha-digest is specified is required to avoid any ambiguity
> The tricky part will be if participants delete their repo before we run their submission, then it will simply be invalid as it doesn't exist at all
This is acceptable. It's up for discussion, but I think the ideal system would make a copy of all the required resources upon submission to better match what the user may think ("I have sent my submission, it's done, the organizers have everything required to run it"). Let's add this point to future discussions about the challenge platform.
Ah, sorry for misunderstanding. Thanks. I will list out the steps of the workflow later today to see if you agree.
@tschaffter :
Synthetic Queue EC2
Here is the workflow for submissions:
UW Internal Queue EC2
Initial thoughts about running submissions for new datasets
Waiting on synthetic dataset and baseline method to test the infra
Deploy model-to-data configuration on UW site. Start from the configuration used for the EHR DREAM Challenge: Patient Mortality Prediction.