Editorial guidelines for machine learning results

bemoody commented 5 years ago

A trained neural network (for example) is not source code.

It's also not data, in the sense of being a collection of objective, verifiable facts.

In my not-so-humble opinion, if the training process for a machine-learned model is not reproducible, the result cannot be called scientific.

Nonetheless, the published model can in theory be a basis for future scientific research, if it is rigorously validated.

In any case, I think we all agree that we want to encourage publication of these results on PhysioNet, and as such, we ought to establish formal editorial guidelines. Some ideas:

that all inputs must be fully documented and publicly available
that the training process must be fully documented (in the form of free and open source code)
that the output must be in an open format that is usable by free and open-source tools
that the use of reproducible processes is strongly encouraged, if not required

These are arguably a sub-category of "software", but we might also consider treating them as an entirely separate category of project.

I'd welcome any thoughts, in particular, from @rgmark and @gariclifford.

tompollard commented 5 years ago

@alistairewj @jraffa @tnaumann any thoughts on Benjamin's comment above? Should we introduce a new project type for PhysioNet (alongside data, software, challenge) to handle pre-trained models?

jraffa commented 5 years ago

In the past, have people used PhysioNet as their primary venue to distribute ML models?

bemoody commented 5 years ago

It's something that didn't come up often in the past, but has been increasingly prevalent in the last few Challenges.

We also mentioned this in the latest grant rebuttal, as being an area where we're hoping to expand.

alistairewj commented 5 years ago

The recently proposed Montreal data license discusses "representations", and how these representations are licensed. It might be a good framework for thinking about these submissions. I think in general we want to support useful derivations, but it doesn't seem necessary to limit it to those derived by ML.

tompollard commented 5 years ago

As we have a couple of models in the system already, it would be good to decide on how we want to treat them soonish. My vote is to stick with the current project types and generally treat models as software because:

models are typically connected to a programming language or framework in some way (unlike data, which is language independent assuming we stick to our open requirements).
the standard metadata fields seems appropriate: (e.g. Background, Description, Implementation, Installation and Usage Notes).

We can use keywords/tags to support custom searches etc for models.

tnaumann commented 5 years ago

A few considerations:

Re: reproducibility. There's a much larger discussion to be had here, but I think it's necessary to differentiate between a training process (i.e., the code that specifies how a model is trained) and a model that was trained (i.e., the artifact generated by training). Given a training process it is not necessarily possible to identically recreate the same model, even if one is diligent; e.g., when using GPUs, most frameworks don't guarantee determinism (even if seeds are appropriately specified) [1].
Re: separate submission type. Code is data! Or it can be depending on perspective, and possibly feelings about Lisp. More seriously though, I think there's a reasonable argument that a model could be considered data. In some sense, a model is an object in some software framework that is serialized, possibly with an included runtime. With a runtime, the underlying intention might be to run it directly (e.g., like any other binary executable). However, I suspect this is less often what will be uploaded since source code is typically provided as well. Without a runtime, the underlying intention is to load this model into a framework and use it, similar to many other data types. It would seem the real reason to create a separate submission type is to 1) provide specific guidance with respect to submission expectations or 2) identify this type of submission for things like differing templates or optimized storage.
Re: a use case. I recently submitted a set of clinical embeddings generated by BERT [2]. Our hope with this project is that the embeddings are useful to other researchers, and save them a long time in generating the same set of resources. My assumption is that these will be loaded by other software, I submitted these as data, which seems reasonable. I did an (admittedly) terrible job providing the requested documentation---in part because I didn't want to just copy any paste the corresponding arXiv paper... Once that's resolved though, I hope they'll be included.
Re: input requirements. In principle, it would be nice if all inputs are publicly-available. That being said, I can easily imagine a situation where derived artifacts have intrinsic value, even if they cannot be reproduced. Consider, e.g., a situation where I work with Hospital X to train word embeddings. I may not be able to make the underlying identified data available via PhysioNet, but if I could make the embeddings available, it's very likely that someone else could derive value from them and/or provide comparison to other embeddings.
Re: free open-source tools. Again, in principle I agree with this. That being said, I could easily imagine a situation where derived artifacts have intrinsic value. Imagine running MIMIC through something like Amazon Comprehend Medical and uploading the results to PhysioNet. Obviously, their code can't be reproduced, but there's value in this if only because other researchers don't now need to spend money on running the same thing.

All this is really to say that I currently am leaning more "data" than "software" for most models. However, if others are leaning "software" over "data" then that might be a reason alone to have another category so that they all get grouped together at least.

[1] https://stackoverflow.com/questions/50744565/how-to-handle-non-determinism-when-training-on-a-gpu [2] https://github.com/google-research/bert

tompollard commented 5 years ago

@jraffa An example of a model distributed on PhysioNet is: https://physionet.org/works/MIMICIIIDerivedDataRepository/files/approved/what-is-in-a-note/

jraffa commented 5 years ago

Thanks @tompollard

A couple comments:

I worry about the models on PhysioNet being out of date (e.g., the author makes some refinement to their model, but does not update PhysioNet's version). Is the purpose of this to archive the 'paper version' of the model or to be an archive of the model itself? I believe the two instances for methods papers where I have software with the paper, the journal ended up hosting the paper's version of the code themselves.
In the case you posted, the authors have a script on their GitHub repo which downloads the word2vec file from their website. Presumably they can't use PhysioNet because that version is password protected.
I think it's important to look at what others are doing in this area. I believe in computer vision they share their models quite frequently. E.g. there's a list of them here, https://pjreddie.com/darknet/yolo/
We should talk about how we are going to distribute GOSSIS.

tompollard commented 5 years ago

Thanks Jesse, useful comments. I've added a couple of thoughts below:

I worry about the models on PhysioNet being out of date (e.g., the author makes some refinement to their model, but does not update PhysioNet's version). Is the purpose of this to archive the 'paper version' of the model or to be an archive of the model itself?

Eventually I'd like objects shared through PhysioNet to be "primary" research outputs, so published projects should stand alone. If someone develops an updated version of a model and they want to share it, then they should create a new version of the project (using the functionality implemented in #311). If a paper supplements the model, then it should provide a citation to the archived version.

In the case you posted, the authors have a script on their GitHub repo which downloads the word2vec file from their website. Presumably they can't use PhysioNet because that version is password protected.

There's nothing stopping us from allowing approved users from pulling code from a protected repository, and this is something that we'd like to build into a physionet package. In this specific case, I can't say why the authors download data from a different source, but it shouldn't be necessary.

I think it's important to look at what others are doing in this area. I believe in computer vision they share their models quite frequently. E.g. there's a list of them here, https://pjreddie.com/darknet/yolo/

Yep, good point, more research needed!

We should talk about how we are going to distribute GOSSIS.

Hopefully through PhysioNet! Is it data, software, or a model ;) ?

tnaumann commented 5 years ago

In the case you posted, the authors have a script on their GitHub repo which downloads the word2vec file from their website. Presumably they can't use PhysioNet because that version is password protected.

@jraffa The code was created uploaded before we posted the vectors to Physionet and we forgot to update it. When pulling things down from Physionet, specifying --user "$USER" --ask-password will allow you to authenticate and pull them down. I've updated this in the repo.

MIT-LCP / physionet-build

Editorial guidelines for machine learning results #332