cognoma / core-service

Cognoma Core API

Report figshare data version in notebook output #44

Open awm33 opened 7 years ago

awm33 commented 7 years ago

Track which version of the data (figshare version or cancer-data SHA) was used for a classifier.

awm33 commented 7 years ago

@dhimmel and/or @cgreene Do you have any thoughts on the best way to handle versioning within the data loader? We currently use master from the cancer-data repo and a hard-coded URL for the mutation data stored on figshare.

dhimmel commented 7 years ago

The data from figshare has versions. Therefore, it'd be ideal to specify a version and then download everything we need corresponding to that version. This is what machine-learning currently does.

The cognoml package has code for retrieving figshare data (currently using a class, previously via functions). We were hoping to move figshare logic to cognoma/figshare (although we never decided what exactly to do).

What data is needed from GitHub? We should just upload that to figshare so it can use the common versioning system.
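For illustration, the "specify a version, then download everything for that version" idea could be sketched roughly as below. This is not the cognoml or machine-learning code. It assumes the public figshare v2 API layout (`/articles/{id}/versions/{n}`) and that each file entry in the returned JSON carries `name` and `download_url` keys. The helper names and the article ID in the usage note are hypothetical.

```python
FIGSHARE_API = "https://api.figshare.com/v2"  # assumed public API root


def versioned_article_url(article_id: int, version: int) -> str:
    """Return the API URL that describes one pinned version of an article."""
    if version < 1:
        raise ValueError("figshare versions start at 1")
    return f"{FIGSHARE_API}/articles/{article_id}/versions/{version}"


def file_urls(article_metadata: dict) -> dict:
    """Map file name -> download URL from the decoded JSON of one article version.

    Assumes each entry under "files" has "name" and "download_url" keys.
    """
    return {f["name"]: f["download_url"] for f in article_metadata.get("files", [])}
```

A loader would fetch `versioned_article_url(article_id, 6)` (with whatever the real article ID is), decode the JSON, and download every URL in `file_urls(...)`, so all files come from the same figshare version.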

cgreene commented 6 years ago

@dhimmel / @gwaygenomics : is this complete? I think that the ml-workers appear to be downloading whatever the latest figshare version is. Does that get reported to the users?

dhimmel commented 6 years ago

> Does that get reported to the users?

I don't think it does. I am not sure whether core-service is even storing which figshare version is loaded. The source code for downloading the data is:

https://github.com/cognoma/core-service/blob/b9b2e4f37ac5b250d53ba30c08d94bf38d89c2dd/api/management/commands/acquiredata.py#L21-L39

So it's using the latest from GitHub for all files besides mutation-matrix.tsv.bz2, for which it hard-codes a link to the figshare version 6 file. Instead, I think core-service should pin a specific figshare version and GitHub commit. If we note the core-service commit hash in the output notebook, would that be sufficient to look up the data versions? (That assumes the database is reloaded whenever the core-service codebase gets updated; I'm not sure it is.)
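As a rough sketch of pinning the GitHub side, one option is to build raw-file URLs from a full commit SHA rather than a branch name. The function name and the example path below are hypothetical; the only assumption about GitHub is the usual `raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}` layout.

```python
import re

RAW_GITHUB = "https://raw.githubusercontent.com"
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_github_url(owner: str, repo: str, commit: str, path: str) -> str:
    """Build a raw-file URL tied to one commit, rejecting branch names like 'master'."""
    if not _FULL_SHA.fullmatch(commit):
        raise ValueError("expected a full 40-character commit SHA, not a branch name")
    return f"{RAW_GITHUB}/{owner}/{repo}/{commit}/{path.lstrip('/')}"
```

Rejecting anything that is not a full SHA means a loader written this way cannot silently drift back to "latest from master".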

BTW, the figshare data has been downloaded 41,471 times. Either people are using this a lot or, more likely, we're requesting it an insane number of times :smile_cat:

cgreene commented 6 years ago

If we could reconstruct those URLs and put them into the notebook template, that's probably the best way. We'd like users to be able to reproduce the analysis and I think this key ingredient (the exact right data) is missing.
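One hypothetical shape for this: render the exact data URLs into a markdown cell that the notebook template includes. The sketch below builds a plain dict following the nbformat v4 cell layout (`cell_type`, `metadata`, `source`). The function name and heading text are assumptions, not part of any existing Cognoma template.

```python
def provenance_cell(figshare_url: str, github_urls: list) -> dict:
    """Return an nbformat-v4-style markdown cell listing the exact data sources."""
    lines = ["## Data provenance", "", f"- figshare: {figshare_url}"]
    lines += [f"- GitHub: {url}" for url in github_urls]
    return {"cell_type": "markdown", "metadata": {}, "source": "\n".join(lines)}
```

The returned dict could be appended to a notebook's `cells` list before the notebook is written out, so anyone reading the output can see which data the classifier was built from.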