bgruening / galaxytools

:microscope::books: Galaxy Tool wrappers
MIT License

Galaxy-Scikit Learn paper #756

Open jgoecks opened 6 years ago

jgoecks commented 6 years ago

@bgruening @qiagu

To publish I suggest that we perform two analyses:

  1. Reproduce the work in this paper. This is a large analysis that will exercise much of the Sk-Learn functionality in Galaxy. Here are datasets and scripts/Jupyter notebooks for this work.

  2. Perform an analysis that uses genomic analysis pipelines followed by Sk-Learn tools. One idea is doing drug response prediction using cancer cell lines from DepMap. If we were to do all of DepMap (1000+ cell lines) it would be a very large analysis; we could instead focus on a particular cohort (tissue), such as breast cancer cell lines. Another idea is to look at reproducing something in Kipoi, which is focused on functional genomics.

bgruening commented 6 years ago

Thanks for the summary @jgoecks!

qiagu commented 6 years ago

The 1st is the same as the one I shared with @bgruening 2 days ago.

qiagu commented 6 years ago

Things that need to be done, in my mind:

  1. Introduce the new dataset input system (including the column selector, header options and so on) to the old tools.
  2. Add a pre-processing module to the cross-validation pipeline.
  3. Improve the UI of the param_grid input box.
  4. Make random search cross-validation possible.
  5. Write training materials.
  6. Add a reserve index option to the "Load a model and predict" section, and maybe give a column name to the prediction result as well.
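For reference, the pre-processing, param_grid and random-search items correspond roughly to the following in plain scikit-learn. This is only a minimal sketch, not the Galaxy tool code: the synthetic dataset, the choice of `StandardScaler` and `LogisticRegression`, and the step names `scale` and `clf` are all assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real tabular dataset (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# With the scaler inside the pipeline, it is re-fit on the training part
# of every cross-validation split instead of on the full dataset.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Search-space keys use the "<step name>__<parameter>" convention.
# A distribution is sampled from; a list is sampled uniformly.
param_distributions = {
    "clf__C": loguniform(1e-3, 1e2),
    "scale__with_mean": [True, False],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled parameter settings
    cv=3,
    scoring="accuracy",
    random_state=0,
    n_jobs=1,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same `"<step>__<parameter>"` keys would also work in a `param_grid` for `GridSearchCV`, so a single input convention could in principle serve both search types.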

bgruening commented 6 years ago

:+1: !!!

jgoecks commented 5 years ago

@qiagu

Here's my current list:

Edit: The last task isn't as important to me as the other two.

qiagu commented 5 years ago

The current tool set works mostly fine with the Penn datasets. For further improvement, below is my new work list:

jgoecks commented 5 years ago

Thanks @qiagu

IMO none of these are needed for submitting a manuscript. We should all try using the tools a bit and see if there are any huge issues that need to be addressed before submission.

After we finish with the Penn datasets, I would still like to do the DepMap analysis if possible. I will start and share a manuscript draft soon.

qiagu commented 5 years ago

@jgoecks The tests are almost done. I haven't found any huge issues, only conflicting/wrong search parameter combinations, an n_jobs crash, scoring support and other small things, none of which, I'd argue, are big faults of our tools.

yvanlebras commented 5 years ago

Hi everyone,

Just a comment to follow updates and give feedback on ecology-related use cases, following discussions with Björn. My comments may well not be relevant, but just in case...

Testing the glm tool for the first time, I was looking for a way to apply/test my own model on my data. To do so, I thought I had to select "Load a model and predict" for the first parameter and then describe/select my model, but I can't create a model. I was thinking about a way to do that using Galaxy parameters (when tags + column selection, for example) to define something like $response_variable = $observed_variable1 + $observed_variable2 + ... + $observed_variablex + $observed_variable1 * $observed_variabley ..., OR selecting a text file with a line explicitly describing my model. But here the tool "only" asks for a model file, through a zip datatype :(

I understand that this zip datatype is something of a hack to have a way to select a "Python sklearn dump" / object file, but I think we need a way to create such a model / model file through a dedicated Galaxy tool and/or dedicated parameters on the glm tool form.

I hope this is understandable and not out of scope. Thank you for this amazing scikit work!
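To make the text-file idea concrete, here is a minimal sketch of how a one-line model description could be parsed and turned into model inputs. It is purely illustrative and not part of any existing Galaxy tool: the functions `parse_formula` and `design_row`, the column names, and the reading of `*` as a plain product of two columns (unlike R formulas, where `a * b` also adds the main effects) are all assumptions.

```python
def parse_formula(formula):
    """Split 'y = a + b + a * c' into ('y', [('a',), ('b',), ('a', 'c')])."""
    if formula.count("=") != 1:
        raise ValueError("expected exactly one '=' in the formula")
    lhs, rhs = formula.split("=")
    response = lhs.strip()
    if not response:
        raise ValueError("missing response variable")
    terms = []
    for raw_term in rhs.split("+"):
        # A term is one variable, or several variables joined by '*'.
        factors = tuple(name.strip() for name in raw_term.split("*"))
        if not all(factors):
            raise ValueError(f"empty term in formula: {formula!r}")
        terms.append(factors)
    return response, terms


def design_row(record, terms):
    """Compute one feature per term; an interaction is the product of its factors."""
    row = []
    for factors in terms:
        value = 1.0
        for name in factors:
            value *= float(record[name])
        row.append(value)
    return row


# Hypothetical ecology-style columns, used only for illustration.
response, terms = parse_formula(
    "abundance = temperature + depth + temperature * depth"
)
record = {"abundance": 12, "temperature": 2.0, "depth": 3.0}
features = design_row(record, terms)

print(response)
print(terms)
print(features)
```

In a tool form, the same `(response, terms)` structure could equally be assembled from column selectors rather than parsed from text; the parsing step is just the simplest way to show the shape of the data.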