PyProphet / pyprophet

PyProphet: Semi-supervised learning and scoring of OpenSWATH results.
http://www.openswath.org
BSD 3-Clause "New" or "Revised" License
29 stars 21 forks source link

PyProphet 2 #16

Closed grosenberger closed 6 years ago

grosenberger commented 6 years ago

This PR contains the following changes:

General:

Data formats:

Semi-supervised learning:

Statistics:

IPF:

Jumbo-PyProphet:

Other:

Limitations and future improvements:

hroest commented 6 years ago
============== 7 failed, 35 passed, 1 warnings in 878.86 seconds ===============
grosenberger commented 6 years ago

The regression tests need some further fixes...

hroest commented 6 years ago

After subsampled learning, the main limitations remains the application of the scores. Currently, the full table, which can contain hundreds of runs and millions of features, is loaded into memory and scored. This requires lots of memory and would be the first thing to optimize, e.g. by iterative scoring of the runs.

I agree, once we have the weights why not load chunks of data (and only the scores, not the whole meta data) into memory?

hroest commented 6 years ago

This is a really huge PR, basically it changes everything. Maybe instead we can try this: Can you comment on what does not change, e.g. are simple legacy workflows still supported and give the same results (e.g. running pyprophet on a single file) ?

grosenberger commented 6 years ago

I agree, once we have the weights why not load chunks of data (and only the scores, not the whole meta data) into memory?

At least the skipping of metadata optimization is already implemented. However, the tables are still too big, which is why further optimization is required.

This is a really huge PR, basically it changes everything. Maybe instead we can try this: Can you comment on what does not change, e.g. are simple legacy workflows still supported and give the same results (e.g. running pyprophet on a single file) ?

Legacy TSV-based workflows are still supported. OpenSWATH TSV input will result in very similar results as before. However, there are some slight changes, due to the new functions, including lambda and PEP estimation. Further, I changed the implementation of the statistic functions, resulting in numerical differences. I validated the new functions against Bioconductor/qvalue, but I think I should point out the differences to the old implementation better.

hroest commented 6 years ago

I am basically in favour, however I did not do a in-depth code review of all the changes (there are too many). However this looks like a great improvement and from my side we can go ahead and merge

grosenberger commented 6 years ago

Great thanks. If there are no further comments, I will merge tonight.