PyProphet 2 - Githubissues

grosenberger commented 6 years ago

This PR contains the following changes:

General:

CLI implemented using Python Click
Support for Python 3
Integration with Travis-CI

Data formats:

Support for OSW SQLite files
Legacy support for OpenSWATH TSV files: Only MS2-level scoring available
Export of legacy TSV files
Export of matrix files
Export of score plots

Semi-supervised learning:

Learning and scoring on MS2-, MS1- and transition-levels supported.

Statistics:

Estimation of pi0 using Storey's approach by sampling a specified lambda range (validated against Bioconductor/qvalue)
Estimation of q-value using Storey's approach (numerically slightly different to mProphet; validated against Bioconductor/qvalue)
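
For readers unfamiliar with the two statistics items above, here is a minimal NumPy sketch of the general idea. It is not the code from this PR: the function names, the default lambda grid, and the choice to average pi0 over the grid (rather than fit a smoother as Bioconductor/qvalue can) are illustrative assumptions.

```python
import numpy as np


def estimate_pi0(pvalues, lambda_start=0.05, lambda_stop=1.0, lambda_step=0.05):
    """Storey-style pi0 estimate sampled over a lambda range.

    For each lambda, pi0(lambda) = #{p > lambda} / (m * (1 - lambda)).
    Assumption: the per-lambda estimates are simply averaged; this is a
    simplification, not the smoother used by Bioconductor/qvalue.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    lambdas = np.arange(lambda_start, lambda_stop, lambda_step)
    pi0_by_lambda = np.array(
        [(p > lam).sum() / (m * (1.0 - lam)) for lam in lambdas]
    )
    # Clamp to [0, 1], since pi0 is a proportion.
    return float(min(1.0, max(0.0, pi0_by_lambda.mean())))


def estimate_qvalues(pvalues, pi0):
    """q-values from p-values: q_(i) = min over j >= i of pi0 * m * p_(j) / j."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    q_sorted = pi0 * m * p[order] / ranks
    # Enforce monotonicity by taking the running minimum from the largest p-value down.
    q_sorted = np.minimum.accumulate(q_sorted[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)  # restore the input order
    return q
```

Given an array of target p-values `p`, `estimate_qvalues(p, estimate_pi0(p))` returns q-values in the same order as the input.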

IPF:

IPF scoring
Native computation of posterior error probabilities (QVALITY dependency removed)

Jumbo-PyProphet:

Merging and/or subsampling of multiple OSW files
Assessment of peptide and protein-level error rates in run-specific, experiment-wide and global contexts

Other:

PyProphet scoring can be swapped for Percolator using the PercolatorAdapter from OpenMS/develop.

Limitations and future improvements:

Merging can be slow and needs to be improved
Improve robustness for exotic combinations
Some components (e.g. sqMass filtering) need more testing and validation

hroest commented 6 years ago

============== 7 failed, 35 passed, 1 warnings in 878.86 seconds ===============

grosenberger commented 6 years ago

The regression tests need some further fixes...

hroest commented 6 years ago

> After subsampled learning, the main limitation remains the application of the scores. Currently, the full table, which can contain hundreds of runs and millions of features, is loaded into memory and scored. This requires lots of memory and would be the first thing to optimize, e.g. by iterative scoring of the runs.

I agree, once we have the weights why not load chunks of data (and only the scores, not the whole meta data) into memory?
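
A rough sketch of what chunked scoring could look like, assuming the learned classifier reduces to a linear weight vector. The `features` and `d_scores` tables, their columns, and the function name are hypothetical and are not the OSW schema or PyProphet's implementation; only the score columns are read, and only `chunk_size` rows are held in memory at a time.

```python
import sqlite3

import numpy as np


def score_in_chunks(conn, weights, score_columns, chunk_size=10000):
    """Apply linear weights to score columns chunk by chunk.

    Hypothetical schema: a `features` table with an integer `feature_id`
    and one REAL column per entry in `score_columns`. Results are written
    to a hypothetical `d_scores` table.
    """
    w = np.asarray(weights, dtype=float)
    # Create the output table before the read cursor is opened.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS d_scores "
        "(feature_id INTEGER PRIMARY KEY, d_score REAL)"
    )
    cols = ", ".join(score_columns)  # column names are trusted in this sketch
    read = conn.cursor()
    read.execute(f"SELECT feature_id, {cols} FROM features ORDER BY feature_id")
    write = conn.cursor()
    while True:
        rows = read.fetchmany(chunk_size)
        if not rows:
            break
        # Keep IDs as Python ints so large identifiers are not rounded.
        ids = [row[0] for row in rows]
        block = np.array([row[1:] for row in rows], dtype=float)
        d_scores = block @ w  # one discriminant score per feature
        write.executemany(
            "INSERT OR REPLACE INTO d_scores VALUES (?, ?)",
            zip(ids, d_scores.tolist()),
        )
    conn.commit()
```

Peak memory then depends on `chunk_size` rather than on the number of runs, at the cost of more round trips to the database.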

hroest commented 6 years ago

This is a really huge PR; basically it changes everything. Maybe instead we can try this: can you comment on what does not change, e.g. are simple legacy workflows still supported, and do they give the same results (e.g. running pyprophet on a single file)?

grosenberger commented 6 years ago

> I agree, once we have the weights why not load chunks of data (and only the scores, not the whole meta data) into memory?

The optimization of skipping the metadata, at least, is already implemented. However, the tables are still too big, so further optimization is required.

> This is a really huge PR; basically it changes everything. Maybe instead we can try this: can you comment on what does not change, e.g. are simple legacy workflows still supported, and do they give the same results (e.g. running pyprophet on a single file)?

Legacy TSV-based workflows are still supported. OpenSWATH TSV input will give results very similar to before. However, there are some slight changes due to the new functions, including lambda and PEP estimation. Further, I changed the implementation of the statistical functions, which results in numerical differences. I validated the new functions against Bioconductor/qvalue, but I think I should point out the differences from the old implementation more clearly.

hroest commented 6 years ago

I am basically in favour, although I did not do an in-depth code review of all the changes (there are too many). This looks like a great improvement, and from my side we can go ahead and merge.

grosenberger commented 6 years ago

Great, thanks. If there are no further comments, I will merge tonight.

PyProphet / pyprophet

PyProphet 2 #16