Closed grosenberger closed 6 years ago
============== 7 failed, 35 passed, 1 warnings in 878.86 seconds ===============
The regression tests need some further fixes...
After subsampled learning, the main limitations remains the application of the scores. Currently, the full table, which can contain hundreds of runs and millions of features, is loaded into memory and scored. This requires lots of memory and would be the first thing to optimize, e.g. by iterative scoring of the runs.
I agree, once we have the weights why not load chunks of data (and only the scores, not the whole meta data) into memory?
This is a really huge PR, basically it changes everything. Maybe instead we can try this: Can you comment on what does not change, e.g. are simple legacy workflows still supported and give the same results (e.g. running pyprophet on a single file) ?
I agree, once we have the weights why not load chunks of data (and only the scores, not the whole meta data) into memory?
At least the skipping of metadata optimization is already implemented. However, the tables are still too big, which is why further optimization is required.
This is a really huge PR, basically it changes everything. Maybe instead we can try this: Can you comment on what does not change, e.g. are simple legacy workflows still supported and give the same results (e.g. running pyprophet on a single file) ?
Legacy TSV-based workflows are still supported. OpenSWATH TSV input will result in very similar results as before. However, there are some slight changes, due to the new functions, including lambda and PEP estimation. Further, I changed the implementation of the statistic functions, resulting in numerical differences. I validated the new functions against Bioconductor/qvalue, but I think I should point out the differences to the old implementation better.
I am basically in favour, however I did not do a in-depth code review of all the changes (there are too many). However this looks like a great improvement and from my side we can go ahead and merge
Great thanks. If there are no further comments, I will merge tonight.
This PR contains the following changes:
General:
Data formats:
Semi-supervised learning:
Statistics:
IPF:
Jumbo-PyProphet:
Other:
Limitations and future improvements: