PyProphet / pyprophet

PyProphet: Semi-supervised learning and scoring of OpenSWATH results.
http://www.openswath.org
BSD 3-Clause "New" or "Revised" License
29 stars 21 forks source link

Pyprophet protein error #38

Closed sian231 closed 5 years ago

sian231 commented 5 years ago

Hi, I am having some problems when attempting to run pyprophet protein. I have progressed through the OpenSwath Worfklow; merging and scoring (at the ms2 level).

I have ran pyprophet peptide (seemingly without an issue) with:

pyprophet peptide \
--in=merged.osw \
--context=run-specific peptide \
--in=merged.osw \
--context=experiment-wide peptide \
--in=merged.osw --context=global 

However when I try the same with pyrophet protein:

pyprophet protein \
--in=merged.osw \
--context=run-specific peptide \
--in=merged.osw \
--context=experiment-wide peptide \
--in=merged.osw --context=global 

I get the following error:

Traceback (most recent call last):
  File "c:\programdata\anaconda2\lib\runpy.py", line 174, in _run_module_as_main

    "__main__", fname, loader, pkg_name)
  File "c:\programdata\anaconda2\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\ProgramData\Anaconda2\Scripts\pyprophet.exe\__main__.py", line 9, in
<module>
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 722, in
__call__
    return self.main(*args, **kwargs)
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 697, in
main
    rv = self.invoke(ctx)
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 1092, in
 invoke
    rv.append(sub_ctx.command.invoke(sub_ctx))
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 895, in
invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 535, in
invoke
    return callback(*args, **kwargs)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\main.py", line 172,
 in protein
    infer_proteins(infile, outfile, context, parametric, pfdr, pi0_lambda, pi0_m
ethod, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_tra
nsformation, lfdr_adj, lfdr_eps)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\levels_contexts.py"
, line 102, in infer_proteins
    data = data.groupby('run_id').apply(statistics_report, outfile, context, "pr
otein", parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_
pi0, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps).rese
t_index()
  File "c:\programdata\anaconda2\lib\site-packages\pandas\core\groupby\groupby.p
y", line 930, in apply
    return self._python_apply_general(f)
  File "c:\programdata\anaconda2\lib\site-packages\pandas\core\groupby\groupby.p
y", line 936, in _python_apply_general
    self.axis)
  File "c:\programdata\anaconda2\lib\site-packages\pandas\core\groupby\groupby.p
y", line 2273, in apply
    res = f(group)
  File "c:\programdata\anaconda2\lib\site-packages\pandas\core\groupby\groupby.p
y", line 908, in f
    return func(g, *args, **kwargs)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\levels_contexts.py"
, line 16, in statistics_report
    error_stat, pi0 = error_statistics(data[data.decoy==0]['score'], data[data.d
ecoy==1]['score'], parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_
smooth_log_pi0, True, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_ad
j, lfdr_eps)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\stats.py", line 465
, in error_statistics
    error_stat['pep'] = lfdr(target_pvalues, pi0['pi0'], lfdr_trunc, lfdr_monoto
ne, lfdr_transf, lfdr_adj, lfdr_eps)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\stats.py", line 309
, in lfdr
    y = sp.interpolate.spline(myd.support, myd.density, x)
  File "c:\programdata\anaconda2\lib\site-packages\numpy\lib\utils.py", line 101
, in newfunc
    return func(*args, **kwds)
  File "c:\programdata\anaconda2\lib\site-packages\scipy\interpolate\interpolate
.py", line 2919, in spline
    return spleval(splmake(xk, yk, order=order, kind=kind, conds=conds), xnew)
  File "c:\programdata\anaconda2\lib\site-packages\numpy\lib\utils.py", line 101
, in newfunc
    return func(*args, **kwds)
  File "c:\programdata\anaconda2\lib\site-packages\scipy\interpolate\interpolate
.py", line 2828, in splmake
    coefs = func(xk, yk, order, conds, B)
  File "c:\programdata\anaconda2\lib\site-packages\scipy\interpolate\interpolate
.py", line 2752, in _find_smoothest
    p = scipy.linalg.solve(Q, tmp)
  File "c:\programdata\anaconda2\lib\site-packages\scipy\linalg\basic.py", line
216, in solve
    _solve_check(n, info)
  File "c:\programdata\anaconda2\lib\site-packages\scipy\linalg\basic.py", line
31, in _solve_check
    raise LinAlgError('Matrix is singular.')
numpy.linalg.linalg.LinAlgError: Matrix is singular.

Any ideas what could be causing this? Thanks in advance, Sian

grosenberger commented 5 years ago

Hi Sian,

The first command actually combines three independent steps:

pyprophet peptide --in=merged.osw --context=run-specific
pyprophet peptide --in=merged.osw --context=experiment-wide
pyprophet peptide --in=merged.osw --context=global

On protein-level, the analogue would be:

pyprophet protein --in=merged.osw --context=run-specific
pyprophet protein --in=merged.osw --context=experiment-wide
pyprophet protein --in=merged.osw --context=global

Could you please execute these steps individually and let me know where it fails?

Thanks, George

sian231 commented 5 years ago

Hi, so it seems there may be two issues: pyprophet protein --in=merged.osw --context=run-specific reproduces the error as above

pyprophet protein --in=merged.osw --context=experiment-wide runs without an issue

pyprophet protein --in=merged.osw --context=global produces the following:

c:\programdata\anaconda2\lib\site-packages\pyprophet\stats.py:371: RuntimeWarnin
g: invalid value encountered in true_divide
  sens = tp / (num_total - num_null)
c:\programdata\anaconda2\lib\site-packages\pyprophet\stats.py:373: RuntimeWarnin
g: invalid value encountered in less
  sens[sens < 0.0] = 0.0
c:\programdata\anaconda2\lib\site-packages\pyprophet\stats.py:374: RuntimeWarnin
g: invalid value encountered in greater
  sens[sens > 1.0] = 1.0

   qvalue  pvalue  svalue           pep    ...      tn      fp   fn    cutoff
0    0.00     1.0     NaN  1.936552e-07    ...     0.0  5910.0  0.0 -4.212924
1    0.01     NaN     NaN           NaN    ...     NaN     NaN  NaN       NaN
2    0.02     NaN     NaN           NaN    ...     NaN     NaN  NaN       NaN
3    0.05     NaN     NaN           NaN    ...     NaN     NaN  NaN       NaN
4    0.10     NaN     NaN           NaN    ...     NaN     NaN  NaN       NaN
5    0.20     NaN     NaN           NaN    ...     NaN     NaN  NaN       NaN
6    0.30     NaN     NaN           NaN    ...     NaN     NaN  NaN       NaN
7    0.40     NaN     NaN           NaN    ...     NaN     NaN  NaN       NaN
8    0.50     NaN     NaN           NaN    ...     NaN     NaN  NaN       NaN

[9 rows x 12 columns]

Traceback (most recent call last):
  File "c:\programdata\anaconda2\lib\runpy.py", line 174, in _run_module_as_main

    "__main__", fname, loader, pkg_name)
  File "c:\programdata\anaconda2\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\ProgramData\Anaconda2\Scripts\pyprophet.exe\__main__.py", line 9, in
<module>
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 722, in
__call__
    return self.main(*args, **kwargs)
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 697, in
main
    rv = self.invoke(ctx)
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 1092, in
 invoke
    rv.append(sub_ctx.command.invoke(sub_ctx))
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 895, in
invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\programdata\anaconda2\lib\site-packages\click\core.py", line 535, in
invoke
    return callback(*args, **kwargs)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\main.py", line 172,
 in protein
    infer_proteins(infile, outfile, context, parametric, pfdr, pi0_lambda, pi0_m
ethod, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr_monotone, lfdr_tra
nsformation, lfdr_adj, lfdr_eps)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\levels_contexts.py"
, line 105, in infer_proteins
    data = statistics_report(data, outfile, context, "protein", parametric, pfdr
, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, lfdr_truncate, lfdr
_monotone, lfdr_transformation, lfdr_adj, lfdr_eps)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\levels_contexts.py"
, line 36, in statistics_report
    save_report(outfile + "_" + context + "_" + analyte + ".pdf", outfile + ": "
 + context + " " + analyte + "-level error-rate control", data[data.decoy==1]["s
core"], data[data.decoy==0]["score"], stat_table["cutoff"], stat_table["svalue"]
, stat_table["qvalue"], data[data.decoy==0]["p_value"], pi0)
  File "c:\programdata\anaconda2\lib\site-packages\pyprophet\report.py", line 50
, in save_report
    ddensity = gaussian_kde(top_decoys)
  File "c:\programdata\anaconda2\lib\site-packages\scipy\stats\kde.py", line 169
, in __init__
    raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.
grosenberger commented 5 years ago

Did you use a sample-specific library or a larger combined library? E.g. are most of your proteins detectable with at least one peptide? In that case it might be that protein-level error estimates are hard to make because there are only very few false detected proteins. One example would be using DIA-Umpire to generate a library and restrictively filtering it before using OpenSWATH.

Could you please post the PDF report in the experiment-wide context on protein level?

sian231 commented 5 years ago

We used a combined library with decoys, so I can't imagine there being only a few false detected proteins.

I have posted the PDF report - it doesn't look good.

merged.osw_experiment-wide_protein.pdf

grosenberger commented 5 years ago

Does the peak group level (MS2-level) report look as expected?

sian231 commented 5 years ago

I could be better - there is a lot of overlap between target and decoy, but this is large-scale data, so I wasn't hugely concerned when I saw it. I would say the MS1-level looks 'worse'. I've posted both. merged_ms1_report.pdf merged_ms2_report.pdf

grosenberger commented 5 years ago

The MS2-level report looks like expected from TripleTOF data. The MS1-level is not optimal, but has no effect on protein-level.

Which version of OpenMS did you use to generate the PQP files? It might be that there is an issue with the peptide to protein associations. Currently, it is still required that peptides are either proteotypic or map to predetermined protein groups (from protein inference algorithms).

sian231 commented 5 years ago

I used the latest version (OpenMS 2.4). The library does contain non-proteotypic peptides, however they should satisfy the second criteria and map to protein groups. I will try using a library of only proteotypic peptides next. Thanks for your help!

bretttully commented 5 years ago

I'm guessing this is similar.

Library is small, and I expect no overlap between targets and decoys -- these 2,000 PQPs were selected as the best scoring 2,000 from a more comprehensive library.

image

Command looks like (NB: this is a sample independent workflow, rather than a global context workflow):

#!/bin/bash
set -e
echo `pwd`

INPUT_FILES=""
INPUT_FILES+=" ProCan90-M06-07.osw"

for in_fname in ${INPUT_FILES}; do

        echo "********************************************************************************"
        echo "Working on $in_fname"

        echo "********************************************************************************"
        echo "Step 1: score model"
        pyprophet score --in=$in_fname --level=ms1ms2 \
                --pi0_lambda 0.0 0.0001 1e-05

        echo "********************************************************************************"
        echo "Step 2: calculate FDR controls"
        pyprophet peptide \
                --context=run-specific --in=$in_fname \
                --pi0_lambda 0.0 0.0001 1e-05
        pyprophet protein \
                --context=run-specific --in=$in_fname \
                --pi0_lambda 0.0 0.0001 1e-05

        echo "********************************************************************************"
        tsv_fname="${in_fname%.*}.pyprophet.tsv"
        echo "Step 3: backpropagate results and export to TSV: $in_fname --> $tsv_fname"
        pyprophet export --in=$in_fname --out=$tsv_fname --format=legacy_merged

done

Which throws an unexpected error:

********************************************************************************
Working on ProCan90-M06-07.osw
********************************************************************************
Step 1: score model
Info: Learn and apply classifier from input data.
Warning: Column var_mi_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_mi_weighted_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_mi_ratio_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_elution_model_fit_score contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_lag contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_shape contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_sn contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_diff contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_log_trend contains only invalid/missing values. Column will be dropped.
Warning: Column var_sonar_rsq contains only invalid/missing values. Column will be dropped.
Warning: Column var_ms1_mi_score contains only invalid/missing values. Column will be dropped.
Info: Data set contains 1990 decoy and 2000 target groups.
Info: Summary of input data:
Info: 40636 peak groups
Info: 3990 group ids
Info: 26 scores including main score
Info: Semi-supervised learning of weights:
Info: Start learning on 10 folds using 1 processes.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Learning on cross-validation fold.
Info: Finished learning.
Info: Data set contains 1990 decoy and 2000 target groups.
Info: Mean qvalue = 2.412224e-03, std_dev qvalue = 4.371784e-04
Info: Mean svalue = 1.000000e+00, std_dev svalue = 0.000000e+00
Info: Finished scoring and estimation statistics.
Info: Processing input data finished.
Info: Time needed: 00:00:3.7
================================================================================
   qvalue    pvalue  svalue    ...           fp           fn    cutoff
0    0.00  0.000503     1.0    ...     1.005025 -1996.994975  5.542891
1    0.01  0.002513     1.0    ...     5.025126 -1994.974874  3.839992
2    0.02       NaN     NaN    ...          NaN          NaN       NaN
3    0.05       NaN     NaN    ...          NaN          NaN       NaN
4    0.10       NaN     NaN    ...          NaN          NaN       NaN
5    0.20       NaN     NaN    ...          NaN          NaN       NaN
6    0.30       NaN     NaN    ...          NaN          NaN       NaN
7    0.40       NaN     NaN    ...          NaN          NaN       NaN
8    0.50       NaN     NaN    ...          NaN          NaN       NaN

[9 rows x 12 columns]
================================================================================
Info: ProCan90-M06-07.osw written.
Info: ProCan90-M06-07_ms1ms2_report.pdf written.
Info: Total time: 3 seconds and 684 msecs wall time
********************************************************************************
Step 2: calculate FDR controls
================================================================================
   qvalue    pvalue  svalue    ...           fp           fn    cutoff
0    0.00  0.000503     1.0    ...     1.005025 -1996.994975  5.542891
1    0.01  0.002513     1.0    ...     5.025126 -1994.974874  3.839992
2    0.02       NaN     NaN    ...          NaN          NaN       NaN
3    0.05       NaN     NaN    ...          NaN          NaN       NaN
4    0.10       NaN     NaN    ...          NaN          NaN       NaN
5    0.20       NaN     NaN    ...          NaN          NaN       NaN
6    0.30       NaN     NaN    ...          NaN          NaN       NaN
7    0.40       NaN     NaN    ...          NaN          NaN       NaN
8    0.50       NaN     NaN    ...          NaN          NaN       NaN

[9 rows x 12 columns]
================================================================================
================================================================================
   qvalue    pvalue  svalue    ...           fp           fn    cutoff
0    0.00  0.000503     1.0    ...     1.005025 -1996.994975  5.542891
1    0.01  0.002513     1.0    ...     5.025126 -1994.974874  3.839992
2    0.02       NaN     NaN    ...          NaN          NaN       NaN
3    0.05       NaN     NaN    ...          NaN          NaN       NaN
4    0.10       NaN     NaN    ...          NaN          NaN       NaN
5    0.20       NaN     NaN    ...          NaN          NaN       NaN
6    0.30       NaN     NaN    ...          NaN          NaN       NaN
7    0.40       NaN     NaN    ...          NaN          NaN       NaN
8    0.50       NaN     NaN    ...          NaN          NaN       NaN

[9 rows x 12 columns]
================================================================================
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 918, in apply
    result = self._python_apply_general(f)
  File "/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 936, in _python_apply_general
    self.axis)
  File "/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 2273, in apply
    res = f(group)
  File "/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 908, in f
    return func(g, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/pyprophet/levels_contexts.py", line 16, in statistics_report
    error_stat, pi0 = error_statistics(data[data.decoy==0]['score'], data[data.decoy==1]['score'], parametric, pfdr, pi0_lambda, pi0_method, pi0_smooth_df, pi0_smooth_log_pi0, True, lfdr_truncate, lfdr_monotone, lfdr_transformation, lfdr_adj, lfdr_eps)
  File "/opt/conda/lib/python3.6/site-packages/pyprophet/stats.py", line 468, in error_statistics
    error_stat['pep'] = lfdr(target_pvalues, pi0['pi0'], lfdr_trunc, lfdr_monotone, lfdr_transf, lfdr_adj, lfdr_eps)
  File "/opt/conda/lib/python3.6/site-packages/pyprophet/stats.py", line 309, in lfdr
    splinefit = sp.interpolate.make_interp_spline(myd.support, myd.density)
  File "/opt/conda/lib/python3.6/site-packages/scipy/interpolate/_bsplines.py", line 777, in make_interp_spline
    raise ValueError("Expect x to be a 1-D sorted array_like.")
ValueError: Expect x to be a 1-D sorted array_like.

ms1ms2 report: image

PyProphet installed using:

pip install git+https://github.com/PyProphet/pyprophet.git@086bf5de4abc10f8592665ac9b8d72db14ffc7b4
grosenberger commented 5 years ago

@bretttully So the error message is also on protein-level, right? Could you please share the OSW file so I can reproduce the error? Most likely this will require better exception management for such extremely separated distributions.

bretttully commented 5 years ago

File here: https://www.dropbox.com/s/it8r91u4bvynk0m/ProCan90-M06-07.osw?dl=0

grosenberger commented 5 years ago

@bretttully Could you please review the PR above? I will then merge with master and make a new release with the two patches.

bretttully commented 5 years ago

Done. Looks good.