bigbio / quantms

Quantitative mass spectrometry workflow. Currently supports proteomics experiments with complex experimental designs for DDA-LFQ, DDA-Isobaric and DIA-LFQ quantification.
https://quantms.org
MIT License

proteomicsLFQ fails with empty peptides #266

Open ypriverol opened 1 year ago

ypriverol commented 1 year ago

Description of the bug

I'm processing a dataset using proteomicsLFQ, and it fails when it finds an empty list of peptides in one of the idXMLs.

TOPPBase.cpp(1605): Value of string option 'keep_feature_top_psm_only': true
TOPPBase.cpp(1605): Processing file: 20151220_alr_CompleteHumanProteome_HUVEC_LysC_ETD_fr6.mzML
TOPPBase.cpp(1615):  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
2023-04-07 14:58:20 ProteomicsLFQ:1: Parameters passed to PeakPickerHiRes algorithm
"signal_to_noise" -> "0" (Minimal signal-to-noise ratio for a peak to be picked (0.0 disables SNT estimation!))
"spacing_difference_gap" -> "4" (The extension of a peak is stopped if the spacing between two subsequent data points exceeds 'spacing_difference_gap * min_spacing'. 'min_spacing' is the smaller of the two spacings from the peak apex to its two neighboring points. '0' to disable the constraint. Not applicable to chromatograms.)
"spacing_difference" -> "1.5" (Maximum allowed difference between points during peak extension, in multiples of the minimal difference between the peak apex and its two neighboring points. If this difference is exceeded a missing point is assumed (see parameter 'missing'). A higher value implies a less stringent peak definition, since individual signals within the peak are allowed to be further apart. '0' to disable the constraint. Not applicable to chromatograms.)
"missing" -> "1" (Maximum number of missing points allowed when extending a peak to the left or to the right. A missing data point occurs if the spacing between two subsequent data points exceeds 'spacing_difference * min_spacing'. 'min_spacing' is the smaller of the two spacings from the peak apex to its two neighboring points. Not applicable to chromatograms.)
"ms_levels" -> "[]" (List of MS levels for which the peak picking is applied. If empty, auto mode is enabled, all peaks which aren't picked yet will get picked. Other scans are copied to the output without changes.)
"report_FWHM" -> "false" (Add metadata for FWHM (as floatDataArray named 'FWHM' or 'FWHM_ppm', depending on param 'report_FWHM_unit') for each picked peak.)
"report_FWHM_unit" -> "relative" (Unit of FWHM. Either absolute in the unit of input, e.g. 'm/z' for spectra, or relative as ppm (only sensible for spectra, not chromatograms).)
"SignalToNoise|max_intensity" -> "-1" (maximal intensity considered for histogram construction. By default, it will be calculated automatically (see auto_mode). Only provide this parameter if you know what you are doing (and change 'auto_mode' to '-1')! All intensities EQUAL/ABOVE 'max_intensity' will be added to the LAST histogram bin. If you choose 'max_intensity' too small, the noise estimate might be too small as well.  If chosen too big, the bins become quite large (which you could counter by increasing 'bin_count', which increases runtime). In general, the Median-S/N estimator is more robust to a manual max_intensity than the MeanIterative-S/N.)
"SignalToNoise|auto_max_stdev_factor" -> "3" (parameter for 'max_intensity' estimation (if 'auto_mode' == 0): mean + 'auto_max_stdev_factor' * stdev)
"SignalToNoise|auto_max_percentile" -> "95" (parameter for 'max_intensity' estimation (if 'auto_mode' == 1): auto_max_percentile th percentile)
"SignalToNoise|auto_mode" -> "0" (method to use to determine maximal intensity: -1 --> use 'max_intensity'; 0 --> 'auto_max_stdev_factor' method (default); 1 --> 'auto_max_percentile' method)
"SignalToNoise|win_len" -> "200" (window length in Thomson)
"SignalToNoise|bin_count" -> "30" (number of bins for intensity values)
"SignalToNoise|min_required_elements" -> "10" (minimum number of elements required in a window (otherwise it is considered sparse))
"SignalToNoise|noise_for_empty_window" -> "1e+20" (noise value used for sparse windows)
"SignalToNoise|write_log_messages" -> "true" (Write out log messages in case of sparse windows or median in rightmost histogram bin)
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Progress of 'loading spectra list':
XMLHandler.cpp(170): While loading '20151220_alr_CompleteHumanProteome_HUVEC_LysC_ETD_fr6.mzML': Unhandled attribute 'instrumentConfigurationRef' in 'scan' tag.
-- done [took 01:59 m (CPU), 4.77 s (Wall)] -- 
Progress of 'loading chromatogram list':
-- done [took 0.00 s (CPU), 0.00 s (Wall)] -- 
Progress of 'picking peaks':
-- done [took 0.15 s (CPU), 0.16 s (Wall)] -- 
0 spectra and 0 chromatograms stored.
#Spectra that needed to and could be picked by MS-level:
  MS-level 1: 0 / 2364
  MS-level 2: 0 / 28902
Info: Corrected 28902 precursors.
Precursor correction:
  median        = 2.579661029498784e-10 ppm  MAD = 5.696144254224267e-10
  median (abs.) = 5.26636960313316e-10 ppm  MAD = 3.032558272348794e-10
FASTAContainer.h(404): decoy_   20686   0
Using prefix decoy string 'DECOY_'
Info: using 'Lys-C' as enzyme (obtained from idXML) for digestion.
Peptide identification engine: COMET
Enzyme: Lys-C
Info: using 'full' as enzyme specificity (obtained from idXML) for digestion.
Warning: An empty set of peptide identifications was provided. Output will be empty as well.
<XMLHandler.cpp(170): While loading '20151220_alr_CompleteHumanProteome_HUVEC_LysC_ETD_fr6.mzML': Unhandled attribute 'instrumentConfigurationRef' in 'scan' tag.> occurred 31266 times
TOPPBase.cpp(1605): Value of string option 'mass_recalibration': false
TOPPBase.cpp(1615):  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
2023-04-07 14:58:25 ProteomicsLFQ:1: Parameters passed to MassTraceDetection
"mass_error_ppm" -> "20" (Allowed mass deviation (in ppm).)
"noise_threshold_int" -> "10" (Intensity threshold below which peaks are removed as noise.)
"chrom_peak_snr" -> "3" (Minimum intensity above noise_threshold_int (signal-to-noise) a peak should have to be considered an apex.)
"reestimate_mt_sd" -> "true" (Enables dynamic re-estimation of m/z variance during mass trace collection stage.)
"quant_method" -> "area" (Method of quantification for mass traces. For LC data 'area' is recommended, 'median' for direct injection data. 'max_height' simply uses the most intense peak in the trace.)
"trace_termination_criterion" -> "outlier" (Termination criterion for the extension of mass traces. In 'outlier' mode, trace extension cancels if a predefined number of consecutive outliers are found (see trace_termination_outliers parameter). In 'sample_rate' mode, trace extension in both directions stops if ratio of found peaks versus visited spectra falls below the 'min_sample_rate' threshold.)
"trace_termination_outliers" -> "5" (Mass trace extension in one direction cancels if this number of consecutive spectra with no detectable peaks is reached.)
"min_sample_rate" -> "0.5" (Minimum fraction of scans along the mass trace that must contain a peak.)
"min_trace_length" -> "5" (Minimum expected length of a mass trace (in seconds).)
"max_trace_length" -> "-1" (Maximum expected length of a mass trace (in seconds). Set to a negative value to disable maximal length check during mass trace detection.)
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Progress of 'mass trace detection':
-- done [took 0.15 s (CPU), 0.15 s (Wall)] -- 
Median chromatographic FWHM: 7.29895
TOPPBase.cpp(1605): Value of string option 'targeted_only': true
TOPPBase.cpp(1615):  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
2023-04-07 14:58:27 ProteomicsLFQ:1: Parameters passed to FeatureFinderIdentification algorithm
"candidates_out" -> "" (Optional output file with feature candidates.)
"debug" -> "1000" (Debug level for feature detection.)
"quantify_decoys" -> "true" (Whether decoy peptides should be quantified (true) or skipped (false).)
"min_psm_cutoff" -> "none" (Minimum score for the best PSM of a spectrum to be used as seed. Use 'none' for no cutoff.)
"extract|batch_size" -> "5000" (Nr of peptides used in each batch of chromatogram extraction. Smaller values decrease memory usage but increase runtime.)
"extract|mz_window" -> "10" (m/z window size for chromatogram extraction (unit: ppm if 1 or greater, else Da/Th))
"extract|n_isotopes" -> "2" (Number of isotopes to include in each peptide assay.)
"extract|isotope_pmin" -> "0" (Minimum probability for an isotope to be included in the assay for a peptide. If set, this parameter takes precedence over 'extract:n_isotopes'.)
"extract|rt_quantile" -> "0.95" (Quantile of the RT deviations between aligned internal and external IDs to use for scaling the RT extraction window)
"extract|rt_window" -> "0" (RT window size (in sec.) for chromatogram extraction. If set, this parameter takes precedence over 'extract:rt_quantile'.)
"detect|min_peak_width" -> "0.2" (Minimum elution peak width. Absolute value in seconds if 1 or greater, else relative to 'peak_width'.)
"detect|signal_to_noise" -> "0.8" (Signal-to-noise threshold for OpenSWATH feature detection)
"detect|mapping_tolerance" -> "0" (RT tolerance (plus/minus) for mapping peptide IDs to features. Absolute value in seconds if 1 or greater, else relative to the RT span of the feature.)
"detect|peak_width" -> "36.4948"
"svm|samples" -> "10000" (Number of observations to use for training ('0' for all))
"svm|no_selection" -> "false" (By default, roughly the same number of positive and negative observations, with the same intensity distribution, are selected for training. This aims to reduce biases, but also reduces the amount of training data. Set this flag to skip this procedure and consider all available observations (subject to 'svm:samples').)
"svm|xval_out" -> "" (Output file: SVM cross-validation (parameter optimization) results)
"svm|kernel" -> "RBF" (SVM kernel)
"svm|xval" -> "5" (Number of partitions for cross-validation (parameter optimization))
"svm|log2_C" -> "[-2, 5, 15]" (Values to try for the SVM parameter 'C' during parameter optimization. A value 'x' is used as 'C = 2^x'.)
"svm|log2_gamma" -> "[-3, -1, 2]" (Values to try for the SVM parameter 'gamma' during parameter optimization (RBF kernel only). A value 'x' is used as 'gamma = 2^x'.)
"svm|log2_p" -> "[-15, -12, -9, -6, -3.32193, 0, 3.32193, 6, 9, 12, 15]" (Values to try for the SVM parameter 'epsilon' during parameter optimization (epsilon-SVR only). A value 'x' is used as 'epsilon = 2^x'.)
"svm|epsilon" -> "0.001" (Stopping criterion)
"svm|cache_size" -> "100" (Size of the kernel cache (in MB))
"svm|no_shrinking" -> "false" (Disable the shrinking heuristics)
"svm|predictors" -> "peak_apices_sum,var_xcorr_coelution,var_xcorr_shape,var_library_sangle,var_intensity_score,sn_ratio,var_log_sn_score,var_elution_model_fit_score,xx_lda_prelim_score,var_ms1_isotope_correlation_score,var_ms1_isotope_overlap_score,var_massdev_score,main_var_xx_swath_prelim_score" (Names of OpenSWATH scores to use as predictors for the SVM (comma-separated list))
"svm|min_prob" -> "0.9" (Minimum probability of correctness, as predicted by the SVM, required to retain a feature candidate)
"model|type" -> "symmetric" (Type of elution model to fit to features)
"model|add_zeros" -> "0.2" (Add zero-intensity points outside the feature range to constrain the model fit. This parameter sets the weight given to these points during model fitting; '0' to disable.)
"model|unweighted_fit" -> "false" (Suppress weighting of mass traces according to theoretical intensities when fitting elution models)
"model|no_imputation" -> "false" (If fitting the elution model fails for a feature, set its intensity to zero instead of imputing a value from the initial intensity estimate)
"model|each_trace" -> "false" (Fit elution model to each individual mass trace)
"model:check|min_area" -> "1" (Lower bound for the area under the curve of a valid elution model)
"model:check|boundaries" -> "0.5" (Time points corresponding to this fraction of the elution model height have to be within the data region used for model fitting)
"model:check|width" -> "10" (Upper limit for acceptable widths of elution models (Gaussian or EGH), expressed in terms of modified (median-based) z-scores. '0' to disable. Not applied to individual mass traces (parameter 'each_trace').)
"model:check|asymmetry" -> "10" (Upper limit for acceptable asymmetry of elution models (EGH only), expressed in terms of modified (median-based) z-scores. '0' to disable. Not applied to individual mass traces (parameter 'each_trace').)
"EMGScoring|max_iteration" -> "100" (Maximum number of iterations for EMG fitting.)
"EMGScoring|init_mom" -> "true" (Alternative initial parameters for fitting through method of moments.)
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
RT window size calculated as 145.979 seconds.
Preparing mapping of peptide data...
#Seeds without RT and m/z overlap with identified peptides added: 0
Creating full assay library for debugging.
Writing debug.traml file.
Progress of 'Creating assay library and extracting chromatograms':
FeatureFinderIdentificationAlgorithm.cpp(482): #Transitions: 0
FeatureFinderIdentificationAlgorithm.cpp(500): Extracted 0 chromatogram(s).
FeatureFinderIdentificationAlgorithm.cpp(503): Detecting chromatographic peaks...
Will analyse 0 peptides with a total of 0 transitions 
-- done [took 0.20 s (CPU), 0.20 s (Wall)] -- 
Found 0 feature candidates in total.
0 features left after filtering.
Error: Unexpected internal error (No features provided.)
TOPPBase.cpp(1605): Error occurred in line 204 of file /opt/conda/conda-bld/openms-meta_1678652294070/work/src/openms/source/TRANSFORMATIONS/FEATUREFINDER/ElutionModelFitter.cpp (in function: void OpenMS::ElutionModelFitter::fitElutionModels(OpenMS::FeatureMap&)) !

Command used and terminal output

No response

Relevant files

No response

System information

No response

timosachsenberg commented 1 year ago

@jpfeuffer should we handle this in quantms or proteomicsLFQ? My guess would be that this usually indicates that something went wrong in the analysis and it is OK to fail. For large-scale reanalysis it might make sense to return no quants, though. What is your take on that?

jpfeuffer commented 1 year ago

Hmm yes, a bit tricky but I would say we should handle it in the PLFQ tool. I am not 100% sure if it is possible regarding map alignment. Some algorithms may rely on identifications.

ypriverol commented 1 year ago

I do understand that you can deal with that in quantms. However, if you have a project with 400 files and only 5 of them are wrong, you need to remove those files one by one, which means 5 failed executions and 5 edits of the corresponding SDRF. I think it should be dealt with by PLFQ, which should output a warning listing all the mzMLs with no IDs.

BTW, this happens with the non-tryptic searches: (PXD024364-Lys-C), (Chymotrypsin), and others. Only working with Comet.
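To illustrate the kind of warning suggested above, here is a minimal sketch that scans idXML inputs and lists the ones without any peptide hits. It is not what PLFQ or quantms does. It assumes that identifications appear as `PeptideHit` elements in the idXML, and the helper names `count_peptide_hits` and `find_empty_id_files`, the file names, and the tiny in-memory documents are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_peptide_hits(source):
    """Count PeptideHit elements in an idXML input (path or binary file object)."""
    hits = 0
    # iterparse streams the document, so large files are not loaded at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Strip a namespace prefix such as "{uri}PeptideHit" if one is present.
        if elem.tag.rsplit("}", 1)[-1] == "PeptideHit":
            hits += 1
        elem.clear()  # release finished elements to keep memory use low
    return hits


def find_empty_id_files(sources):
    """Return the names of all inputs that contain no peptide hits."""
    return [name for name, src in sources.items() if count_peptide_hits(src) == 0]


# Hypothetical, heavily reduced stand-ins for real idXML files.
EMPTY = b"<IdXML><IdentificationRun/></IdXML>"
FILLED = (
    b"<IdXML><IdentificationRun><PeptideIdentification>"
    b'<PeptideHit sequence="PEPTIDEK"/>'
    b"</PeptideIdentification></IdentificationRun></IdXML>"
)

sources = {"fr6.idXML": io.BytesIO(EMPTY), "fr7.idXML": io.BytesIO(FILLED)}
for name in find_empty_id_files(sources):
    print(f"Warning: no peptide identifications in {name}")
```

With real data, the dictionary values would be file paths instead of in-memory buffers. Collecting all empty files in one pass means a single report instead of one failed run per bad file.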

jpfeuffer commented 1 year ago

I think @timosachsenberg means filtering the files in Nextflow and simply not passing them into the next step when they fail. I think by now we handle subsets of experimental designs, so it should be possible.
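As a rough sketch of that filtering idea outside Nextflow, the snippet below drops the rows of a design table that refer to skipped spectra files. It makes several assumptions: the design is simplified to a single tab-separated table, the column name `Spectra_Filepath` and the function `drop_runs` are illustrative, and the set of skipped files is already known from an earlier check.

```python
import csv
import io


def drop_runs(design_tsv, skipped_files, column="Spectra_Filepath"):
    """Return the design table text without rows whose spectra file is skipped."""
    reader = csv.DictReader(io.StringIO(design_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Keep only runs that were not flagged as having no identifications.
        if row[column] not in skipped_files:
            writer.writerow(row)
    return out.getvalue()


# Hypothetical two-run design; b.mzML is assumed to have produced no IDs.
design = (
    "Fraction_Group\tFraction\tSpectra_Filepath\tLabel\tSample\n"
    "1\t1\ta.mzML\t1\t1\n"
    "2\t1\tb.mzML\t1\t2\n"
)
print(drop_runs(design, {"b.mzML"}))
```

The filtered table and the matching subset of mzML/idXML files would then be passed on together, so the quantification step never sees a run without identifications.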

timosachsenberg commented 1 year ago

Maybe fixed in https://github.com/OpenMS/OpenMS/pull/6825