Filter retention_length required a data column retention_length

mofhu commented 5 years ago

Hi there,

First, thank you for kindly opening the repo to the whole community! I'm adapting DART-ID for Sequest and Byonic output. When I ran the modified code with all required columns (i.e. sequence: Modified sequence; raw_file: Raw file; retention_time: Retention time; pep: PEP), I ran into the error below:

dart_id -c input_wrappers/config-no-rt-length.yaml

2019-05-24 15:48:17 [ERROR]  Filter retention_length required a data column retention_length, but this was not found in the input dataframe.
Traceback (most recent call last):
  File "/Users/mofrankhu/.pyenv/versions/3.7.2/bin/dart_id", line 11, in <module>
    load_entry_point('dart-id', 'console_scripts', 'dart_id')()
  File "/DART-ID/dart_id/update.py", line 355, in main
    df, df_original = process_files(config)
  File "/DART-ID/dart_id/converter.py", line 372, in process_files
    df = filter_psms(df, config)
  File "/DART-ID/dart_id/converter.py", line 281, in filter_psms
    raise ConfigFileError('Filter {} required a data column {}, but this was not found in the input dataframe.'.format(f['name'], j))
dart_id.exceptions.ConfigFileError: Filter retention_length required a data column retention_length, but this was not found in the input dataframe.

I added a column with a fixed RT length (e.g. 0.1) to avoid this error, and the program seemed to work.
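For anyone hitting the same error, the workaround can be sketched with pandas as below. This is only an illustration: the toy rows are made up, and the header "Retention length" is an assumption that has to match whatever the retention_length entry in your config points to.

```python
import pandas as pd

# Toy PSM table using the MaxQuant-style headers listed above (values are made up).
psms = pd.DataFrame({
    "Modified sequence": ["_PEPTIDEK_", "_SAMPLER_"],
    "Raw file": ["run1", "run1"],
    "Retention time": [12.3, 45.6],
    "PEP": [0.001, 0.02],
})

# Workaround: add a constant column so the retention_length filter
# finds the column it expects. 0.1 is the placeholder value from this
# report, not a measured peak width. The header name is an assumption.
psms["Retention length"] = 0.1
```

The placeholder carries no information, so any filter that uses it will either keep or drop every row.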

I guess there might be some confusion in the comments of the sample config.yaml file. Could you please help me confirm which columns the input file should contain in the current version of DART-ID?

However, since standard PD output at the PSM level does not contain RT length, could you please also comment on what this (Retention length in MaxQuant evidence.txt) means, so that I can find an equivalent in PD output?

BTW, the wrapper I used is committed to my fork of this repo: https://github.com/mofhu/DART-ID. Please feel free to use it if you would like to.

Many thanks! Mo

atc3 commented 5 years ago

Hi Mo,

The configuration file actually provides column re-mapping functionality, so you don't have to rename columns manually. Simply set your configuration file like this:

# column mappings for Sequest
col_names:
  sequence: "Annotated Sequence"
  raw_file: "Spectrum File"
  retention_time: "RT [min]"
  pep: "Percolator PEP"

  # optional columns
  leading_protein: "Master Protein Accessions"
  proteins: "Protein Accessions"

...and same for Byonic.
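As a rough illustration of what such a mapping amounts to, here is a minimal pandas sketch. It is not DART-ID's actual implementation; the toy rows and the variable names (pd_psms, renamed, missing) are hypothetical.

```python
import pandas as pd

# Mapping copied from the config above: internal name -> input column header.
col_names = {
    "sequence": "Annotated Sequence",
    "raw_file": "Spectrum File",
    "retention_time": "RT [min]",
    "pep": "Percolator PEP",
}

# Toy Proteome Discoverer-style PSM rows (values are made up).
pd_psms = pd.DataFrame({
    "Annotated Sequence": ["[K].PEPTIDEK.[A]", "[R].SAMPLER.[G]"],
    "Spectrum File": ["run1.raw", "run2.raw"],
    "RT [min]": [12.3, 45.6],
    "Percolator PEP": [0.001, 0.02],
})

# Invert the mapping so the input headers become the internal names.
renamed = pd_psms.rename(columns={header: name for name, header in col_names.items()})

# Check that every required column is present after renaming.
required = ["sequence", "raw_file", "retention_time", "pep"]
missing = [c for c in required if c not in renamed.columns]
```

An empty missing list means the four required columns were all found under the headers given in the config.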

Could you please help me confirm which columns the input file should contain in the current version of DART-ID?

The four columns at the top (sequence, raw_file, retention_time, and pep) are required.

charge is used to append ion charge states to sequences, if you want to treat different charge states as different peptide species.

proteins and leading_protein are used for the Fido protein inference algorithm.

However, since standard PD output at the PSM level does not contain RT length, could you please also comment on what this (Retention length in MaxQuant evidence.txt) means, so that I can find an equivalent in PD output?

retention_length is what MaxQuant calls the "base peak width", i.e., the time range from when an ion first elutes to when it last elutes. We use this as a quality score in order to filter out poorly retained ions.
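A filter of this kind could look roughly like the sketch below. It assumes that an unusually wide peak marks a poorly retained ion and that widths are in minutes; the function name drop_wide_peaks and the max_width cutoff are hypothetical and are not taken from the DART-ID code.

```python
import pandas as pd

def drop_wide_peaks(df, max_width):
    """Keep rows whose retention_length is at most max_width (hypothetical cutoff)."""
    if "retention_length" not in df.columns:
        # No peak widths available (e.g. PD output): leave the data untouched.
        return df
    return df[df["retention_length"] <= max_width]

# Toy data; the widths are made up.
psms = pd.DataFrame({
    "sequence": ["PEPTIDEK", "SAMPLER", "ANOTHERK"],
    "retention_time": [12.3, 45.6, 30.1],
    "retention_length": [0.2, 4.5, 0.4],
})

kept = drop_wide_peaks(psms, max_width=1.0)
```

Skipping the filter when the column is absent is one possible design choice for input that has no peak-width information.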

I'll document the uses for these columns better, in the configuration files and the documentation on the website. It might also be useful to just rename "retention_length" to something more understandable like "base_peak_width" so non-MaxQuant people can get what's going on.

atc3 commented 5 years ago

Fixed with:

4dc781c, 2d6abc1: Do not apply filters as a default
52ab851: Add descriptions to column mappings in annotated config file
40f3141: Add example configuration file for sequest input data
cb7fa97: Add input format page to documentation site

mofhu commented 5 years ago

Thanks a lot for your kind explanation and the super-fast fix! I'll take a look at retention_length and try to follow up with a possible solution for PD users (I guess that, in principle, their LFQ peak detector could do the same thing).

One open question with the current setup: in the default DDA workflow, PD only analyzes MS2 spectra and reports the time of each PSM; no peak is extracted in the common workflow. I'm not sure whether my current setting of a fixed retention_length will lead to more false positives. As far as my limited statistics goes, Percolator and other methods for calculating FDR (and PEP) rely on differences in parameter values between PSMs, so a fixed column should have minimal resolving power and should at least not make the results worse. If I'm right, a fixed peak-width column might still limit the power of DART-ID. I'll test with your demo evidence.txt from MQ output, altering this column, and see what happens.

Maybe I can do some analysis soon and discuss it with your team at the SCP meeting in Boston next month.

atc3 commented 5 years ago

Including poorly-retained ions is essentially just injecting noise into the model. It might make it harder for the optimizer to descend but the biggest effect is just globally inflating the variance in RT estimates (by a small amount - as most ions are well retained).

So, including these poorly retained ions will probably not change the resulting parameters/variances by much. I think I included that filter just to make it easier for the optimizer.

mofhu commented 5 years ago

Just to note my test on the provided evidence.txt:

Conclusion: (at least for that column) using the RT width or not is not a big issue in this model.

MQ:
filtered df with PEP < 0.01, resulting 42075 rows.
filtered df with PEP < 0.001, resulting 23673 rows.
filtered df with PEP < 0.0001, resulting 11811 rows.
DART-ID:
filtered df with dart_PEP < 0.01, resulting 76967 rows.
filtered df with dart_PEP < 0.001, resulting 56377 rows.
filtered df with dart_PEP < 0.0001, resulting 33163 rows.
filtered df with dart_qval < 0.01, resulting 96460 rows.
filtered df with dart_qval < 0.001, resulting 75698 rows.
filtered df with dart_qval < 0.0001, resulting 47199 rows.
DART-ID with fixed rt-length:
filtered df with dart_PEP < 0.01, resulting 78391 rows. (+1.9%)
filtered df with dart_PEP < 0.001, resulting 57279 rows.
filtered df with dart_PEP < 0.0001, resulting 33603 rows.
filtered df with dart_qval < 0.01, resulting 97926 rows. (+1.5%)
filtered df with dart_qval < 0.001, resulting 77125 rows.
filtered df with dart_qval < 0.0001, resulting 47741 rows.
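
The relative changes for all six thresholds can be recomputed from the counts above with a few lines of Python. The dictionary keys and the helper pct_change are just labels chosen for this sketch.

```python
# Row counts copied from the results above: threshold -> rows kept.
baseline = {
    "dart_PEP < 0.01": 76967,
    "dart_PEP < 0.001": 56377,
    "dart_PEP < 0.0001": 33163,
    "dart_qval < 0.01": 96460,
    "dart_qval < 0.001": 75698,
    "dart_qval < 0.0001": 47199,
}
fixed_rt_length = {
    "dart_PEP < 0.01": 78391,
    "dart_PEP < 0.001": 57279,
    "dart_PEP < 0.0001": 33603,
    "dart_qval < 0.01": 97926,
    "dart_qval < 0.001": 77125,
    "dart_qval < 0.0001": 47741,
}

def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return 100.0 * (new - old) / old

# Percent change per threshold, rounded to one decimal place.
changes = {
    key: round(pct_change(fixed_rt_length[key], baseline[key]), 1)
    for key in baseline
}

for key, value in changes.items():
    print(f"{key}: +{value}%")
```

Across the six thresholds the increase stays between roughly 1.1% and 1.9%, consistent with the conclusion that this column has only a small effect.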

SlavovLab / DART-ID

Filter retention_length required a data column retention_length #4