freejstone / CONGA

Implementation of the CONGA algorithm for a combined open and narrow search between experimental spectra and a peptide database.
MIT License

Question: What is the difference between *.target.txt and *.target_mods.txt ? #26

Closed by mriffle 1 year ago

mriffle commented 1 year ago

*.target.txt contains these fields:

score   delta_mass      peptide rank    search_file     charge  spectrum_neutral_mass   scan    protein modification_info

*.target_mods.txt contains these fields:

score   delta_mass      peptide rank    search_file     charge  spectrum_neutral_mass   scan    protein originally_discovered   above_group_threshold   modification_info

So the target_mods.txt file contains two columns (originally_discovered and above_group_threshold) not found in target.txt.

Is there any other difference between these files? Do they contain the same PSMs?

freejstone commented 1 year ago

The .target.txt file is a peptide-level list with rigorous FDR control, while .target_mods.txt provides an auxiliary PSM-level list for which we do not promise FDR control.

In other words, you should find more rows in the .target_mods.txt file. The .target_mods.txt contains the two extra columns since we indicate whether the PSM was part of the original discovery list in the .target.txt (originally_discovered) and whether it scored sufficiently high (above_group_threshold), giving the user some indication of confidence behind these PSMs.
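To get a feel for how the two flag columns split the rows, something like the following could be used. It is only a minimal sketch: it assumes the file is tab-delimited with a header row, as the field listings above suggest, and the file name in the usage comment is hypothetical.

```python
import csv
from collections import Counter


def read_rows(lines):
    """Parse tab-delimited lines (header row first) into a list of dicts."""
    return list(csv.DictReader(lines, delimiter="\t"))


def flag_counts(rows):
    """Count rows per (originally_discovered, above_group_threshold) pair."""
    return Counter(
        (row["originally_discovered"], row["above_group_threshold"])
        for row in rows
    )


# Usage (hypothetical file name):
# with open("conga.target_mods.txt", newline="") as handle:
#     print(flag_counts(read_rows(handle)))
```

The counts are keyed on the raw text of the two columns, so no assumption is made about how the flags are spelled in the file.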

mriffle commented 1 year ago

Right now I'm only processing the .target.txt file for PSMs that I import into Limelight. Is that what you'd recommend?

freejstone commented 1 year ago

I think .target_mods.txt is the preferable file for the user, since it contains at least the PSMs in .target.txt anyhow.

I should add that the extra columns I mention here (localized_peptide, localized_better, open_mod_localization), which can be found in the .target_mods.txt file, require the mzML file used during searching as an additional input. As an example, I could use the following command:

python -m CONGA --FDR_threshold 0.1 --overwrite T --spectrum_files extra_tests/OR20070924_S_mix7_02.mzML extra_tests/dcy_full_0.tide-search.txt extra_tests/open_top5_full_0.tide-search.txt

Details on the --spectrum_files option can be found over on readthedocs (I don't think you will need it, but just in case).

I will send you the files in the command above via Slack. FYI, these MS/MS scans are derived from a well-controlled experiment (the ISB18 data set) and are generated from only a handful of proteins (which is why I set the FDR threshold to something stupidly large like 0.1, just for demonstration purposes).
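If the command needs to be assembled from a script, a small helper along these lines might work. It is a sketch only: conga_command is a hypothetical helper, not part of CONGA, and it uses only the flags that appear in the command above (--FDR_threshold, --overwrite T, --spectrum_files), followed by the search result files.

```python
def conga_command(search_files, fdr_threshold, spectrum_file=None, overwrite=False):
    """Build the argument list for a CONGA run (hypothetical helper)."""
    args = ["python", "-m", "CONGA", "--FDR_threshold", str(fdr_threshold)]
    if overwrite:
        # The thread passes the literal value "T" to --overwrite.
        args += ["--overwrite", "T"]
    if spectrum_file is not None:
        # Only needed for the localization columns in .target_mods.txt.
        args += ["--spectrum_files", spectrum_file]
    return args + list(search_files)


# Usage: pass the list to subprocess.run(..., check=True) to launch the run.
# import subprocess
# subprocess.run(conga_command([...], 0.1), check=True)
```

Building a list rather than a single string avoids shell quoting problems if any file path contains spaces.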

mriffle commented 1 year ago

How would you recommend I filter the results, by default, when showing the results to users in Limelight? For example, with percolator output, I can limit the results to only peptides with at least one PSM with a q-value <= 0.01 and a peptide-level q-value <= 0.01. Is there an analogous set of filters you can think of for showing results from .target_mods.txt?

freejstone commented 1 year ago

CONGA does not report q-values, mainly so that users cannot cheat in their analysis by inspecting their results post hoc. Instead, CONGA takes an --FDR_threshold option up front, which controls the rate of false discoveries and ultimately the size of the output.

I would say that users would want to see all rows in .target_mods.txt where above_group_threshold is True. This gives the user some confidence in these matches.
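As a rough illustration of that default filter, the rows could be selected as below. This is a sketch under assumptions: the file is taken to be tab-delimited with a header row, and the flag is assumed to be written as the text True or False (compared case-insensitively here); the file name in the usage comment is hypothetical.

```python
import csv


def confident_rows(lines):
    """Keep only rows whose above_group_threshold flag reads as true."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        row
        for row in reader
        # Assumption: the flag is serialized as the text "True"/"False".
        if row["above_group_threshold"].strip().lower() == "true"
    ]


# Usage (hypothetical file name):
# with open("conga.target_mods.txt", newline="") as handle:
#     kept = confident_rows(handle)
```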

mriffle commented 1 year ago

The Limelight XML converter has been updated to read the .target_mods.txt file and to use above_group_threshold as a default filter when presenting results (users can turn the filter off).