Closed mriffle closed 1 year ago
The .target.txt file is a peptide-level list with rigorous FDR control, while .target_mods.txt provides an auxiliary PSM-level list for which we do not promise FDR control. In other words, you should find more rows in the .target_mods.txt file. The .target_mods.txt file contains two extra columns that indicate whether the PSM was part of the original discovery list in .target.txt (originally_discovered) and whether it scored sufficiently high (above_group_threshold), giving the user some indication of the confidence behind these PSMs.
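To make the role of the two extra columns concrete, here is a minimal sketch that tallies them for a .target_mods.txt-style table. It assumes the file is tab-delimited and that the flags are written as the strings True/False; the scan and peptide columns and the sample rows are hypothetical, made up purely for illustration.

```python
import csv
import io

# Toy stand-in for a .target_mods.txt file. ASSUMPTIONS: tab-delimited layout,
# flags stored as "True"/"False", and hypothetical "scan"/"peptide" columns.
SAMPLE_MODS = (
    "scan\tpeptide\toriginally_discovered\tabove_group_threshold\n"
    "101\tPEPTIDEK\tTrue\tTrue\n"
    "102\tPEPTIDEK\tFalse\tTrue\n"
    "103\tSAMPLER\tFalse\tFalse\n"
)

def read_rows(handle):
    """Read a tab-delimited table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))

def is_true(value):
    """Interpret a textual flag such as 'True' or 'true' as a boolean."""
    return value.strip().lower() == "true"

def summarize_mods(rows):
    """Count all PSM rows and how many carry each of the two extra flags."""
    return {
        "total": len(rows),
        "originally_discovered": sum(is_true(r["originally_discovered"]) for r in rows),
        "above_group_threshold": sum(is_true(r["above_group_threshold"]) for r in rows),
    }

summary = summarize_mods(read_rows(io.StringIO(SAMPLE_MODS)))
print(summary)
```

For a real file, the io.StringIO wrapper would be replaced by an opened file handle.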
Right now I'm only processing the .target.txt file for PSMs that I import into Limelight. Is that what you'd recommend?
I think the .target_mods.txt file is the preferable one for the user, since it will contain at least the PSMs in the .target.txt file anyhow.
I should add that the extra columns I mention here (localized_peptide, localized_better, open_mod_localization), which can be found in the .target_mods.txt file, require the mzML file used during searching as an additional input. As an example, I could use the following command:
python -m CONGA --FDR_threshold 0.1 --overwrite T --spectrum_files extra_tests/OR20070924_S_mix7_02.mzML extra_tests/dcy_full_0.tide-search.txt extra_tests/open_top5_full_0.tide-search.txt
Details on the --spectrum_files option can be found over on readthedocs (I don't think you will need it, but just in case).
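If the command needs to be launched from a script, it can be assembled as an argument list. This is only a sketch: the helper name build_conga_command is hypothetical, and it uses nothing beyond the options and argument order shown in the command above.

```python
import shlex

def build_conga_command(fdr_threshold, spectrum_file, search_files):
    """Assemble the CONGA invocation as an argument list.

    Hypothetical helper: it mirrors the example command from this thread
    and does not use any option beyond those shown there.
    """
    cmd = [
        "python", "-m", "CONGA",
        "--FDR_threshold", str(fdr_threshold),
        "--overwrite", "T",
        "--spectrum_files", spectrum_file,
    ]
    # The search result files are passed as positional arguments at the end.
    cmd.extend(search_files)
    return cmd

cmd = build_conga_command(
    0.1,
    "extra_tests/OR20070924_S_mix7_02.mzML",
    [
        "extra_tests/dcy_full_0.tide-search.txt",
        "extra_tests/open_top5_full_0.tide-search.txt",
    ],
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) to run the command without going through a shell.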
I will send you the files in the command above via Slack. FYI these MS/MS scans are derived from a well-controlled experiment (the ISB18 data set), which is generated from only a handful of proteins (which is why I set the FDR threshold to something stupidly large like 0.1 -- just for demonstration purposes).
How would you recommend I filter the results, by default, when showing the results to users in Limelight? For example, with percolator output, I can limit the results to only peptides with at least one PSM with a q-value <= 0.01 and a peptide-level q-value <= 0.01. Is there an analogous set of filters you can think of for showing results from .target_mods.txt?
CONGA does not report q-values, mainly so that users do not cheat in their analysis by inspecting their results post hoc. Instead, CONGA takes an --FDR_threshold option at the start, which controls the rate of false discoveries and ultimately the size of the output.
I would say that users would like to see all rows in the .target_mods.txt file with above_group_threshold being True. This gives the user some confidence in these matches.
The Limelight XML converter has been updated to read the .target_mods.txt file and to use above_group_threshold as a default filter when presenting results (the user can turn the filter off).
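A default-on filter of this kind could look roughly like the sketch below. This is not the Limelight converter's actual code; the function name filter_psms, the tab-delimited layout, the True/False string encoding, and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Toy .target_mods.txt-style content. ASSUMPTIONS: tab-delimited, flags as
# "True"/"False"; the "scan" and "peptide" columns are hypothetical.
SAMPLE = (
    "scan\tpeptide\tabove_group_threshold\n"
    "201\tELVISK\tTrue\n"
    "202\tLIVESK\tFalse\n"
    "203\tELVISK\tTrue\n"
)

def filter_psms(rows, require_above_threshold=True):
    """Keep PSMs flagged above_group_threshold unless the filter is switched off."""
    if not require_above_threshold:
        return list(rows)
    return [
        row for row in rows
        if row["above_group_threshold"].strip().lower() == "true"
    ]

all_rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))

default_view = filter_psms(all_rows)  # filter on by default
unfiltered_view = filter_psms(all_rows, require_above_threshold=False)

print(len(default_view), len(unfiltered_view))
```

Keeping the filter as a keyword argument with a True default mirrors the behaviour described above: confident PSMs are shown first, and the remaining rows stay available when the user turns the filter off.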
*.target.txt contains these fields:
*.target_mods.txt contains these fields:
So the target_mods.txt file contains two columns (originally_discovered and above_group_threshold) not found in target.txt.
Are there any other differences between these files? Do they contain the same PSMs?