[Closed] daichengxin closed this issue 1 year ago
@daichengxin can you add a description of how the data was analyzed with MaxQuant, and a link to the original paper?
Thank you for the analysis. Very interesting. Can we maybe run the analysis with looser FDR cutoffs to see if it's more of an identification problem or a quantification problem?
@jpfeuffer a couple of ideas:
In single cell, it looks to me like MBR is doing most of the job for the single-cell runs. Note that we ID most of the peptides in the 20- and 100-cell samples (the complex samples), more than MQ does. Those samples are used as "channels" to look for weak signals in the single-cell files. My point is the following:
The MBR feature, with a matching window of 0.7 min and an alignment window of 20 min, was activated.
Do you think this 20 min window that they use for MBR in MQ could be the difference with us?
Do you think we need to change something in the experimental design? For the MBR we are annotating every sample as a biological replicate to enable MBR.
Yes, if we find out that it's not the identification performance, we will check quantification. But if you only have 100 proteins that pass FDR, you cannot quantify more than that, and it is useless to check quantification.
@daichengxin can we check in the output of MQ for the original project how many peptides from the single-cell samples are quantified/identified using MBR? I think MQ uses a notation like "No MS/MS" or something like that.
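For reference, MQ's peptides.txt has per-sample `Identification type <sample>` columns (they also appear in the scripts later in this thread) that distinguish direct IDs from MBR-transferred ones; "By MS/MS" vs "By matching" are the values I would expect, though that's from memory. A minimal pandas sketch of the count, using a toy table in place of the real peptides.txt:

```python
import pandas as pd

# Toy stand-in for MaxQuant's peptides.txt; in the real file each sample has
# an "Identification type <sample>" column ("By MS/MS" or "By matching").
peptides = pd.DataFrame({
    "Sequence": ["PEPTIDEA", "PEPTIDEB", "PEPTIDEC"],
    "Identification type Single cell 1": ["By matching", "By MS/MS", "By matching"],
})

col = "Identification type Single cell 1"
n_mbr = (peptides[col] == "By matching").sum()
mbr_fraction = n_mbr / peptides[col].notna().sum()
print(n_mbr, round(mbr_fraction, 2))  # prints: 2 0.67
```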
More than half of the unique peptides originated from MBR.
This is what I thought, see @jpfeuffer. The majority of peptides are not coming from IDs but from MBR, so we should look into how we fail to do proper MBR on our side (see my previous two comments).
Still, if they don't pass FDR, there is nothing to quantify. So we have to wait for the ID analysis.
Almost all of the peptides quantified only in MQ are from MBR. However, for quantms these peptides are only quantified in other files rather than in the same file. This suggests we fail to do proper MBR, right?
Can you please add the command you used for quantms?
nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.01 --psm_level_fdr_cutoff 0.01 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume
@jpfeuffer
I don't know. I can't see any signs of transfer in the ProteomicsLFQ logs. Do you have the call to proteomicslfq?
The command with the relaxed FDR is:
nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.15 --psm_level_fdr_cutoff 0.15 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume
Here are the results: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR/
I found the plfq command; it looks correct. One thing is that we only transfer IDs if they occur in more than 50% of the samples. I don't know how MQ does it.
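That 50% rule can be sketched like this (a hypothetical helper for illustration, not the actual OpenMS implementation; whether the comparison is strict is an assumption here):

```python
def transferable(n_runs_with_id, n_runs_total, threshold=0.5):
    """True if an ID qualifies for transfer to runs where it was not
    identified, under the occurrence-fraction rule (strict inequality
    assumed here)."""
    return n_runs_with_id / n_runs_total > threshold

# With 7 runs: a peptide identified in 4 runs passes the default 0.5 rule,
# one identified in only 3 runs does not.
print(transferable(4, 7), transferable(3, 7))  # prints: True False
```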
@jpfeuffer @ypriverol ran the data with a more relaxed FDR of 0.15 at protein and PSM level. However, the number of quantified peptides did not increase significantly. Similarly, almost all of the peptides quantified only in MQ are from MBR. For quantms, these peptides are also only quantified in other files rather than in the same file.
I think if we loosen the percentage of aligned samples required to trigger requantification, we should first think about an FDR approach for transferred quantifications, e.g. shifted decoy EICs.
You can use the new container tomorrow to play around with a parameter that controls this proportion. You need to expose it quickly in the pipeline.
Can you let me know which parameter we will need to use?
id_transfer_threshold
What is the default value?
0.5
@timosachsenberg @jpfeuffer @daichengxin I ran the data again with the same parameters but now with id_transfer_threshold set to 0.10.
I don't see any improvements in the results. Can you double-check, @daichengxin?
Here is the data: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR/
Wow! The number of unique peptides and proteins has increased a lot. But false positives have also increased. Further adjustment of id_transfer_threshold may be needed?
Can you add the false positives and the MaxQuant results to the table?
To be honest, without FDR control the settings we can do will not be detailed enough. Sure you could require at least two samples with IDs but it will always pick up everything in the blank sample even if the correlation etc is complete trash.
I know, I want to see how many false positives we have when moving past certain thresholds.
Do you have an idea about which kind of algorithm we can implement to really control the FDR for this type of experiment?
The runs have quite a big alignment error for some masses. I will take a look at the consensusXML. Maybe if we apply stricter RT filtering on transferred quants, we already get rid of those false/spurious links from the blank sample.
Alignment based on:
- 46 data points for sample 1 (1 outliers removed)
- 2252 data points for sample 2 (14 outliers removed)
- 2123 data points for sample 3 (11 outliers removed)
- 1249 data points for sample 4 (1893 outliers removed)
- 4306 data points for sample 5 (117 outliers removed)
- 6816 data points for sample 6 (488 outliers removed)
- 4305 data points for sample 7 (2784 outliers removed)
Calculating RT linking tolerance bins...
RT_bin_start, Tolerance
58.4147, 242.107
1638.33, 276.325
...
6957.7, 2461.47
6976.28, 2292.36
6998.41, 1087.48
7032.68, 1632.92
7057.85, 2193.84
7076.39, 2258.63
7098.71, 2085.63
7119.56, 2266.9
7151.04, 1172.92
7170.17, 1165.37
7191.98, 1126.44
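A stricter RT filter on transferred quants, as suggested above, could look like this sketch (hypothetical, not the actual OpenMS code): drop a transferred feature when its observed RT deviates from the alignment-predicted RT by more than a tolerance.

```python
def filter_transferred(links, rt_tol_s):
    """links: (observed_rt, predicted_rt) pairs in seconds.
    Keep only links whose absolute RT deviation is within rt_tol_s."""
    return [(obs, pred) for obs, pred in links if abs(obs - pred) <= rt_tol_s]

links = [(100.0, 102.0), (500.0, 620.0), (700.0, 705.0)]
kept = filter_transferred(links, rt_tol_s=30.0)
print(len(kept))  # prints: 2  (the 120 s outlier is dropped)
```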
Is it possible to get the consensusXML output, @daichengxin? I want to take a look at the intensity and RT deviation distributions of the blank.
In my experience a higher deviation is normal towards the end of the RT. But yes sure they are even higher in this sample. Unfortunately usually you don't have a known blank to tweak this so an FDR approach is desperately needed. How do we set the linking tolerance again?
What about reactivating the SVM scoring of FFID as a first step?
I actually agree with @jpfeuffer on this one, because in that experiment the RT window is quite high, 20 min.
> In my experience a higher deviation is normal towards the end of the RT. But yes sure they are even higher in this sample. Unfortunately usually you don't have a known blank to tweak this so an FDR approach is desperately needed. How do we set the linking tolerance again?
AFAIK based on the maximum difference observed.
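If that is right, the per-bin tolerances in the log above would come from something like this sketch (my reading of the log, not the exact OpenMS algorithm): take the largest absolute RT deviation among the linked data points falling in each RT bin.

```python
from collections import defaultdict

def linking_tolerances(points, bin_width):
    """points: (rt, deviation) pairs; returns {bin_start: max |deviation|}."""
    bins = defaultdict(float)
    for rt, dev in points:
        start = (rt // bin_width) * bin_width
        bins[start] = max(bins[start], abs(dev))
    return dict(sorted(bins.items()))

pts = [(60.0, 200.0), (80.0, -242.0), (1650.0, 276.0)]
print(linking_tolerances(pts, bin_width=1000.0))
# prints: {0.0: 242.0, 1000.0: 276.0}
```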
> What about reactivating the SVM scoring of FFID as a first step?
It is worth a try, but I don't know what the best way is (e.g., it is pairwise). Maybe we can meet and discuss this a bit in more detail.
Ok, I see the need for FDR filtering, but intensity-wise most of the found features (about 97%) in the blank are at the noise level.
Is there an SNR value for our features?
Can we expose this as a filter?
> Is there an SNR value for our features?
Hmm, not really. We might need to be careful not to filter out the true (but probably also weak) single-cell signals. Simple ideas appreciated.
I can only think of decoy strategies :) Calculate SNRs for decoys and targets. Use IDPEP, or better, a non-parametric version of it. Then, in the simplest form, just use a probability cutoff.
At least one would save the pairwise comparisons to transfer decoys, like in other strategies. But probably at a cost, since the similarity between two features in different runs probably helps as a discriminant score.
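In its simplest form, the decoy idea boils down to something like this sketch (illustrative only; IDPEP-style posterior modeling would replace the crude estimator used here): score targets and decoys, estimate FDR at a cutoff as #decoys/#targets above it, and take the loosest cutoff meeting the desired level.

```python
def fdr_threshold(target_scores, decoy_scores, alpha=0.05):
    """Lowest score cutoff with estimated FDR <= alpha, using the simple
    (#decoys above cutoff) / (#targets above cutoff) estimator."""
    for t in sorted(set(target_scores)):
        n_t = sum(s >= t for s in target_scores)
        n_d = sum(s >= t for s in decoy_scores)
        if n_t and n_d / n_t <= alpha:
            return t
    return None

targets = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
decoys = [0.45, 0.35, 0.3, 0.2, 0.1, 0.05]
print(fdr_threshold(targets, decoys, alpha=0.2))  # prints: 0.4
```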
@ypriverol @timosachsenberg @jpfeuffer. Here are the new results after major changes in the OpenMS code for MBR. Results folder: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM/
Cool. The ratios look good. Which SVM probability cutoff was used? Can you try multiple?
feature_with_id_min_score = 0.25
feature_without_id_min_score = 0.75
https://github.com/bigbio/quantms/blob/06b9529abf10a65ee3753ea14c5fee248aa39564/nextflow.config#L158
@daichengxin how do you count the peptides? I did a quick check:
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Blank | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Singlecell1 | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Singlecell2 | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Singlecell3 | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Singlecell4 | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep 20HeLa | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep 100HeLa | wc -l
98
3877
3208
3082
4612
9042
11513
These numbers seem a bit higher.
Even if I do | grep -v "(" | to filter out modified peptides, I get way higher numbers.
What am I doing differently?
@timosachsenberg I used the msstats_in file, removed the modifications, and only counted the unique peptides.
like
cut -d',' -f 2,11 msstats.csv | sed 's/([^)]*)//g' | sort | uniq | grep Singlecell4 | wc -l
? Do you have a command?
These are my values after filtering out modified ones:
Blank 97
Singlecell1 3827
Singlecell2 3172
Singlecell3 3061
Singlecell4 4533
HeLa 8848
100HeLa 11357
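One mechanical difference between the two counting approaches (probably not the whole story, given the decoy/contaminant filtering in the scripts): `grep -v "("` drops the entire row of a modified peptide, while `sed 's/([^)]*)//g'` strips the modification so the backbone sequence still counts, so the unique counts diverge whenever a sequence occurs only in modified form. A toy illustration:

```python
import re

rows = ["PEPT(Oxidation)IDE", "PEPTIDE", "SEQ(Acetyl)UENCE"]

# grep -v "(" : drop any row containing a modification
grep_style = {s for s in rows if "(" not in s}

# sed 's/([^)]*)//g' : strip modifications, then take uniques
sed_style = {re.sub(r"\([^)]*\)", "", s) for s in rows}

print(sorted(grep_style), sorted(sed_style))
# prints: ['PEPTIDE'] ['PEPTIDE', 'SEQUENCE']
```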
I think he means unique as in "unique for a protein". But IIRC we tried to only export unique-to-group peptides to MSstats so Timo's approach might be correct.
You need to check what MQ exports and how it defines uniqueness if you want to compare by #identifications. You can check by picking some of the group-unique peptides and see how MQ defines them. Ideally one would just compare the number of features with an ID though. I have no idea if you can do that with MQ.
Here is my script. I tried to run @timosachsenberg's command, but got different values from his.
import re
import pandas as pd

def sub_mod(peptide):
    peptide = peptide.replace(".", "")
    peptide = re.sub(r"\(.*?\)", "", peptide)
    return peptide

quantms = pd.read_csv("./PXD016921MBRnewSVM/PXD016921-MBR.sdrf_openms_design_msstats_in.csv", sep=',', header=0)
quantms = quantms[~quantms['ProteinName'].str.contains("DECOY_|CONTAMINANT_|REV_|BOVIN")]
quantms["sequence"] = quantms.apply(lambda x: sub_mod(x["PeptideSequence"]), axis=1)
# Note: this unique-to-one-protein filter is computed but not used by the prints below
msstats_in_data = quantms.groupby('sequence').filter(lambda x: len(set(x["ProteinName"])) == 1)
print(quantms.groupby("Reference")["sequence"].nunique())
print(quantms.groupby("Reference")["ProteinName"].nunique())
What does your MQ script look like? And can we put the MQ data next to the quantms results for both UPS and this one? Edit: Ah, I see the MQ files at the beginning of this thread.
MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/peptides.txt", sep="\t")
# MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+") & (MQ["Unique (Groups)"] == "yes")]
MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+")]
print(MQ[~MQ["Identification type Single cell 1"].isna()]["Sequence"].nunique())
print(MQ[~MQ["Identification type Single cell 2"].isna()]["Sequence"].nunique())
print(MQ[~MQ["Identification type Single cell 3"].isna()]["Sequence"].nunique())
print(MQ[~MQ["Identification type Single cell 4"].isna()]["Sequence"].nunique())
MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/proteinGroups.txt", sep="\t")
MQ = MQ[(MQ["Reverse"] != "+") & (MQ["Potential contaminant"] != "+")]
# MQ = MQ[MQ["Unique peptides"] > 0]
print(MQ[~MQ["Identification type Single cell 1"].isna()]["Protein IDs"].nunique())
print(MQ[~MQ["Identification type Single cell 2"].isna()]["Protein IDs"].nunique())
print(MQ[~MQ["Identification type Single cell 3"].isna()]["Protein IDs"].nunique())
print(MQ[~MQ["Identification type Single cell 4"].isna()]["Protein IDs"].nunique())
print(MQ[~MQ["Identification type Blank"].isna()]["Protein IDs"].nunique())
print(MQ[~MQ["Identification type 20 HeLa cells"].isna()]["Protein IDs"].nunique())
print(MQ[~MQ["Identification type 100 HeLa cells"].isna()]["Protein IDs"].nunique())
Results with cutoffs 0.1 & 0.9. Results folder: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM/
Description of the bug
Analyzing single-cell dataset PXD016921 (the results folder can be found here). MBR is enabled. The results look quite different from the original paper: fewer proteins are quantified in the single-cell samples than in the original results, and more proteins are misquantified in the blank sample than in the original results. The original paper used MaxQuant with the following parameters:
Command used and terminal output
Relevant files
Original MaxQuant results
quantms results
System information
No response