Single cell analysis performance with worse results

daichengxin commented 1 year ago

Description of the bug

Analyzing single cell dataset PXD016291. The results folder can be found in here. The MBR is enabled. The results looks quite different from original paper. The less proteins are quantified in single cell samples than original results. And more proteins are misquantified in blank sample than original results. The original paper used MaxQuant with as follows parameters:

All raw files were processed using MaxQuant (version 1.6.3.3) for feature detection, database searching, and protein/peptide quantification. MS/MS spectra were searched against the UniProtKB/Swiss-Prot human database (downloaded on October 26, 2018, containing 20 397 reviewed sequences). N-Terminal protein acetylation and methionine oxidation were selected as variable modifications. Carbamidomethylation of cysteine residues was set as a fixed modification. The peptide mass tolerances of the first search and main search (recalibrated) were <30 and 5 ppm, respectively. The minimum peptide length was six amino acids, and the maximum peptide mass was 4600 Da. Only two missed cleavages were allowed for each peptide. The second peptide search was activated to identify coeluting and cofragmented peptides from one MS/MS spectrum. Both peptides and proteins were filtered with a maximum false discovery rate (FDR) of 0.01. The MBR feature with a matching window of 0.7 min and an alignment window of 20 min, was activated. Label-free quantitation (LFQ) calculations were performed separately in each parameter group containing similar cell loadings. Both unique and razor peptides were selected for protein quantification. Other unmentioned parameters were the MaxQuant default settings. Potential contaminants and reverse sequences were filtered out.

QQ截图20230919173108

Command used and terminal output

$ nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.01 --psm_level_fdr_cutoff 0.01 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume

Relevant files

Original MaxQuant results quantms results

System information

No response

ypriverol commented 1 year ago

@daichengxin can you add some description about How the data was analyzed with MaxQuant and the original paper.

jpfeuffer commented 1 year ago

Thank you for the analysis. Very interesting. Can we maybe run the analysis with looser FDR cutoffs to see if it's more an identification problem or a quantification problem?

ypriverol commented 1 year ago

@jpfeuffer a couple of ideas:

In single cell, looks for me that MBR would be doing most of the job for the single cell runs. If you see we ID most of the peptides in the 20 and 100 samples (complex samples), more than MQ. Those samples are used as "channels" to look for poor signals from the single cell files. My point is the following:

The MBR feature with a matching window of 0.7 min and an alignment window of 20 min, was activated.

Do you think this 20 min window that they use for MBR in MQ can be the difference with us?

Do you think we need to do something in the experimental design.? For the MBR we are annotating every sample as a biological replicate to enable MBRs.

jpfeuffer commented 1 year ago

Yes, if we find out that it's not the identification performance we will check quantification. But if you have only 100 proteins that pass FDR, you cannot quantify more than that and it is useless to check quantification.

ypriverol commented 1 year ago

@daichengxin can we check in the output of MQ of the original project how many peptides from the single cell samples are quantify/identified using MBR, I think MQ use a notation like No MS/MS or something like that.

daichengxin commented 1 year ago

More than half of the unique peptides originated with the MBR.

ypriverol commented 1 year ago

This is what I thought, see @jpfeuffer The majority of peptides are not coming from IDs but from the MBR, then we should see how we fail to do proper MBR in our side, my previous two comments.

jpfeuffer commented 1 year ago

Still.. if they don't pass FDR, there is nothing to quantify. So we have to wait for the ID analysis

daichengxin commented 1 year ago

Almost all of the peptides quantified only in MQ are from MBR. However, for quantms, these peptides also are only quantified in other files rather than from same file. it suggest we fail to do proper MBR, right?

jpfeuffer commented 1 year ago

Can you please add the command you used for quantms?

daichengxin commented 1 year ago

nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.01 --psm_level_fdr_cutoff 0.01 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume @jpfeuffer

jpfeuffer commented 1 year ago

I don't know. I can't see any signs of transfer in the Proteomicslfq logs. Do you have the call to proteomicslfq?

ypriverol commented 1 year ago

The command with the relapsed FDR is:

nextflow run /hps/nobackup/juan/pride/reanalysis/quantms/main.nf -c /hps/nobackup/juan/pride/reanalysis/quantms/nextflow.config -profile ebislurm --input PXD016921-MBR.sdrf.tsv --search_engines comet,msgf --root_folder /hps/nobackup/juan/pride/reanalysis/single-cell/PXD016921/ --local_input_type raw --outdir PXD016921-MBR --database /hps/nobackup/juan/pride/reanalysis/multiomics-configs/databases/Homo-sapiens-uniprot-reviewed-contaminants-decoy-202210.fasta --protein_level_fdr_cutoff 0.15 --psm_level_fdr_cutoff 0.15 --quantify_decoys true --transfer_ids mean --targeted_only false --skip_post_msstats false --enable_pmultiqc true -resume

Here the results: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR/

jpfeuffer commented 1 year ago

I found the plfq command. Looks correct. One thing is, that we only transfer_ids if they occur in more than 50% of the samples. I don't know how MQ does it.

daichengxin commented 1 year ago

@jpfeuffer @ypriverol ran the data with a more relaxed FDR 0.15 at Protein and PSM level. However, the number of quantified peptides did not increase significantly. Similarily, almost all of the peptides quantified only in MQ are from MBR. For quantms, these peptides also are only quantified in other files rather than from same file.

jpfeuffer commented 1 year ago

I think if we loosen the percentage of aligned samples required to trigger requantification, we should think about an FDR approach for transferred quantification first. E.g. shifted decoy EICs

jpfeuffer commented 1 year ago

You can use the new container tomorrow to play around with a parameter that controls this proportion. You need to expose it quickly in the pipeline.

ypriverol commented 1 year ago

Can you let me know what is the parameter that we will need to use.

jpfeuffer commented 1 year ago

id_transfer_threshold

ypriverol commented 1 year ago

What is the default value?

timosachsenberg commented 1 year ago

0.5

ypriverol commented 1 year ago

@timosachsenberg @jpfeuffer @daichengxin I ran again the data with the same parameters but now with id_transfer_threshold as 0.10 I don't see any improvements in the results. Can you double check @daichengxin?

Here the data: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR/

daichengxin commented 1 year ago

Wow! The number os unique peptides and proteins have increased a lot. But false positives have also increased. Further adjustment of id_transfer_threshold may be needed?

ypriverol commented 1 year ago

Can you add the False positives and the MaxQuant results in the table?

jpfeuffer commented 1 year ago

To be honest, without FDR control the settings we can do will not be detailed enough. Sure you could require at least two samples with IDs but it will always pick up everything in the blank sample even if the correlation etc is complete trash.

ypriverol commented 1 year ago

To be honest, without FDR control the settings we can do will not be detailed enough. Sure you could require at least two samples with IDs but it will always pick up everything in the blank sample even if the correlation etc is complete trash.

I know, I want to see how much False positives do we have when moving over some thresholds.

Do you have an idea about which kind of algorithm we can implement to really control the FDR for this type of experiment?

timosachsenberg commented 1 year ago

The runs have quite a big alignment error for some masses. I will take a look at the consensusXML. Maybe if we apply stricter RT filtering on transferred quants, we already get rid of those false/spurious links from the blank sample.

Alignment based on:
- 46 data points for sample 1 (1 outliers removed)
- 2252 data points for sample 2 (14 outliers removed)
- 2123 data points for sample 3 (11 outliers removed)
- 1249 data points for sample 4 (1893 outliers removed)
- 4306 data points for sample 5 (117 outliers removed)
- 6816 data points for sample 6 (488 outliers removed)
- 4305 data points for sample 7 (2784 outliers removed)

Calculating RT linking tolerance bins...
RT_bin_start, Tolerance
58.4147, 242.107
1638.33, 276.325
...
6957.7, 2461.47
6976.28, 2292.36
6998.41, 1087.48
7032.68, 1632.92
7057.85, 2193.84
7076.39, 2258.63
7098.71, 2085.63
7119.56, 2266.9
7151.04, 1172.92
7170.17, 1165.37
7191.98, 1126.44

timosachsenberg commented 1 year ago

Is it possible to get the consensusXML output @daichengxin ? I want to take a look at the intensity and RT deviation distribution of the blank

ypriverol commented 1 year ago

http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR/proteomicslfq/PXD016921-MBR.sdrf_openms_design_openms.consensusXML

jpfeuffer commented 1 year ago

In my experience a higher deviation is normal towards the end of the RT. But yes sure they are even higher in this sample. Unfortunately usually you don't have a known blank to tweak this so an FDR approach is desperately needed. How do we set the linking tolerance again?

What about reactivating the SVM scoring of FFID as a first step?

ypriverol commented 1 year ago

I actually with @jpfeuffer in this one because in that experiment the RT window is quite high 20 min.

timosachsenberg commented 1 year ago

In my experience a higher deviation is normal towards the end of the RT. But yes sure they are even higher in this sample. Unfortunately usually you don't have a known blank to tweak this so an FDR approach is desperately needed. How do we set the linking tolerance again?

AFAIK based on the maximum difference observed.

What about reactivating the SVM scoring of FFID as a first step?

It is worth a try, but I don't know what the best way is (e.g., it is pairwise). Maybe we can meet and discuss this a bit in more detail.

timosachsenberg commented 1 year ago

Ok I see the need for FDR filtering but intensity-wise most of the found features (about 97%) in the blank are at the noise level.

jpfeuffer commented 1 year ago

Is there an SNR value for our features?

ypriverol commented 1 year ago

Can we expose this as a filter?

timosachsenberg commented 1 year ago

Is there an SNR value for our features?

Hmm not really. We might need to be careful to not filter out the true (but probably also weak) single cell signals. Simple ideas appreciated.

jpfeuffer commented 1 year ago

I can only think of decoy strategies :) calculate SNRs for decoys and target. Use IDPEP or better a non-parameteric version of it. Then, in the simplest form, just use a probability cutoff.

jpfeuffer commented 1 year ago

At least, one would save the pairwise comparisons to transfer decoys like in other strategies. But probably at a cost, since the similarity between two features in different runs, probably helps as a discriminant score.

daichengxin commented 1 year ago

@ypriverol @timosachsenberg @jpfeuffer. Here are new results after major changes in the code of openms for MBR. Results folder: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM/

jpfeuffer commented 1 year ago

Cool. The ratios look good. Which SVM probability cutoff was used? Can you try multiple?

ypriverol commented 1 year ago

feature_with_id_min_score = 0.25
feature_without_id_min_score = 0.75

https://github.com/bigbio/quantms/blob/06b9529abf10a65ee3753ea14c5fee248aa39564/nextflow.config#L158

timosachsenberg commented 1 year ago

@daichengxin how do you count the peptides? I did some quick check:

cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Blank | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Singlecell2 | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Singlecell3 | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep Singlecell4 | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep 20HeLa | wc -l
cut -d',' -f 2,11 notransfer_1.0_1000_star.csv | sort | uniq | grep 100HeLa | wc -l

98
3877
3208
3082
4612
9042
11513

These numbers seem a bit higher. Even if I do | grep -v "(" | to filter out modified I get way higher numbers.

What am I doing differently?

daichengxin commented 1 year ago

@timosachsenberg I used the msstats_in file, removed the modification and only counted the unique peptides.

timosachsenberg commented 1 year ago

like

cut -d',' -f 2,11 msstats.csv |  sed 's/([^)]*)//g' | sort | uniq | grep Singlecell4 | wc -l

? Do you have a command?

These are my values with filtering out modified ones:

Blank 97
Singlecell1 3827
Singlecell2 3172
Singlecell3 3061
Singlecell4 4533
HeLa 8848
100HeLa 11357

jpfeuffer commented 1 year ago

I think he means unique as in "unique for a protein". But IIRC we tried to only export unique-to-group peptides to MSstats so Timo's approach might be correct.

jpfeuffer commented 1 year ago

You need to check what MQ exports and how it defines uniqueness if you want to compare by #identifications. You can check by picking some of the group-unique peptides and see how MQ defines them. Ideally one would just compare the number of features with an ID though. I have no idea if you can do that with MQ.

daichengxin commented 1 year ago

Here is my script. And i try to run @timosachsenberg command, but got different values from @timosachsenberg.

def sub_mod(peptide):
    peptide = peptide.replace(".", "")
    peptide = re.sub(r"\(.*?\)", "", peptide)
    return peptide

quantms = pd.read_csv("./PXD016921MBRnewSVM/PXD016921-MBR.sdrf_openms_design_msstats_in.csv", sep=',', header=0)
quantms = quantms[-quantms['ProteinName'].str.contains("DECOY_|CONTAMINANT_|REV_|BOVIN")]
quantms["sequence"] = quantms.apply(lambda x: sub_mod(x["PeptideSequence"]), axis=1)
msstats_in_data = quantms.groupby('sequence').filter(lambda x: len(set(x["ProteinName"])) == 1)
print(quantms.groupby("Reference")["sequence"].nunique())
print(quantms.groupby("Reference")["ProteinName"].nunique())

jpfeuffer commented 1 year ago

How is your MQ script? And can we put the MQ data next to the quantms results for both UPS and this one? Edit: Ah I see the MQ files in the beginning of this thread.

daichengxin commented 1 year ago

MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/peptides.txt", sep="\t")
# MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+") & (MQ["Unique (Groups)"] == "yes")]
MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+")]
print(MQ[-MQ["Identification type Single cell 1"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 2"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 3"].isna()]["Sequence"].nunique())
print(MQ[-MQ["Identification type Single cell 4"].isna()]["Sequence"].nunique())

MQ = pd.read_csv("./MQsearchresults_Single_cell/txt/proteinGroups.txt", sep="\t")
MQ = MQ[(MQ["Reverse"] != "+")& (MQ["Potential contaminant"] != "+")]
# MQ = MQ[MQ["Unique peptides"] > 0]
print(MQ[-MQ["Identification type Single cell 1"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 2"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 3"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Single cell 4"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type Blank"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type 20 HeLa cells"].isna()]["Protein IDs"].nunique())
print(MQ[-MQ["Identification type 100 HeLa cells"].isna()]["Protein IDs"].nunique())

daichengxin commented 1 year ago

0.1&0.9 results. Results folder: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD016921-MBR-newSVM/

bigbio / quantms