Considerably less clusters when scaling is applied

KilianMaes commented 3 years ago

Hello,

I'm using falcon to cluster a large proteomic dataset (20M spectra). I've compared the number of clusters generated depending on the scaling method applied on the peaks (root, log, rank or none), and I get very different results :

Parameters

I kept all the parameters by default excepted : eps = 0.35 charge = (2, 3, 4, 5)

Results for ~39 000 MS2 spectra (one fraction)

Scaling	# clustered spectra	% clustered spectra	# clusters
root	148	0.38%	54
log	2	0.01%	1
rank	2	0.01%	1
None	883	2.24%	359

Results for ~460 000 MS2 spectra (12 fractions)

Scaling	# clustered spectra	% clustered spectra	# clusters
root	4642	1.00%	1656
log	26	0.01%	12
rank	39	0.01%	19
None	34519	7.45%	13336

Discussion

I expected to get a different number of clusters depending on the scaling method applied on the peaks, but I think that the difference is very large for my dataset. Moreover, even when no scaling is applied (and the number of clusters is the largest), only a few percent of spectra are being clustered, which is far from the order of magnitude in the results published, where the proportion of clustered spectra ranges from ~20% to ~60% (with the 'rank' scaling).

@bittremieux did you observe similar results during your experiments on some dataset? Do you have some idea that could explain the difference presented above? I would be very grateful... If needed, I can upload the files in order to reproduce these results.

Cc @lgatto

bittremieux commented 3 years ago

This is very strange. Scaling should not have a massive influence; it only helps to de-emphasize overly dominant peaks. But irrespective of scaling, the intensities are normalized so that the spectra are converted to unit vectors. I did not explicitly evaluate different scaling methods here, but have in the past done comparisons that showed that rank-based scaling outperformed square root scaling. So normally I wouldn't expect scaling to have such a big difference in the number of clusters and clustered spectra.

There seems to be a bigger problem though: the number of clustered spectra is very low, especially with the hyperparameters you use. It makes sense that the more spectra you have, the more spectra will be clustered, because it will be more likely that there are similar spectra. Your numbers are extremely low though. That, and the big differences between scaling methods, shows that something weird is going on. To diagnose the problem it would help to have access to the same data, so if this is a public dataset or if you can share it, that would be useful. Thanks.

bittremieux commented 3 years ago

Ok, I can reproduce differences that are larger than expected between different scaling methods. I haven't been able to find a clear bug, so I'll need to dig a bit more to figure out whether this is normal or not.

KilianMaes commented 3 years ago

Thank you for taking a look at these results. My results were obtained on the data of the quantitative proteomics of the CCLE (https://pubmed.ncbi.nlm.nih.gov/31978347/). I think the files are not directly available, but you can download the mzML's of the first 12 fractions here if needed : https://filesender.belnet.be/?s=download&token=9cc70839-bb8d-42d4-866b-a35931f7537b (the link expires in 7 days).

I'm also trying to figure out what could be the problem, but I haven't found any explanation yet.

Cc @lgatto

bittremieux commented 3 years ago

There's a significant difference in the nearest neighbor distances depending on what type of scaling is used (45,492 spectra of charge 2; 569,294 non-zero pairwise distances): nn_dist_scaling

In blue pairwise nearest neighbor distances between spectra that were assigned identical peptide sequences, in dark pink pairwise nearest neighbor distances between spectra that were assigned different peptide sequences. Note that the low dark pink distances might be pairs of spectra that only differ in the nearby location of the same modification or where one spectrum couldn't be identified at 1% FDR.

Rank and root scaling are pretty similar, no scaling clearly gives the best separation between spectra with identical/different peptide assignments, and log scaling is just bad.

As the scaling methods result in larger pairwise distances for spectra corresponding to the same peptides, this will have an influence on the number of clusters that can be found because there might be less spectrum pairs within the distance threshold.

I'm keeping the option to scale the intensities during spectrum preprocessing, but I'm setting the default value to no scaling as this clearly seems superior. The bad news is that I should have evaluated this properly, the good news is that falcon can probably perform better than the results reported in the preprint because I used rank scaling there.

There's still an issue that you get a pretty low number of clusters and clustered spectra. I'll try to have a look at your data tomorrow.

KilianMaes commented 3 years ago

Thanks for these explanations, it's very clear. I am impatient to hear your opinion on the results obtained on the CCLE dataset.

bittremieux commented 3 years ago

It seems like that that CCLE data is just weird. I ran clustering with slightly different parameters for more exhaustive spectrum matching to ensure that no neighboring spectra were missed. A similar plot as above: ccle

No coloration this time because I don't have PSMs, but the plot speaks for itself. Only very few spectra with low pairwise distance, while the vast majority of spectra pairs are almost completely different.

So based on that plot it makes total sense that barely any clusters are found as there are hardly any similar spectra. It's interesting that this observation is so stark — does that correspond with the results you get from sequence database searching or alternative clustering tools?

lgatto commented 3 years ago

Thank you @bittremieux for looking into this.

These are TMT 16-plex data, with a 3rd MS level. I don't think this could explain the weirdness, does it?

bittremieux commented 3 years ago

I'm not familiar with the characteristics of that type of data. Would you cluster the MS3 level instead of the MS2s then? Or both MS2 and MS3?

falcon reads the MS2 spectra from mzML/mzXML files. If I read only the MS3 spectra from the mzML files that were shared instead, I get quite different results:

2021-03-12 16:41:16,886 INFO [spectrum_clustering/MainProcess] falcon.main : Export cluster assignments of 132017 spectra to 33547 unique clusters

132,017 / 176,389 = 75% spectra clustered. This is much more like what I'd expect with a 0.35 cosine distance threshold.

In this case it actually seems that there are barely any dissimilar spectra for all 64-nearest neighbors per spectrum: nn_dist

So it seems we've found the problem. I can make the MS level a user parameter, but what would be the best option to use as a default? MS2 spectra or MSn spectra (so not MS1 but everything else)?

lgatto commented 3 years ago

Would you cluster the MS3 level instead of the MS2s then? Or both MS2 and MS3?

MS2 data, given that's the level the spectra are id'ed on. MS3 is for quantitation only, focusing on the 16 common TMT peaks. This would also explain that there are barely any dissimilar spectra.

So it seems that the problem isn't solved yet :-/

bittremieux commented 3 years ago

Oh ok, that's too bad. ☹️ Based on the density plot there are very few similar MS2 spectra. Do you also have the identification data? That way we'd be able to check whether the identified peptides are almost entirely unique per spectrum. That would be a bit weird but maybe not entirely inconceivable in case of extreme fractionation and dynamic exclusion?

lgatto commented 3 years ago

Yes, I have the identification data. Will explore.

Actually, there's another particular thing with TMT data, as opposed to label free. Each MS2 spectrum is in fact a mix of the same peptide stemming from 16 different combined samples. Not sure if this would explain the weird results though.

KilianMaes commented 3 years ago

So based on that plot it makes total sense that barely any clusters are found as there are hardly any similar spectra. It's interesting that this observation is so stark — does that correspond with the results you get from sequence database searching or alternative clustering tools?

I've used spectra-cluster on a subset of the CCLE data to compare the output with the one of falcon. Here are the results.

Parameters

I made vary the parameters 'precursor_tolerance' (I used the same values as in the falcon preprint) and kept the default parameters excepted :

precursor_tolerance = 20 ppm

(by default, 5 rounds are made) I used MSConvert (ProteoWizard) to convert the mzML to mgf files (I only kept the MS2 spectra).

Results for ~39 000 MS2 spectra (one fraction) clustered by spectra-cluster

threshold_end	# clustered sp	% clustered sp	# clusters	av (# sp)/cluster
0.99999	1145	2.92%	525	2.18
0.9999	2920	7.44%	1318	2.22
0.999	6453	16.43%	2820	2.29
0.99	11580	29.49%	4792	2.42
0.95	15313	39.00%	6149	2.49
0.9	16936	43.13%	6727	2.52
0.8	18461	47.02%	7209	2.56
0.7	19272	49.08%	7478	2.58

Results for ~460 000 MS2 spectra (12 fractions) clustered by spectra-cluster

threshold_end	# clustered sp	% clustered sp	# clusters	av (# sp)/cluster
0.99999	10481	2.26%	4546	2.31
0.9999	39278	8.48%	16253	2.42
0.999	86854	18.74%	34310	2.53
0.99	160811	34.70%	57768	2.78
0.95	230385	49.71%	70536	3.27
0.9	262192	56.57%	74667	3.51
0.8	295242	63.70%	78084	3.78
0.7	314597	67.88%	79439	3.96

@bittremieux it seems that the proportion of clustered spectra is larger with spectra-cluster. However I'm not totally sure how to interpret these results, because the algorithms are different, as well as the parameters (eps for falcon and threshold_end for spectra-cluster). Based on your previous comparison of these two algorithms, do you think that the proportion of clustered spectra with falcon is anormally low compared to spectra-cluster? Do you think anything other than the data itself could explain this difference?

Cc @lgatto

bittremieux commented 3 years ago

Thanks for the detailed follow-up. These spectra-cluster results look much more sensible. Here you see an increasing number of clustered spectra as the similarity threshold is relaxed, as you'd expect.

Meanwhile, in the previous histogram with falcon similarities you can see that relaxing the eps threshold will hardly make a difference because there are barely any similar spectra. This is unexpected and indicates a problem with falcon, which the spectra-cluster results seem to confirm.

I'll need to check in full detail what's going on during the spectrum similarity computations and step through the data to figure out why falcon is giving these weird results. That'll probably take me some time.

KilianMaes commented 3 years ago

Hello, I've downloaded the data set you used to test falcon (PXD000561) and I've made some comparisons with the data set I'm using, CCLE. More precisely :

I plotted the distribution of the precursor mz's among the 1 mz buckets.
I checked the error caused by the conversion from high dimensional vectors (HD) to low dimensional vectors (LD).
I compared the distances obtained by falcon with the distances obtained with MS-clustering (cosine distance using a "sliding window").

Data sets

For CCLE, I used the 12 same fractions (called Prot_01 in the figures below) used for the previous comparisons (the ones we sent you two weeks ago). For PXD000561, I've used the 35 fractions of "Fetal_Liver_bRP_Elite_22". Such that, the number of spectra of charge 2 is almost the same. spectra_per_charge

Clusters' stats

On the following plots, I used falcon to cluster the two data sets and I made vary the eps parameter of DBSCAN. I kept all the other parameters by default (including scaling=None). We can observe the low proportion of spectra in clusters (a cluster = at least two spectra), and the low number of clusters. Note that for CCLE, the size of the clusters (the average number of spectra per cluster) increases when we relax eps, I assume this is because most of the spectra clustered only for eps > 0.05 are in clusters of size 2. stats

Methodology

In order to compare the distances generated by falcon and the distances generated with the cosine distance s.t. it is defined in MS-Clustering (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2533155/, "Supplementary material"), I've compared all the pairs of spectra respecting the constraint on the precursor mz's. To avoid having to load all the spectra in memory at the same time, I used a sliding window such that only the current spectra and all the spectra having a similar precursor mz must be kept in memory at the same time.

To verify my implementation, I've compared the sparse distance matrix generated by falcon with the sparse distance matrix I generated directly from the .pkl files using a brute-force approach for a small data set (less than 30 spectra per bucket, such that all the neighbors found by falcon are kept, we don't need to keep only the n_neighbors nearest neighbors). Note that the brute-force distance matrix only contains distances for spectra respecting the constraint on the precursor mz. I've observed that :

The two distance matrices have the same size.
All the entries from the falcon distance matrix are in the brute-force distance matrix.
The number of entries in the brute-force matrix that are not in the falcon matrix corresponds to the number of pairs of spectra that have similar precursor mz's but are in different buckets (because the precursor mz is very close to an integer).

Based on these observations, I assume my implementation of the sliding window is correct. If you want to take a look, the code I used to generate the plots is available here :

https://github.com/KilianMaes/falcon/blob/invest_falcon/notebooks/compareDistanceMatrices.ipynb
https://github.com/KilianMaes/falcon/blob/invest_falcon/notebooks/nb_utils.py (I haven't had the time to comment properly the code yet).

Precursor mz's distribution (for charge 2)

The distribution of the spectra in the 1 mz buckets : precursor_mzs_charge2 Note that in both data sets, there are never more than 1000 spectra per bucket, which means that falcon won't partition the buckets but use a brute force approach instead.

All the plots below were computed for the spectra of charge 2.

HD vectors to LD vectors

I've plotted the cosine distance computed from the HD vectors (before the dimensionality reduction) and the cosine distance stored in the matrix computed by falcon. Because the plot is not completely readable, I've added a 2D histogram. Some cells (bottom left and top right) contained many elements and made the rest of the 2D histogram appear uniform, so I've excluded these cells with the cmax parameter of pyplot (= the white cells). hdvectors_vs_falcon_charge2

Note that the horizontal line at the top corresponds to the pairs of spectra that are not in the same bucket (~1.5% of the pairs with similar precursor mz's for both datasets) : the distance matrix computed by falcon doesn't contain an entry for them and the distance is assumed to be 1. From these plots, it seems that there are many more pairs of spectra in different buckets for PXD000561, but this is not the case : for CCLE, the pairs of spectra tend to be more dissimilar, thus the points corresponding to these points are in the top-right corner and we cannot see them.

We can see a strong correlation between the distance computed from the HD vectors and from the LD vectors : the assumption that the Euclidean distance is conserved during the dimensionality reduction seems correct for a high proportion of pairs of spectra. The points at the bottom right indicate spectra for which falcon underestimates the distance due to the dimensionality reduction. That could explain having some clusters with disimilar spectra.

MS-Clustering vs falcon

I've done the same with the MS-Clustering distance : msclustering_vs_falcon_charge2

We can also observe the correlation between the two distance functions.

Comparison of the different distance functions

I've made a box plot of the distances stored in the distance matrix (excepted for the diagonal) for each of the distance functions presented above. dist_distr_measures_charge2 Even though there are small variations between these measures, we can clearly see that in the CCLE dataset, the distances tend to be significantly higher than for PXD000561. In conclusion, I don't think that the low proportion of clusters is directly related to falcon, but to intrinsic characteristics of the CCLE data set. In spectra-cluster, the distance function is not the cosine distance but a probabilistic measure, maybe that could explain the higher proportion of clustered spectra with this tool?

bittremieux commented 3 years ago

Impressive analysis, amazing job.

Yes, I agree with your conclusion: it seems that the spectra in the CCLE dataset are just very dissimilar from each other, and hence, they will not get clustered. And that seems true for slightly different implementations of spectrum similarity (in the end both MS-Cluster and falcon use cosine similarity). So the low rate of clustered spectra seems inherent to the data characteristics.

It's possible that the probabilistic scoring measure of spectra-cluster is able to get a somewhat complementary sense of spectrum similarity. The authors actually explicitly note in their second spectra-cluster paper that they saw an improvement when switching from the cosine similarity (the first spectra-cluster version was pretty much a re-implementation of MS-Cluster) to their probabilistic score. (Although from personal conversations I know that nowadays they think the increase in performance might also just be due to general improvements of the spectra-cluster code between versions 1 and 2.)

I'm somewhat skeptical that when spectra are very different based on the cosine similarity, that they still really correspond to the same peptide. It could be interesting to evaluate the spectra-cluster results using the identifications from sequence database searching to check whether clusters consist of a mixture of peptides or are pure.

Your comparison between the hashed vectors and the high-dimensional vectors matches my own analyses. For dissimilar spectra (high cosine distance) the difference between both scores can differ a bit more, but in the end we're only interested in the bottom left quadrant of those heatmaps, with highly similar neighbors. For those matches the approximated cosine distance using the hashed vectors is pretty accurate.

KilianMaes commented 3 years ago

I've checked manually the clusters generated by spectra-cluster (see my previous message, the proportion of clustered spectra was very high compared to falcon) :

I've used spectra-cluster on a subset of the CCLE data to compare the output with the one of falcon. Here are the results.

Parameters

I made vary the parameters 'precursor_tolerance' (I used the same values as in the falcon preprint) and kept the default parameters excepted :

precursor_tolerance = 20 ppm

(by default, 5 rounds are made) I used MSConvert (ProteoWizard) to convert the mzML to mgf files (I only kept the MS2 spectra).

[...]

Results for ~460 000 MS2 spectra (12 fractions) clustered by spectra-cluster

threshold_end # clustered sp % clustered sp # clusters av (# sp)/cluster 0.99999 10481 2.26% 4546 2.31 0.9999 39278 8.48% 16253 2.42 0.999 86854 18.74% 34310 2.53 0.99 160811 34.70% 57768 2.78 0.95 230385 49.71% 70536 3.27 0.9 262192 56.57% 74667 3.51 0.8 295242 63.70% 78084 3.78 0.7 314597 67.88% 79439 3.96 @bittremieux it seems that the proportion of clustered spectra is larger with spectra-cluster. However I'm not totally sure how to interpret these results, because the algorithms are different, as well as the parameters (eps for falcon and threshold_end for spectra-cluster). Based on your previous comparison of these two algorithms, do you think that the proportion of clustered spectra with falcon is anormally low compared to spectra-cluster? Do you think anything other than the data itself could explain this difference?

I noticed that the cosine distance between spectra in the same cluster tended to be very high (often close to 1). Thus, I clustered the same data (~460 000 spectra) but this time, I specified the value for the fragment ion tolerance : 0.05 mz (the first time I used spectra-cluster, I kept the default value which is 0.5 I think, not sure). Such that, the fragment ion tolerance is the same as for falcon. The proportion of clustered spectra decreased considerably :

threshold_end	# clustered sp	% clustered sp	# clusters av (# sp)/cluster
0.99999	3650	0.79 %	2.19
0.99	12797	2.76 %	2.34

For example, for threshold_end = 0.99, the proportion of clustered spectra decreased from 34.70 % to 2.76 % when I decreased the fragment ion tolerance. That seems to indicate that many spectra were considered similar thanks to the permissive fragment ion tolerance.

Moreover, I checked what was the cosine similarity inside the clusters generated with the more restrictive fragment ion tolerance (0.05 mz). I know that the cosine distance is not the distance function used in spectra-cluster (at least not the last version), but I wanted to have some point of comparison. For threshold_end = 0.99999, I computed the average cosine distance (using the high dimensional vectors, not processed) between the spectra of the same clusters for the first 100 clusters. In average, the average cosine distance inside the clusters is 0.51 : the probabilistic distance function used by spectra-cluster seems to be more permissive than the cosine distance, it considers spectra as similar even if their cosine distance is "high".

That seems to answer my last question; why the proportion of clustered spectra was higher with spectra-cluster. Note that @lgatto is investigating why the spectra in the CCLE data set tend to be very dissimilar, and he noticed that CCLE contains many chimeric spectra, which could explain the dissimilarity.

I think we can now close this issue, thanks a lot @bittremieux for quickly replying to all our comments.

bittremieux commented 3 years ago

That's a pretty big influence of the fragment mass tolerance, but the difference in performance between spectra-cluster and falcon makes a lot more sense now.

Also interesting to see the low intra-cluster cosine similarity, even with such a high threshold. It corresponds with some troubles I have had to get a very pure (i.e. no incorrectly clustered spectra) result out of spectra-cluster.

Chimeric spectra are a potential complicating factor for all clustering tools I think. falcon at least doesn't do anything special to try to account for chimeric spectra, because I don't even know what could be done. However, from the earlier cosine density plots we saw that some neighbors are very similar, a lot are extremely dissimilar, but there are no neighbors with average similarity. With chimeric spectra I'd actually expect the latter: the spectra should be similar because they share peaks coming from one ion, but the similarity is reduced a bit due to the "noise" peaks from the other ion(s). We don't see such a behavior here though.

So this seems like an interesting/weird dataset. Thank you for all your detailed investigations. In the end the problem was mainly with the data, but thanks to your efforts I was able to make some improvements to falcon as well.

bittremieux / falcon