bittremieux / falcon

Large-scale tandem mass spectrum clustering using fast nearest neighbor searching.
BSD 3-Clause "New" or "Revised" License

Further dimensionality reduction #10

Closed KilianMaes closed 2 years ago

KilianMaes commented 3 years ago

Hello,

As discussed by email with @bittremieux, I compared the hashing trick currently used in falcon with other dimensionality reduction methods. Here are my results.

Dimensionality reduction methods

I compared the following methods: the hashing trick currently used in falcon, PCA, and Gaussian random projection.

Since the dimensionality reduction methods provided by sklearn require high-dimensional vectors as input, I implemented a function, sp_to_vecHD, to convert spectra to high-dimensional vectors. (This step is not needed for the hashing trick, since it converts spectra to low-dimensional vectors on the fly, without ever storing the high-dimensional vectors in memory.)
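For reference, a minimal sketch of what such a conversion could look like. The m/z range, bin width, and normalization below are illustrative assumptions (roughly matching the d ≈ 28,000 mentioned further down), not falcon's exact settings:

```python
import numpy as np

def sp_to_vecHD(mz, intensity, min_mz=101.0, max_mz=1500.0, bin_size=0.05):
    """Convert a spectrum (m/z and intensity arrays) to a dense
    high-dimensional vector by binning the m/z axis.

    The m/z range and bin size are illustrative; with these values the
    vector has roughly 28,000 dimensions.
    """
    n_bins = int(np.ceil((max_mz - min_mz) / bin_size))
    vec = np.zeros(n_bins, dtype=np.float32)
    for m, i in zip(mz, intensity):
        if min_mz <= m < max_mz:
            vec[int((m - min_mz) / bin_size)] += i
    # Normalize so that cosine similarities are directly comparable.
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```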

Comparison for 800 components

Since falcon uses the cosine distance to measure distances between vectors, I compared how well the cosine distance between pairs of spectra is preserved when going from the HD vector representation to the corresponding LD vectors. Ideally, the distance should be preserved as much as possible, such that the loss of precision caused by the dimensionality reduction is negligible.

In the following figure, I plotted these distances (HD distance vs. LD distance) for all pairs from a subset of 5,000 spectra (charge 2, the first 5,000 spectra with a precursor m/z greater than 600 in the draft of the human proteome).

[Figure ncomponents_800: HD vs. LD cosine distance for 800 components]

It shows that for close pairs of spectra (distances close to 0), the hashing trick does the job (it works well if small epsilons are used in DBSCAN, e.g. 0.1). However, for larger distances (e.g., an epsilon of 0.3), the closeness of some pairs of points is overestimated: some pairs of spectra whose HD vectors are separated by a large distance are converted to LD vectors whose cosine distance is underestimated (bottom-right part of the plot). PCA does not seem to be a good candidate: its mean squared error, used to assess the difference between HD and LD pairwise distances, is worse than for the hashing trick. Gaussian random projection, however, shows good results: it preserves the cosine distance better during dimensionality reduction.
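For completeness, roughly how such a comparison can be set up with scikit-learn. spectra_hd stands in for the matrix of HD vectors; the random placeholder data and the 800-component setting are only there to make the snippet self-contained:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.metrics.pairwise import cosine_distances

# Placeholder HD matrix; in the actual comparison this is built from the
# 5,000 spectra (e.g., with sp_to_vecHD above).
rng = np.random.default_rng(42)
spectra_hd = rng.random((500, 27_980)).astype(np.float32)

grp = GaussianRandomProjection(n_components=800, random_state=42)
spectra_ld = grp.fit_transform(spectra_hd)

dist_hd = cosine_distances(spectra_hd)
dist_ld = cosine_distances(spectra_ld)

# Mean squared error between HD and LD pairwise distances
# (upper triangle only, so every pair is counted once).
iu = np.triu_indices_from(dist_hd, k=1)
mse = np.mean((dist_hd[iu] - dist_ld[iu]) ** 2)
print(f"MSE between HD and LD cosine distances: {mse:.4f}")
```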

Note that some distances between LD vectors can become larger than 1: LD vectors obtained with random projection can have negative components (which was not the case with the hashing trick), so the cosine similarity can be negative and the cosine distance can exceed 1. This is not a problem.
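A toy example of a distance above 1 (any two vectors with a negative dot product will do):

```python
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, -2.0, 0.5])
b = np.array([-1.0, 1.5, 0.2])
# scipy's cosine() returns the cosine distance 1 - cos(a, b);
# a negative cosine similarity therefore yields a distance > 1.
print(cosine(a, b))  # ~1.94
```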

Comparison for a similar MSE

Since Gaussian random projection preserves the distances better than the hashing trick (of course, it would be useful to verify this on other subsets to make sure these results generalize), we could consider reducing the dimensionality further: instead of using 800 components, Gaussian random projection achieves a precision similar to that of the hashing trick with only 200 components.

[Figure gauss_200components: HD vs. LD cosine distance for Gaussian random projection with 200 components]

Performance

Of course, performance is crucial for falcon. I cannot provide a clear comparison between the two methods, since the implementations differ: during my tests I generated the HD vectors corresponding to the spectra and then used the random projection implementation provided by sklearn. However, it would be easy to perform the conversion on the fly, as is the case for the hashing trick. Note that random projection introduces a memory overhead: it requires storing the projection matrix, which has size d * d' (where d is the number of HD components and d' the number of LD components). For example, for d = 28,000 (the default) and d' = 200, the matrix would contain ~5.6M entries, which (only) requires ~45 MB (8 bytes per float).
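A quick back-of-the-envelope check of that overhead, plus what an on-the-fly projection of a single spectrum could look like. The 1/sqrt(d') scaling mirrors the usual Gaussian random projection convention, and the function name is hypothetical:

```python
import numpy as np

d, d_ld = 28_000, 200
# Gaussian projection matrix; entries drawn from N(0, 1 / d_ld).
projection = np.random.default_rng(0).standard_normal((d, d_ld)) / np.sqrt(d_ld)

# Memory footprint of the only extra state that has to be kept around.
print(f"{projection.nbytes / 1e6:.1f} MB")  # ~44.8 MB in float64, half in float32

def project_spectrum(vec_hd: np.ndarray) -> np.ndarray:
    """Project a single HD spectrum vector to d_ld dimensions on the fly."""
    return vec_hd @ projection
```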

I can propose a PR implementing random projection in falcon so that we can evaluate the gain in performance (reducing the number of dimensions should speed up the approximate nearest neighbor search).

bittremieux commented 3 years ago

These are interesting results, thanks!

Something that's not fully obvious from the plots is how well the methods perform at low distances. We're mainly interested in the bottom-left quadrant, but because there are so many dots in the figures, it's hard to see the density in those regions. A density plot on both axes, combined with alpha values for the points, could help visualize that a bit better.
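Something along these lines might do (a rough sketch; dist_hd and dist_ld stand for the flattened pairwise HD and LD distances, with random placeholders here so the snippet runs on its own):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholders for the flattened pairwise HD and LD cosine distances.
rng = np.random.default_rng(0)
dist_hd = rng.uniform(0, 1, 10_000)
dist_ld = np.clip(dist_hd + rng.normal(0, 0.05, 10_000), 0, 1.2)

# Transparent points reveal the density in the crowded bottom-left region,
# and the marginal histograms act as density plots on both axes.
g = sns.jointplot(x=dist_hd, y=dist_ld, kind="scatter", s=2, alpha=0.05)
g.set_axis_labels("HD cosine distance", "LD cosine distance")
plt.show()
```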

However, maybe it's not surprising that random projections are very competitive. Gaussian projections use the full dimensionality, whereas the "real" dimensionality when using feature hashing is just the number of peaks after preprocessing (50 by default); the larger vector dimensionality is only used to minimize hash collisions rather than to store more information.
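To make that intuition concrete, a stripped-down sketch of feature hashing over the peak bins (not falcon's actual code; the hash function and dimensionality are simplified):

```python
import numpy as np

def hash_spectrum(bin_indices, intensities, dim=800):
    """Hash at most ~50 non-zero peak bins into a dim-dimensional vector.

    A larger dim mainly lowers the probability that two peaks collide in
    the same bucket; it does not encode additional information.
    """
    vec = np.zeros(dim, dtype=np.float32)
    for idx, intensity in zip(bin_indices, intensities):
        # A real implementation would use a proper hash (e.g., MurmurHash).
        vec[hash(idx) % dim] += intensity
    return vec
```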

With a lower dimensionality, (approximate) nearest neighbor searching will probably be (i) faster and (ii) more accurate (the curse of dimensionality gets worse as dimensionality increases). Given the performance benefits you have shown, it could be useful to first simply substitute random projections for feature hashing and see how that impacts the cluster quality. If that comparison is positive, we can investigate how to make everything as efficient as possible. Both the time and memory consumption of random projections are higher than for feature hashing, so we have to be mindful of that.