bittremieux / falcon

Large-scale tandem mass spectrum clustering using fast nearest neighbor searching.
BSD 3-Clause "New" or "Revised" License

Investigating splitting #14

Open mwang87 opened 2 years ago

mwang87 commented 2 years ago

Sometimes spectra with the same precursor m/z and very similar MS/MS do not end up in the same cluster, even with high EPS values. One example here:

Falcon Clustering Networking

We can see repetitions of m/z 327 here:

network_5fae3956b11346e4b120352b735d54b3_73

Specifically, we can see the repetitions in the clustering here:

Link

Just one example, with two clusters:

mzspec:GNPS:TASK-48f893dc8a4147e59798910e6c866ce2-workflow_results/clustered_result.mgf:scan:327
mzspec:GNPS:TASK-48f893dc8a4147e59798910e6c866ce2-workflow_results/clustered_result.mgf:scan:326

image

mwang87 commented 2 years ago

Here is a clustering at EPS 0.5.

https://proteomics2.ucsd.edu/ProteoSAFe/result.jsp?task=690714c8c2434ab3ad76c6323bd0c4bd&view=view_results#%7B%22main._dyn_%23precursor_mz_lowerinput%22%3A%22327%22%2C%22main._dyn_%23precursor_mz_upperinput%22%3A%22328%22%7D

Some examples:

mzspec:GNPS:TASK-690714c8c2434ab3ad76c6323bd0c4bd-workflow_results/clustered_result.mgf:scan:362
mzspec:GNPS:TASK-690714c8c2434ab3ad76c6323bd0c4bd-workflow_results/clustered_result.mgf:scan:363

image

mwang87 commented 2 years ago

Link to the file to be clustered: Link.

bittremieux commented 2 years ago

I need to dig a bit deeper into the pairwise distance matrix to see the hashed vector similarities and figure out why the spectra might not be clustered together.
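A minimal sketch of that kind of pairwise check, assuming the spectra are compared by cosine distance against the EPS threshold. The function names, the bin size, and the m/z range are hypothetical, and the vectors here are plain binned vectors rather than Falcon's hashed vectors, so this only illustrates the idea and is not Falcon's implementation:

```python
import numpy as np


def bin_spectrum(mz, intensity, min_mz=100.0, max_mz=1500.0, bin_size=0.05):
    """Bin peaks into a dense unit-norm vector (illustrative, no hashing)."""
    n_bins = int(np.ceil((max_mz - min_mz) / bin_size))
    vec = np.zeros(n_bins)
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Keep only peaks inside the assumed m/z range.
    mask = (mz >= min_mz) & (mz < max_mz)
    idx = ((mz[mask] - min_mz) / bin_size).astype(int)
    idx = np.minimum(idx, n_bins - 1)  # guard against floating-point edge cases
    # Sum intensities of peaks that fall into the same bin.
    np.add.at(vec, idx, intensity[mask])
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def cosine_distance(u, v):
    """Cosine distance between two unit-norm vectors."""
    return 1.0 - float(np.dot(u, v))


# Toy spectra (made-up peaks, not taken from the linked task).
a = bin_spectrum([150.02, 200.12, 327.22], [10, 50, 100])
b = bin_spectrum([150.03, 200.13, 327.23], [12, 48, 100])  # same bins as a
c = bin_spectrum([150.06, 200.12, 327.22], [10, 50, 100])  # one peak crosses a bin edge

eps = 0.1  # hypothetical threshold
for name, other in (("b", b), ("c", c)):
    d = cosine_distance(a, other)
    print(f"a vs {name}: distance={d:.4f}, within eps={d <= eps}")
```

In this toy case a small m/z shift that moves one peak across a bin boundary increases the distance, which is one possible reason near-identical spectra could drift apart; whether that explains the stragglers here would still need to be checked on the real data.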

Looking at the overview here though, the results don't look that bad. The first cluster contains 1133 spectra, and then there are just a few stragglers spread over a few very small clusters. So to a large extent the spectra are grouped in a single large cluster.
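As a rough way to quantify "one large cluster plus a few stragglers", the cluster sizes within a precursor m/z window can be summarized as below. This is only a sketch: `cluster_size_summary` is a hypothetical helper, the labels are toy values, and treating `-1` as "unclustered" is an assumption borrowed from the usual DBSCAN convention.

```python
from collections import Counter


def cluster_size_summary(labels, noise_label=-1):
    """Summarize how concentrated the spectra are in the largest cluster."""
    # Count spectra per cluster, ignoring unclustered (noise) spectra.
    sizes = Counter(label for label in labels if label != noise_label)
    total = sum(sizes.values())
    largest = sizes.most_common(1)[0][1] if sizes else 0
    return {
        "n_clusters": len(sizes),
        "largest": largest,
        "fraction_in_largest": largest / total if total else 0.0,
    }


# Toy labels: one dominant cluster, two small ones, one unclustered spectrum.
toy_labels = [0] * 5 + [1, 2, -1]
print(cluster_size_summary(toy_labels))
```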

Ideally we want all similar spectra clustered together though, so I'll have to look at the data in more detail.

mwang87 commented 2 years ago

Yeah, I agree that Falcon is doing a pretty good job, especially relative to how MSCluster performed. I just thought it would be interesting to investigate these stragglers, as they make a qualitatively big difference in how the networks are arranged. If this is difficult, I can definitely think about cleaning it up on my end with a hybrid solution with Falcon.