bittremieux / falcon

Large-scale tandem mass spectrum clustering using fast nearest neighbor searching.
BSD 3-Clause "New" or "Revised" License

MS2 spectra without a precursor charge are ignored #26

Open YasinEl opened 2 months ago

YasinEl commented 2 months ago

Hello, I was trying to cluster mzXML files from https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=88a7dfeeecb74131a6d6bfb7a9db0a46 in WSL (Ubuntu 22.04), but falcon does not seem to recognize any spectra. My command and the resulting output are below:

falcon BAX89_BA1_01_23240.mzXML falcon
2024-05-04 18:26:51,147 INFO [falcon/MainProcess] falcon.main : falcon version 0.1.3
2024-05-04 18:26:51,147 DEBUG [falcon/MainProcess] falcon.main : work_dir = None
2024-05-04 18:26:51,147 DEBUG [falcon/MainProcess] falcon.main : overwrite = False
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : export_representatives = True
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : usi_pxd = USI000000
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : precursor_tol = 20.00 ppm
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : rt_tol = None
2024-05-04 18:26:51,148 DEBUG [falcon/MainProcess] falcon.main : fragment_tol = 0.05
2024-05-04 18:26:51,149 DEBUG [falcon/MainProcess] falcon.main : eps = 0.100
2024-05-04 18:26:51,149 DEBUG [falcon/MainProcess] falcon.main : min_samples = 2
2024-05-04 18:26:51,149 DEBUG [falcon/MainProcess] falcon.main : mz_interval = 1
2024-05-04 18:26:51,149 DEBUG [falcon/MainProcess] falcon.main : hash_len = 800
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : n_neighbors = 64
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : n_neighbors_ann = 128
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : batch_size = 65536
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : n_probe = 32
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : min_peaks = 5
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : min_mz_range = 10.00
2024-05-04 18:26:51,150 DEBUG [falcon/MainProcess] falcon.main : min_mz = 40.00
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : max_mz = 1500.00
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : remove_precursor_tol = 1.50
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : min_intensity = 0.01
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : max_peaks_used = 50
2024-05-04 18:26:51,151 DEBUG [falcon/MainProcess] falcon.main : scaling = off
2024-05-04 18:26:51,156 INFO [falcon/MainProcess] falcon._prepare_spectra : Read spectra from 1 peak file(s)
2024-05-04 18:27:02,645 DEBUG [falcon/MainProcess] falcon._prepare_spectra : 0 spectra written to 0 buckets by precursor charge and precursor m/z
2024-05-04 18:27:02,655 ERROR [falcon/MainProcess] falcon.main : No valid spectra found for clustering

I also tried converting the mzXML to mzML and MGF with ProteoWizard 3.0.24124, but that did not solve the issue. I have confirmed that the files do indeed contain MS2 spectra.

Thank you for the support!

Janne98 commented 2 months ago

Hi! Have you checked if there's a charge specified in the files? The current version of falcon discards spectra with a missing charge value.
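One quick way to check this on an MGF export is to count how many spectra carry a `CHARGE=` line. The snippet below is only a minimal sketch using the standard library: the function name `count_mgf_charges` and the example records are made up for illustration, and it assumes the usual MGF layout of `BEGIN IONS` / `END IONS` blocks with one optional `CHARGE=` line per spectrum.

```python
from collections import Counter


def count_mgf_charges(lines):
    """Count spectra per CHARGE value in MGF text.

    The key None collects spectra that have no CHARGE line at all.
    """
    counts = Counter()
    in_spectrum = False
    charge = None
    for line in lines:
        line = line.strip()
        if line == "BEGIN IONS":
            # Start of a new spectrum block: reset the charge seen so far.
            in_spectrum, charge = True, None
        elif line == "END IONS":
            if in_spectrum:
                counts[charge] += 1
            in_spectrum = False
        elif in_spectrum and line.upper().startswith("CHARGE="):
            # Keep the raw value (e.g. "1+"); an empty value counts as missing.
            charge = line.split("=", 1)[1].strip() or None
    return counts


# Made-up example: one spectrum with a charge, one without.
example = """BEGIN IONS
PEPMASS=301.14
CHARGE=1+
100.0 10
END IONS
BEGIN IONS
PEPMASS=415.21
120.0 5
END IONS
""".splitlines()

print(count_mgf_charges(example))
```

For a real file, an open file handle can be passed instead of `example`, since the function only iterates over lines. If all spectra end up under the `None` key, the file carries no charge information.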

YasinEl commented 2 months ago

Hey, yes, it seems that no charge is reported. Unfortunately, that's common in public metabolomics data. Are you using the charge information in some way, or is it basically used as a noise filter? Thanks!

bittremieux commented 2 months ago

We're mainly using the charge to split the spectra into charge-disjoint groups, so that spectra with different charge states are not clustered together. This is more relevant for proteomics data, of course, where you'll encounter a wider range of charge states.
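As a rough illustration of that idea (not falcon's actual code), the sketch below assigns spectra to buckets keyed by precursor charge and a precursor m/z bin, and skips spectra without a charge. The function name `bucket_spectra`, the tuple layout, and the simple floor-based binning are assumptions made for this example; only the `mz_interval = 1` value is taken from the log above.

```python
import math
from collections import defaultdict


def bucket_spectra(spectra, mz_interval=1.0):
    """Group spectrum ids by (charge, precursor m/z bin).

    `spectra` is an iterable of (spectrum_id, precursor_mz, charge) tuples,
    where charge may be None. Spectra without a charge are skipped here,
    mirroring the behaviour described in this thread.
    """
    buckets = defaultdict(list)
    n_skipped = 0
    for spec_id, precursor_mz, charge in spectra:
        if charge is None:
            n_skipped += 1
            continue
        # Spectra with different charges can never share a bucket key.
        key = (charge, math.floor(precursor_mz / mz_interval))
        buckets[key].append(spec_id)
    return dict(buckets), n_skipped


# Made-up spectra: s1 and s2 share a bucket, s3 differs in charge,
# and s4 has no charge and is therefore skipped.
spectra = [
    ("s1", 500.26, 2),
    ("s2", 500.27, 2),
    ("s3", 500.26, 3),
    ("s4", 301.14, None),
]
buckets, n_skipped = bucket_spectra(spectra)
print(buckets, n_skipped)
```

With a file in which no spectrum reports a charge, every spectrum takes the skip branch, which would be consistent with the "0 spectra written to 0 buckets" message in the log.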

We can look into generalizing the code a bit so that the precursor charge is no longer mandatory.
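One conceivable way to relax the requirement, shown here purely as a hypothetical sketch and not as the maintainers' plan, is to place spectra with a missing charge in their own group instead of dropping them. The sentinel `UNKNOWN_CHARGE` and the function `split_by_charge` are invented names for this example.

```python
from collections import defaultdict

# Hypothetical sentinel for spectra whose precursor charge is not reported.
UNKNOWN_CHARGE = 0


def split_by_charge(spectra):
    """Split (spectrum_id, charge) pairs into charge-disjoint groups.

    Spectra with charge None are kept and grouped under UNKNOWN_CHARGE
    rather than being discarded.
    """
    groups = defaultdict(list)
    for spec_id, charge in spectra:
        key = charge if charge is not None else UNKNOWN_CHARGE
        groups[key].append(spec_id)
    return dict(groups)


print(split_by_charge([("a", 2), ("b", None), ("c", None)]))
```

The groups stay charge-disjoint for spectra that do report a charge, while data without charge information would still be clustered within a single group.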