524D / compareMS2

Compare samples by MS2 spectra
MIT License

Major changes to metrics, inputs and intermediary formats #18

Closed magnuspalmblad closed 2 years ago

magnuspalmblad commented 2 years ago

I have updated compareMS2 to take additional inputs: the distance metric to use (0 = original asymmetric, 1 = original symmetric, 2 = the metric in compareMS2 2.0), the scaling power (e.g. 0.5 = square root, the default so far) and the noise threshold. All metrics can be derived from the distribution, but this depends on the input parameters.

Should we still have the option in compareMS2_to_distance_matrices to recalculate the distances? I would say yes, as this is very fast compared to rerunning compareMS2.

If the asymmetric "metric" is used, then compareMS2 should do all comparisons in both directions (i.e. A with B and B with A), except self-comparisons. For this metric, compareMS2_to_distance_matrices should compute both the upper and lower triangular elements of the distance matrix.

For the other, symmetric, metrics, compareMS2 should only do half the comparisons, and compareMS2_to_distance_matrices should compute only one triangular distance matrix.
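
A minimal Python sketch of the pairing rule described above, for illustration only (compareMS2 itself is not written this way). The function name `comparison_pairs` and the constant `ASYMMETRIC_METRIC` are hypothetical; the metric codes are the ones listed in this issue.

```python
from itertools import combinations, permutations

# Metric codes as listed in this issue: 0 is asymmetric, 1 and 2 are symmetric.
ASYMMETRIC_METRIC = 0


def comparison_pairs(datasets, metric):
    """Return the (a, b) dataset pairs that need a compareMS2 run (hypothetical helper)."""
    if metric == ASYMMETRIC_METRIC:
        # Both directions are needed (A with B and B with A); no self-comparisons.
        return list(permutations(datasets, 2))
    # Symmetric metrics: one comparison per unordered pair is enough.
    return list(combinations(datasets, 2))
```

For n datasets this gives n(n-1) runs for metric 0 and n(n-1)/2 runs for metrics 1 and 2, which is where the "half the comparisons" comes from.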

If a user asks compareMS2_to_distance_matrices to use metric 0 for compareMS2 runs done with one of the symmetric metrics, then a warning message should be printed to the command line, and a single triangular matrix generated. The GUI could ask the user if they want to run the additional compareMS2 comparisons for the asymmetric distances.
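
The fallback could be sketched roughly as follows in Python. This is only an illustration of the rule, not the actual compareMS2_to_distance_matrices code: the function name `matrix_layout`, the return values, the warning text and the choice of stderr are all assumptions.

```python
import sys


def matrix_layout(requested_metric, run_metric, stream=sys.stderr):
    """Decide whether a full or a single triangular matrix can be produced.

    requested_metric: metric the user asks for when building the matrix.
    run_metric: metric the compareMS2 runs were done with (set_metric).
    """
    if requested_metric == 0 and run_metric != 0:
        # Only half the comparisons exist, so the asymmetric distances
        # cannot be filled in for both triangles.
        print(
            "warning: metric 0 requested, but the compareMS2 runs used a "
            "symmetric metric; generating a single triangular matrix",
            file=stream,
        )
        return "triangular"
    # Asymmetric runs hold both directions; symmetric metrics need one triangle.
    return "full" if requested_metric == 0 else "triangular"
```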

I have modified the output of compareMS2 to read like this:

dataset_1       100222.LC4.IT4.XX.S01325.Ho_sa_1-B,4_01_326.mgf
dataset_2       100222.LC4.IT4.XX.S01340.Pa_tr_1-D,3_01_355.mgf
set_distance    0.0776331559
set_metric      2
scan_range      1       1000000
max_scan_diff   50.00000
max_m/z_diff    0.20000
scaling_power   0.50000
noise_threshold 10.00000
dataset_1_QC    3997.0000
dataset_2_QC    3987.0000
n_gt_cutoff     620
n_comparisons   14731
histogram       -1.000  -0.990  -0.995  0
histogram       -0.990  -0.980  -0.985  0
...
histogram       0.980   0.990   0.985   2
histogram       0.990   1.000   0.995   0

These key-value(s) pairs are hopefully more consistent with the mathematical description in the papers. compareMS2_to_distance_matrices needs to be updated to read these intermediary files.
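
As a rough idea of what reading these files involves, here is a minimal Python sketch of a parser for the output shown above. It assumes the fields are separated by whitespace (tabs or spaces) and that the histogram columns are bin lower edge, bin upper edge, bin centre and count; the function name `parse_comparems2_output` is hypothetical, and this is not the actual reader in compareMS2_to_distance_matrices.

```python
INTEGER_KEYS = {"set_metric", "n_gt_cutoff", "n_comparisons"}


def parse_comparems2_output(text):
    """Parse key-value(s) lines into a dict; histogram rows go into a list."""
    result = {"histogram": []}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank lines or lines without a value
        key, values = fields[0], fields[1:]
        if key == "histogram":
            if len(values) != 4:
                continue  # skip malformed histogram rows
            # Assumed columns: lower edge, upper edge, bin centre, count.
            low, high, centre = (float(v) for v in values[:3])
            result["histogram"].append((low, high, centre, int(values[3])))
        elif key in ("dataset_1", "dataset_2"):
            # Keep the file name as text, including any internal spaces.
            result[key] = line.split(None, 1)[1].strip()
        elif key in INTEGER_KEYS:
            result[key] = int(values[0])
        elif key == "scan_range":
            result[key] = tuple(int(v) for v in values)
        else:
            # Remaining keys in the example all carry one numeric value.
            result[key] = float(values[0])
    return result
```

Calling it on the contents of one intermediary file gives, for example, `result["set_metric"]` as an integer and `result["histogram"]` as a list of four-element tuples, which should be enough to recalculate distances without rerunning compareMS2.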

magnuspalmblad commented 2 years ago

Essentially done.