Closed Qiong-Yang closed 3 months ago
Hi Qiong-Yang,
Thanks for your interest. The precise mathematical definition of the transformation that we used is $y=\log_{10}(x+1)/3$. Empirically, we found that using this preprocessing on the training spectra improves our model's performance on the test spectra, so that's the main reason we include it.
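For reference, here is a minimal NumPy sketch of that formula, not the repository's actual code. The function names `log10over3` and `inverse_log10over3` are hypothetical, and the intensity scale is an assumption: if peaks are scaled so the base peak is 999, the output falls in $[0, 1]$, because $\log_{10}(1000)/3 = 1$.

```python
import numpy as np


def log10over3(intensities):
    """Apply y = log10(x + 1) / 3 elementwise (hypothetical helper name)."""
    x = np.asarray(intensities, dtype=float)
    return np.log10(x + 1.0) / 3.0


def inverse_log10over3(transformed):
    """Undo the transform: x = 10**(3 * y) - 1."""
    y = np.asarray(transformed, dtype=float)
    return np.power(10.0, 3.0 * y) - 1.0


# Assumed example spectrum with the base peak scaled to 999.
spectrum = np.array([0.0, 1.0, 10.0, 999.0])
scaled = log10over3(spectrum)
recovered = inverse_log10over3(scaled)
```

The inverse is useful if predictions are made in the transformed space and need to be mapped back to raw intensities for comparison.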
Intuitively, the log transformation reduces the importance of very large peaks and increases the importance of very small peaks. This might make the model better at predicting low-intensity peaks that would otherwise contribute very little to the loss. It is not uncommon for mass spectra to have low entropy (in other words, to contain only one or two high-intensity peaks), and in these cases it is very easy for the model to get a good score just by predicting the dominant peaks. Log transformation might improve the model's ability to accurately predict the smaller peaks in such spectra by increasing the penalty for missing them.
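To make the intuition concrete, the sketch below compares how much of the total intensity each peak accounts for before and after the transformation. The three-peak spectrum is made up for illustration, `peak_shares` is a hypothetical helper, and "share of total intensity" is only a rough proxy for a peak's influence on the loss, not the loss function actually used.

```python
import numpy as np


def log10over3(intensities):
    """Apply y = log10(x + 1) / 3 elementwise (hypothetical helper name)."""
    x = np.asarray(intensities, dtype=float)
    return np.log10(x + 1.0) / 3.0


def peak_shares(intensities):
    """Return each peak's fraction of the summed intensity."""
    x = np.asarray(intensities, dtype=float)
    return x / x.sum()


# Made-up low-entropy spectrum: one dominant peak and two small ones.
raw = np.array([999.0, 10.0, 1.0])

raw_shares = peak_shares(raw)
log_shares = peak_shares(log10over3(raw))

# Before the transform the dominant peak carries almost all of the total;
# afterwards the two small peaks account for a noticeably larger fraction.
print("raw shares:", np.round(raw_shares, 3))
print("log shares:", np.round(log_shares, 3))
```

In this toy case a model that predicts only the dominant peak matches nearly all of the raw intensity, but a much smaller portion of the transformed intensity, which is the sense in which missing small peaks is penalized more.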
We were inspired by Zhu et al. to apply this transformation. I recommend giving their paper a read if you haven't already.
May I ask why we need to perform log10over3 scaling on the spectrum signal?