Closed (itaha01 closed 6 months ago)
Hi itaha01,
Are you trying to train a model on your own MS/MS data, or make predictions using a pretrained model?
If it's the former, there isn't really code in the repo supporting this use case. You could pretty easily modify the code to support this though.
If it's the latter, we do provide weights for a model trained on open-source data from MoNA (MassBank of North America); however, its performance will be a lot worse than what we report in the paper. This use case is demonstrated in the README (section "Using a Pretrained Model to Make Predictions"). NIST prohibits the release of parameters for models trained on its spectral libraries, so unfortunately we cannot release the checkpoints that were used in our experiments.
Hi, and thanks for your quick reply! Yes indeed, I am trying to train a model on my own MS/MS data. I was mostly wondering what requirements the input data has to meet on my side. My goal is to predict UV spectra given SMILES and intensities for specific wavelengths.
I had a look at your repo, noticed the use of MoNA and NIST, and was wondering whether my data has to be formatted like those.
Best regards, itaha01
I'm not sure how similar UV-Vis spectra are to MS/MS spectra. For example, with modern instruments MS/MS spectra tend to have really sharp peaks, which makes a centroided representation more reasonable. This representation might not be suitable for UV-Vis.
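To make the contrast concrete, here is a minimal sketch in Python. All values are made up for illustration, and the Gaussian band is only an assumed stand-in for a broad UV-Vis absorption feature; none of this comes from the repo. It compares a centroided peak list with a densely sampled curve.

```python
import math

# Centroided MS/MS spectrum: a short list of (m/z, intensity) peaks.
# The values are invented for illustration only.
msms_peaks = [(91.05, 0.40), (119.05, 1.00), (147.04, 0.25)]


def gaussian_band(wavelengths, center, width, height):
    """Sample a single Gaussian band on a wavelength grid (hypothetical helper)."""
    return [
        height * math.exp(-0.5 * ((w - center) / width) ** 2)
        for w in wavelengths
    ]


def count_significant(values, threshold=0.01):
    """Count the grid points whose intensity is at or above the threshold."""
    return sum(1 for v in values if v >= threshold)


# UV-Vis-like curve: one broad band sampled every 1 nm from 200 to 500 nm.
wavelengths = list(range(200, 501))
uv_curve = gaussian_band(wavelengths, center=280.0, width=20.0, height=1.0)

n_msms = len(msms_peaks)            # a handful of sharp peaks
n_uv = count_significant(uv_curve)  # many non-negligible grid points

print(f"MS/MS peaks: {n_msms}, significant UV-Vis grid points: {n_uv}")
```

A single broad band already spans far more grid points than the whole peak list has entries, which is why a dense vector over a fixed wavelength grid may be a more natural target for UV-Vis than a list of centroids.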
I would look at projects that are about UV-Vis prediction, for example this one (just the first one that I found on Google).
I see, thank you very much for your expert input!
Hi there! I was wondering a bit more about the custom dataset part. I am currently working with molecules and the intensities they emit at certain wavelengths. At my disposal, I have the SMILES, wavelengths, and intensities. I have looked at other models that represent the input data as a .csv file, where the header is, for instance, smiles,200,201,...,500 and each row starts with a SMILES string followed by the intensities at the corresponding wavelengths.
My question is whether something similar could be implemented here, or how one could otherwise use personal data. This is mostly because I lack access to NIST data and am interested in applying the model to my own data.
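For illustration, a CSV in the layout described above could be read with a small Python sketch like the following. The function name `read_spectra_csv` and the example rows are hypothetical, and this is not how the repo loads data; it only shows the header-plus-rows format using the standard library.

```python
import csv
import io


def read_spectra_csv(handle):
    """Parse a CSV whose header is `smiles,<wl_1>,<wl_2>,...` (hypothetical helper).

    Returns the wavelength grid and a list of (smiles, intensities) pairs.
    """
    reader = csv.reader(handle)
    header = next(reader)
    if header[0].strip().lower() != "smiles":
        raise ValueError("first column must be 'smiles'")
    wavelengths = [float(h) for h in header[1:]]

    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"row for {row[0]!r} has the wrong number of columns")
        smiles = row[0].strip()
        intensities = [float(x) for x in row[1:]]
        records.append((smiles, intensities))
    return wavelengths, records


# Tiny in-memory example with made-up intensities at 200, 201, and 202 nm.
example = io.StringIO(
    "smiles,200,201,202\n"
    "CCO,0.10,0.12,0.15\n"
    "c1ccccc1,0.80,0.75,0.60\n"
)
wavelengths, records = read_spectra_csv(example)
print(wavelengths)
print(records[0])
```

Keeping the wavelength grid separate from the per-molecule intensity vectors makes it easy to check that every row shares the same grid before the data is handed to any model.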