Roestlab / massformer

Tandem Mass Spectrum Prediction with Graph Transformers
BSD 2-Clause "Simplified" License

Custom dataset #5

Closed by itaha01 5 months ago

itaha01 commented 5 months ago

Hi there! I was wondering a bit more about the custom dataset part. I am currently working with molecules and the intensities they emit at certain wavelengths. At my disposal, I have the SMILES, wavelengths, and intensities. I have looked at other models that take their input data as a .csv file, where the header is, for instance, smiles,200,201,...,500 and each row starts with a SMILES string followed by the intensities at the corresponding wavelengths.
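For illustration, the wide .csv layout described above (a `smiles` column followed by one column per wavelength) could be parsed roughly as follows. This is a minimal sketch using only the Python standard library; the function name `load_wide_spectra` and the example values are hypothetical and are not part of the massformer repo.

```python
import csv
import io


def load_wide_spectra(handle):
    """Parse a wide CSV: header `smiles,<wl_1>,...,<wl_n>`, one molecule per row.

    Hypothetical helper for illustration; not part of massformer.
    Returns (wavelengths, records), where each record is (smiles, intensities).
    """
    reader = csv.reader(handle)
    header = next(reader)
    if header[0].strip().lower() != "smiles":
        raise ValueError("first column must be 'smiles'")
    # Every remaining header cell is a wavelength, e.g. 200, 201, ..., 500.
    wavelengths = [float(w) for w in header[1:]]
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                f"row for {row[0]!r} has {len(row) - 1} intensities, "
                f"expected {len(wavelengths)}"
            )
        records.append((row[0], [float(x) for x in row[1:]]))
    return wavelengths, records


# Made-up example data with three wavelengths instead of 200..500.
example = io.StringIO(
    "smiles,200,201,202\n"
    "CCO,0.10,0.20,0.15\n"
    "c1ccccc1,0.80,0.60,0.30\n"
)
wavelengths, records = load_wide_spectra(example)
```

Each record pairs a SMILES string with a fixed-length intensity vector aligned to `wavelengths`, which is the shape most spectrum-regression models expect as a training target.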

My question is whether something similar could be implemented, or how we could use our own data. I ask mostly because I do not have access to NIST data and would like to apply the model to my own data.

adamoyoung commented 5 months ago

Hi itaha01,

Are you trying to train a model on your own MS/MS data, or make predictions using a pretrained model?

If it's the former, there isn't really code in the repo supporting this use case. You could pretty easily modify the code to support this though.

If it's the latter, we do provide weights for a model that is trained on open-source data from MoNA (MassBank of North America); however, the performance will be a lot worse than what we report in the paper. This use case is demonstrated in the README (Section "Using a Pretrained Model to Make Predictions"). NIST prohibits the release of parameters for models trained on its spectral libraries, so unfortunately we cannot release the checkpoints that were used in our experiments.

itaha01 commented 5 months ago

Hi and thanks for your quick reply! Yes indeed, I am trying to train a model on my own MS/MS data. I was mainly wondering what format the input data needs to have on my side. My goal is to predict UV spectra, given SMILES and intensities at specific wavelengths.

I had a look at your repo and noticed that it uses MoNA and NIST, and was wondering whether my data has to be formatted like those datasets.

Best regards, itaha01

adamoyoung commented 5 months ago

I'm not sure how similar UV-Vis spectra are to MS/MS spectra. For example, with modern instruments MS/MS spectra tend to have really sharp peaks, which makes a centroided representation more reasonable. This representation might not be suitable for UV-Vis.
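To make the distinction concrete: a centroided MS/MS spectrum is a short list of (m/z, intensity) peaks, which is commonly converted to a fixed-length vector by binning, whereas a UV-Vis spectrum is already a dense curve sampled on a wavelength grid. Below is a minimal sketch of such binning. The function name, bin width, and m/z range are illustrative assumptions, not massformer's actual preprocessing.

```python
def bin_centroided_peaks(peaks, max_mz=1000.0, bin_width=1.0):
    """Turn a list of (m/z, intensity) peaks into a fixed-length vector.

    Hypothetical helper for illustration; max_mz and bin_width are
    assumed values, not settings taken from massformer.
    """
    num_bins = int(max_mz / bin_width)
    vector = [0.0] * num_bins
    for mz, intensity in peaks:
        if mz < 0:
            continue  # ignore invalid m/z values
        index = int(mz / bin_width)
        if index < num_bins:
            vector[index] += intensity  # peaks sharing a bin are summed
    return vector


# Made-up centroided spectrum: three sharp peaks, two of them in the same bin.
peaks = [(41.04, 30.0), (91.05, 100.0), (91.45, 5.0)]
binned = bin_centroided_peaks(peaks, max_mz=100.0, bin_width=1.0)
```

The resulting vector is almost entirely zeros, which suits sharp MS/MS peaks. A UV-Vis absorbance curve has broad bands with nonzero values at nearly every wavelength, so a dense per-wavelength target is likely the more natural representation there.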

I would look at projects that are about UV-Vis prediction, for example this one (just the first one that I found on Google).

itaha01 commented 5 months ago

I see, thank you very much for your expert input!