Roestlab / massformer

Tandem Mass Spectrum Prediction with Graph Transformers

How to label the fragment type of NIST20? #1

Closed by JosieHong 11 months ago

JosieHong commented 2 years ago

Hi,

Thanks for the great work! When I tried to run it on my server, I ran into a problem splitting the data into 'HCD' and 'CID'.

Following the instructions in the README, I exported lr_msms_nist and hr_msms_nist from NIST20. However, after running parse_and_export.py, their fragment type (frag_mode) is NaN. How did you label the fragment type?

Thanks, Josie

adamoyoung commented 11 months ago

Hi Josie,

Sorry for the super (super) late reply, I somehow missed this issue.

When we processed the data, we only used spectra from the hr_nist_msms (high resolution) partition. I haven't tested the lr_nist_msms (low resolution) partition recently, so our preprocessing script might not work properly on those data.

However, looking at the lr_nist_msms.MSP file that I have, the records do not seem to include the "Frag_mode" metadata entries, which would explain your NaNs. Based on the context of this data, I think it's safe to assume they are all "CID". That said, I do think there is a high degree of overlap between the low resolution and high resolution libraries, so I'm not sure how useful it will be to include these spectra in the training data.
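For anyone who wants to check their own export, here is a minimal sketch of how one might scan an MSP file for records that lack a "Frag_mode" entry and fall back to "CID" for them. It is not the code in parse_and_export.py: the helper name `read_msp_metadata` and the two sample records are made up for illustration, and it assumes the usual MSP layout of "Key: value" lines followed by peak lines, with blank lines between records.

```python
import io


def read_msp_metadata(handle):
    """Yield one dict of 'Key: value' metadata per MSP record.

    Hypothetical helper for illustration only. Peak lines (which start
    with a digit) are skipped; keys are lower-cased.
    """
    record = {}
    for line in handle:
        line = line.strip()
        if not line:
            # A blank line ends the current record.
            if record:
                yield record
                record = {}
            continue
        if ":" in line and not line[0].isdigit():
            key, _, value = line.partition(":")
            record[key.strip().lower()] = value.strip()
    if record:
        yield record


# Made-up sample data: the second record has no Frag_mode entry.
SAMPLE = """Name: Example A
Frag_mode: HCD
Num peaks: 2
41.0 999
55.1 120

Name: Example B
Num peaks: 1
77.0 999
"""

records = list(read_msp_metadata(io.StringIO(SAMPLE)))

# Records without a Frag_mode entry are the ones that end up as NaN.
missing = [r["name"] for r in records if "frag_mode" not in r]
print("records without Frag_mode:", missing)

# Assumption from the comment above: treat unlabeled spectra as CID.
for r in records:
    r.setdefault("frag_mode", "CID")
```

To run this on a real export, replace `io.StringIO(SAMPLE)` with an open file handle for the MSP file. Whether the "CID" fallback is appropriate depends on the assumption above holding for your copy of the library.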

JosieHong commented 11 months ago

I see. I will try hr_nist_msms first. Thanks a lot! ; )