ahmohamed / lipidr

Data Mining and Analysis of Lipidomics datasets in R
https://www.lipidr.org/

Lipid Name Parsing for UCLA Core Mass Spec XLSX Report #22

Closed vastevenson closed 2 years ago

vastevenson commented 2 years ago

When I load the data matrix CSV, lipidr cannot parse many of the lipid names, and I cannot continue because this warning is raised:

> lipidr:::.have_lipids_molecules(expt_df[[1]])
[1] FALSE
> annot <- lipidr::annotate_lipids(expt_df[[1]])
Warning in lipidr::annotate_lipids(expt_df[[1]]) :
  Some lipid names couldn't be parsed because they don't follow the pattern 'CLS xx:x/yy:y' 
    PE O-16:0/16:0, PE O-16:0/16:1, PE O-16:0/18:0, PE O-16:0/18:1, PE O-16:0/18:2, PE O-16:0/18:3, PE O-16:0/20:1, PE O-16:0/20:2, PE O-16:0/20:3, PE O-16:0/20:4, PE O-16:0/22:4, PE O-16:0/22:5, PE O-16:0/22:6, PE O-18:0/16:0, PE O-18:0/16:1, PE O-18:0/18:1, PE O-18:0/18:2, PE O-18:0/18:3, PE O-18:0/20:2, PE O-18:0/20:3, PE O-18:0/20:4, PE O-18:0/22:4, PE O-18:0/22:5, PE O-18:0/22:6, PE P-14:0/18:0, PE P-14:0/18:1, 
PE P-16:0/16:0, PE P-16:0/16:1, PE P-16:0/18:0, PE P-16:0/18:1, PE P-16:0/18:2, PE P-16:0/20:1, PE P-16:0/20:2, PE P-16:0/20:3, PE P-16:0/20:4, PE P-16:0/22:4, PE P-16:0/22:5, PE P-16:0/22:6, PE P-18:0/16:0, PE P-18:0/16:1, PE P-18:0/18:0, PE P-18:0/18:1, PE P-18:0/18:2, PE P-18:0/18:3, PE P-18:0/20:2, PE P-18:0/20:3, PE P-18:0/20:4, PE P-18:0/22:4, PE P-18:0/22:5, PE P-18:0/22:6, PE P-18:1/16:0, PE P-18:1/16:1, PE P-18:1/18:0, PE P-18:1/18:1, PE P-18:1/18:2, PE P-18:1/18:3, PE P-18:1 [... truncated]

The lipid names are coming from UCLA Core's Mass Spec lab, so I think they're somewhat common.

Here's a link to the annot list with the strings of the unreadable lipid names: lipids_annot_list.csv

If I wrote some Python to change the string from 'PE O-16:0/16:0' to 'PE_O- 16:0/16:0', would that allow lipidr to parse the name? How would you recommend I name these lipids so that lipidr can parse them?

Thank you so much for developing this awesome tool! I'm really excited to use it.

ahmohamed commented 2 years ago

Hi @vastevenson, you're right. I was surprised that lipidr didn't recognize "PC O-xx:yy", since it can handle "PC(O-xx:yy)" with parentheses. I'll try to fix that in the future.

For now, you can use a regex (in R, or Python if that's your preference) as follows:

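# Rewrite "PE O-16:0/16:0" as "PEO 16:0/16:0" (same for PC and for P- species).
expt_df[[1]] <- sub("^(PC|PE) ([OP])-", "\\1\\2 ", expt_df[[1]])
# TAGs were not parsed correctly either: drop everything from "-FA" onward.
expt_df[[1]] <- sub("^TAG(.*)-FA.*", "TAG \\1", expt_df[[1]])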

lipidr::annotate_lipids(expt_df[[1]])
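
If you would rather clean the names in Python before loading the CSV into R, the same two substitutions can be sketched with the standard `re` module. The function name `normalize_lipid_name` is hypothetical, and the TAG input shown in the comment is an assumed example of a name with an "-FA" suffix, not one taken from the report.

```python
import re


def normalize_lipid_name(name: str) -> str:
    """Apply the same two substitutions as the R snippet to one lipid name."""
    # "PE O-16:0/16:0" -> "PEO 16:0/16:0" (also PC, and P- species).
    name = re.sub(r"^(PC|PE) ([OP])-", r"\1\2 ", name, count=1)
    # Assumed input shape "TAG52:3-FA16:0" -> "TAG 52:3".
    name = re.sub(r"^TAG(.*)-FA.*", r"TAG \1", name, count=1)
    return name


names = ["PE O-16:0/16:0", "PE P-18:1/18:2", "PC 16:0/18:1"]
cleaned = [normalize_lipid_name(n) for n in names]
```

Names that match neither pattern, such as "PC 16:0/18:1", pass through unchanged.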

vastevenson commented 2 years ago

Hi @ahmohamed,

Thanks for the code snippet. I can confirm this resolves the issue. One question: what will lipidr do if it is given multiple TAGs with the same name (like TAG 52:3)? Will it sum their values for each sample, or should I sum them manually?

Thanks again for your help!

-Vincent

ahmohamed commented 2 years ago

Hi @vastevenson,

lipidr will keep duplicates as-is throughout the workflow. You can make them unique by appending a suffix to each name:

rowData(d)$Molecule <- paste0(rowData(d)$Molecule, " (", rownames(d), ")")

If you need to treat them as one entity, you can probably use summarize_transitions to merge them (taking average or max).
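
If you decide to combine duplicates yourself before importing, a minimal Python sketch could look like the following. It is a separate pre-processing step, not what summarize_transitions does internally; the function name `merge_duplicates` and the `(name, values)` row layout are assumptions made for illustration. Whether summing, averaging, or taking the maximum is appropriate depends on how the duplicate rows arose.

```python
from collections import defaultdict


def merge_duplicates(rows, how="sum"):
    """Combine rows that share a lipid name, sample by sample.

    rows: list of (name, [value for each sample]) pairs (assumed layout).
    how:  "sum", "mean", or "max".
    """
    reducers = {
        "sum": sum,
        "mean": lambda values: sum(values) / len(values),
        "max": max,
    }
    reduce_column = reducers[how]

    grouped = defaultdict(list)
    for name, values in rows:
        grouped[name].append(values)

    # zip(*stack) walks the duplicates one sample (column) at a time.
    return {
        name: [reduce_column(column) for column in zip(*stack)]
        for name, stack in grouped.items()
    }


rows = [
    ("TAG 52:3", [1.0, 2.0]),
    ("TAG 52:3", [3.0, 4.0]),
    ("PC 34:1", [5.0, 6.0]),
]
summed = merge_duplicates(rows, how="sum")
```

Here the two "TAG 52:3" rows collapse into a single entry, while "PC 34:1" is left as it was.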