ahmohamed / lipidr

Data Mining and Analysis of Lipidomics datasets in R
https://www.lipidr.org/

296 non-parsed molecules #21

Closed chloemhall closed 2 years ago

chloemhall commented 2 years ago

Hi, I have many non-parsed molecules. Is there a way to sort through them, or to change the names into a format lipidr can read? For example, "Cer[NS] d36:2", "Cer[NS] d38:0" and "Cer[NS] d38:2" are non-parsed,
but "Cer[NS] d32:1" is read fine…

Thanks, Chloe

ahmohamed commented 2 years ago

Hi Chloe,

Very sorry for the delayed response. Non-parsed molecules are fixed by renaming them. You can do that from R using regex. In your case, the problem is that your class names contain non-alphanumeric characters. You can do the following:

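# Collect the names lipidr could not parse
old_names = non_parsed_molecules(data)
# Replace the literal "[NS]" with "NS" (fixed = TRUE disables regex matching)
new_names = sub("[NS]", "NS", old_names, fixed = TRUE)
data = update_molecule_names(data, old_names, new_names)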
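As a rough illustration of the renaming step, the sketch below applies the same `sub()` call to the three names quoted in the question, using base R only. The contrast with the default regex mode is an addition for clarity and is not part of the original reply.

```r
# Example names taken from the question above
old_names <- c("Cer[NS] d36:2", "Cer[NS] d38:0", "Cer[NS] d38:2")

# fixed = TRUE treats "[NS]" as a literal string, so the brackets are removed
new_names <- sub("[NS]", "NS", old_names, fixed = TRUE)
print(new_names)
# "CerNS d36:2" "CerNS d38:0" "CerNS d38:2"

# Without fixed = TRUE, "[NS]" is a regex character class that matches a
# single "N" or "S", so the brackets are left in place
regex_result <- sub("[NS]", "NS", old_names[1])
print(regex_result)
# "Cer[NSS] d36:2"
```

Printing `new_names` before passing it to `update_molecule_names()` is a cheap way to check that the substitution did what was intended.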

Of course, if you have other non-parsed patterns, you'll need to address them as well. Refer to the R regex documentation, and let me know if you need further help.

Cheers, Ahmed.

chloemhall commented 2 years ago

Hi Ahmed,

Many thanks for your response!! After digging into this more, I don't think the issue can be related to non-alphanumeric characters, as plenty of lipid names with : or - are parsed fine. In addition, if I just change some of the non-parsed examples above to "a", "b", "c" etc., they still remain non-parsed… Do you have any idea what lipidr is looking for in the names, please? Do you, for example, have a standard list of lipids it looks for?

Thanks and sorry to bother you more, best wishes, Chloe

ahmohamed commented 2 years ago

Hi Chloe, this is probably because you're using a single letter as the class name. Since no class names in LipidMaps are a single letter, lipidr doesn't support them. These are the main patterns that lipidr uses to parse the names:

lipidnames_pattern$class <- "([[:alnum:]]{2,15})"
lipidnames_pattern$chain <- "(\\d{1,2}:\\d{1,2})"

As you can see, class names should be 2-15 alphanumeric characters. Chains should be numeric and formatted as xx:yy, with 1-2 digits on each side of the colon.
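To see what these two patterns accept, here is a minimal base R sketch that tests a few candidate class names and chain strings against them. The helper `matches_fully()` and the `^`/`$` anchoring are assumptions made for illustration only. lipidr combines these patterns with others that are not shown in this thread, so a passing check here does not guarantee that a full name will parse.

```r
# Patterns copied from the reply above
class_pattern <- "([[:alnum:]]{2,15})"
chain_pattern <- "(\\d{1,2}:\\d{1,2})"

# Hypothetical helper (not part of lipidr): TRUE when the whole string
# matches the pattern
matches_fully <- function(pattern, x) {
  grepl(paste0("^", pattern, "$"), x)
}

# "Cer[NS]" fails because of the brackets, "a" because it is a single letter
class_ok <- matches_fully(class_pattern, c("CerNS", "Cer[NS]", "a"))
print(class_ok)
# TRUE FALSE FALSE

# Chains need 1-2 digits, a colon, then 1-2 digits
chain_ok <- matches_fully(chain_pattern, c("36:2", "136:2", "36-2"))
print(chain_ok)
# TRUE FALSE FALSE
```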

If it still doesn't work, please post the list of non-parsed molecules here so I can help.

Cheers, Ahmed.

chloemhall commented 2 years ago

Dear Ahmed,

Cannot thank you enough for your kind help with this problem. I believe we have now solved it using the labelling formats you suggested, so thank you again.

Best, Chloe

ahmohamed commented 2 years ago

Glad it worked out in the end.