ahmohamed / lipidr

Data Mining and Analysis of Lipidomics datasets in R
https://www.lipidr.org/

names not following the pattern 'CLS xx:x/yy:y' #51

Closed czhu closed 1 year ago

czhu commented 1 year ago

I have names like PE P-18:1/18:2, TG 42:0-FA14:0, SM d18:1/20:1. How should I convert these? Thanks!

JohnHendrickx commented 1 year ago

I'm having a similar problem. My data uses a "Shorthand Notation" for "SwissLipids Name". Some of the lipid names can be processed by lipidr but others cannot. Examples:

Is there perhaps a source where I can look up the acceptable CLS xx:x/yy:y code for a "SwissLipids Name"? Or could you point out how to correct the codes used here?

Any help greatly appreciated, John Hendrickx

ahmohamed commented 1 year ago

Sincere apologies to both of you for the late reply. See below how to reformat the examples you gave above. You can do it manually or using regex as below.

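l = c("PE P-18:1/18:2", "TG 42:0-FA14:0", "SM d18:1/20:1", "Cer 32:0;O2", "HexCer 34:0;O2", "ST 27:1;O", "LPC O-14:0", "LPE O-14:0", "PC O-16:0/18:1", "PE O-16:0/16:1", "SM 30:0;O2")
# "Cer 32:0;O2" -> "Cer 32:0(O2)": move the oxygen suffix into parentheses
l2 = sub(";(O\\d*)", "(\\1)", l)
# "PC O-16:0/18:1" -> "PCO 16:0/18:1": attach the O- prefix to the class name
l2 = sub(" O-", "O ", l2)
# "PE P-18:1/18:2" -> "PEP 18:1/18:2": attach the P- prefix to the class name
l2 = sub(" P-", "P ", l2)
# "TG 42:0-FA14:0" -> "TG 42:0/14:0": write the FA part as a second chain
l2 = sub("-FA", "/", l2)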

l2
#>  [1] "PEP 18:1/18:2"   "TG 42:0/14:0"    "SM d18:1/20:1"   "Cer 32:0(O2)"   
#>  [5] "HexCer 34:0(O2)" "ST 27:1(O)"      "LPCO 14:0"       "LPEO 14:0"      
#>  [9] "PCO 16:0/18:1"   "PEO 16:0/16:1"   "SM 30:0(O2)"
lipidr::annotate_lipids(l2)
#> # A tibble: 11 × 21
#>    Molecule  clean…¹ ambig not_m…² istd  class…³ chain1   l_1   s_1 chain2   l_2
#>    <chr>     <chr>   <lgl> <lgl>   <lgl> <chr>   <chr>  <int> <int> <chr>  <int>
#>  1 PEP 18:1… PEP 18… FALSE FALSE   FALSE PEP     18:1      18     1 "18:2"    18
#>  2 TG 42:0/… TG 42:… FALSE FALSE   FALSE TG      42:0      42     0 "14:0"    14
#>  3 SM d18:1… SM 18:… FALSE FALSE   FALSE SM      18:1      18     1 "20:1"    20
#>  4 Cer 32:0… Cer 32… FALSE FALSE   FALSE Cer     32:0      32     0 ""        NA
#>  5 HexCer 3… HexCer… FALSE FALSE   FALSE HexCer  34:0      34     0 ""        NA
#>  6 ST 27:1(… ST 27:… FALSE FALSE   FALSE ST      27:1      27     1 ""        NA
#>  7 LPCO 14:0 LPCO 1… FALSE FALSE   FALSE LPCO    14:0      14     0 ""        NA
#>  8 LPEO 14:0 LPEO 1… FALSE FALSE   FALSE LPEO    14:0      14     0 ""        NA
#>  9 PCO 16:0… PCO 16… FALSE FALSE   FALSE PCO     16:0      16     0 "18:1"    18
#> 10 PEO 16:0… PEO 16… FALSE FALSE   FALSE PEO     16:0      16     0 "16:1"    16
#> 11 SM 30:0(… SM 30:… FALSE FALSE   FALSE SM      30:0      30     0 ""        NA
#> # … with 10 more variables: s_2 <int>, chain3 <chr>, l_3 <lgl>, s_3 <lgl>,
#> #   chain4 <chr>, l_4 <lgl>, s_4 <lgl>, total_cl <int>, total_cs <int>,
#> #   Class <chr>, and abbreviated variable names ¹​clean_name, ²​not_matched,
#> #   ³​class_stub

Created on 2023-04-20 with reprex v2.0.2
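
If the same renaming has to be applied to a whole dataset, the substitutions above can be wrapped in a small function. The following is only a sketch: `to_lipidr_names` is a hypothetical helper name, not part of lipidr, and it covers only the four patterns discussed in this thread, so other naming variants may need extra rules. The last step uses the `not_matched` column that `annotate_lipids()` returns (visible in the output above) and runs only if lipidr is installed.

```r
# Hypothetical helper (not part of lipidr): applies the four substitutions
# from the reprex above to a character vector of lipid names.
to_lipidr_names <- function(x) {
  x <- sub(";(O\\d*)", "(\\1)", x)         # "Cer 32:0;O2"    -> "Cer 32:0(O2)"
  x <- sub(" O-", "O ", x, fixed = TRUE)   # "LPC O-14:0"     -> "LPCO 14:0"
  x <- sub(" P-", "P ", x, fixed = TRUE)   # "PE P-18:1/18:2" -> "PEP 18:1/18:2"
  x <- sub("-FA", "/", x, fixed = TRUE)    # "TG 42:0-FA14:0" -> "TG 42:0/14:0"
  x
}

original <- c("PE P-18:1/18:2", "TG 42:0-FA14:0", "Cer 32:0;O2", "LPC O-14:0")
converted <- to_lipidr_names(original)

# Side-by-side table so the renaming can be reviewed by eye.
mapping <- data.frame(original, converted)
print(mapping)

# Optional check, only run when lipidr is installed: list any names that
# annotate_lipids() still flags in its not_matched column.
if (requireNamespace("lipidr", quietly = TRUE)) {
  annotated <- lipidr::annotate_lipids(converted)
  print(annotated$Molecule[annotated$not_matched])
}
```

Keeping the original and converted names together in `mapping` makes it easier to hand the result to someone who can confirm that each converted name still refers to the intended lipid.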

JohnHendrickx commented 1 year ago

Hi Ahmed,

Thanks for your reply! I can confirm that the changes you specified produced valid lipid names that can be processed by lipidr. I've forwarded the information to the scientist I'm working with so he can verify that the values are correct.

ahmohamed commented 1 year ago

Marking as closed. Feel free to reopen if this is still an issue.