Can't make a lipidomics experiment due to lipid names

gdmcdonald commented 3 years ago

While trying to create a lipidomics experiment from a CSV that I have loaded into a tibble,

exp <- as_lipidomics_experiment(exp_df, logged = FALSE, normalized = TRUE)

I keep getting this error:

Error in as_lipidomics_experiment(exp_df, logged = FALSE, normalized = TRUE) : Data frame does not contain valid lipid names. Lipids features should be in rownames or the first column.

My first column is a character vector with names that look like this:

> sample(exp_df$lip_name, 50)

 [1] "TG 16:0/12:0/20:3" "MePC 36:3"         "SM 40:1"           "TG 25:1/18:1/18:3" "LPC 18:2"          "PC 18:4/16:0"     
 [7] "Cer 19:0/24:1"     "PI 16:0/18:1"      "LPC 17:1"          "Cer 19:1/25:0"     "LPC 16:1"          "MePC 37:3"        
[13] "TG 20:0/16:0/18:0" "TG 14:0/20:5/22:6" "PE 18:2/20:5"      "TG 30:0/18:0/18:0" "SM 42:1"           "Hex2Cer 18:1/22:0"
[19] "TG 20:4/22:6/22:6" "LPC 18:3"          "SM 30:0"           "ChE 24:1"          "TG 18:1/18:2/21:1" "SM 38:1"          
[25] "TG 16:1/18:2/18:2" "dMePE 20:2/22:6"   "PC 20:3/18:2"      "PC 14:0/22:6"      "LPC 20:1"          "SM 31:1"          
[31] "MePC 35:1"         "TG 16:0/17:1/18:1" "PC 16:0/20:4"      "LPC 20:1"          "SM 44:1"           "TG 19:1/18:1/18:2"
[37] "TG 16:0/22:1/22:6" "TG 15:0/12:0/22:6" "TG 11:0/15:0/17:1" "TG 16:0/16:1/18:1" "TG 16:0/11:1/20:4" "MePC 34:8"        
[43] "PC 22:5/18:2"      "DG 18:1/18:1"      "PAF 12:1"          "OAHFA 48:1"        "TG 12:0/17:1/18:2" "ST 18:1/22:0"     
[49] "TG 16:0/14:0/18:3" "PC 18:1/24:1"   

What am I doing wrong?

ahmohamed commented 3 years ago

It looks fine to me. First make sure that lip_name is your first column. exp_df[[1]] should give you the character vector of lipid names.
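A quick way to confirm this, sketched on a small made-up data frame. The S1 and S2 sample columns and their values are hypothetical stand-ins for your own data; only lip_name comes from the question.

```r
# Hypothetical stand-in for exp_df: lipid names first, then sample columns.
exp_df <- data.frame(
  lip_name = c("TG 16:0/12:0/20:3", "SM 40:1", "LPC 18:2"),
  S1 = c(1.2, 3.4, 5.6),
  S2 = c(2.1, 4.3, 6.5),
  stringsAsFactors = FALSE
)

# The name of the first column should be the lipid-name column.
first_col_name <- names(exp_df)[1]

# The first column should be a character vector, not a factor or a number.
first_col_is_character <- is.character(exp_df[[1]])

print(first_col_name)
print(first_col_is_character)
print(head(exp_df[[1]]))
```

If the first column turns out to be something else (for example a row index written by the CSV export), reorder the columns so the lipid names come first before calling as_lipidomics_experiment.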

Lipid names compliance is checked with the internal method lipidr:::.have_lipids_molecules, which requires at least 50% of lipid names to be parsed correctly. You can check if lipidr:::.have_lipids_molecules(exp_df[[1]]) returns FALSE.

The last resort is to find which lipid names were not parsed correctly. This can be done with annot <- lipidr::annotate_lipids(exp_df[[1]]), which will give you a warning with the names that were not parsed. It will also return a data.frame with the lipid names and their parsed components. annot %>% filter(not_matched) will give you a list of non-parsed lipids.

If this gives you weird results, let me know, and I can see why lipidr can't parse your dataset.

Cheers.

gdmcdonald commented 3 years ago

Initially, it looks like the same problem as #10: the 4 offending lipid names that do not parse (out of 3916 names, about 0.1%) are all coenzyme Q.

> lipidr:::.have_lipids_molecules(exp_df[[1]])
[1] FALSE

> annot <- lipidr:::annotate_lipids(exp_df[[1]])
Warning message:
In lipidr:::annotate_lipids(exp_df[[1]]) :
  Some lipid names couldn't be parsed because they don't follow the pattern 'CLS xx:x/yy:y' 
    Co Q10, Co Q7, Co Q8, Co Q9

> annot %>% filter(not_matched)
# A tibble: 4 x 21
  Molecule clean_name ambig not_matched istd  class_stub chain1   l_1   s_1 chain2   l_2   s_2 chain3   l_3   s_3 chain4
  <chr>    <fct>      <lgl> <lgl>       <lgl> <chr>      <chr>  <int> <int> <chr>  <int> <int> <chr>  <int> <int> <chr> 
1 Co Q10   Co Q10     FALSE TRUE        FALSE NA         NA        NA    NA NA        NA    NA NA        NA    NA NA    
2 Co Q7    Co Q7      FALSE TRUE        FALSE NA         NA        NA    NA NA        NA    NA NA        NA    NA NA    
3 Co Q8    Co Q8      FALSE TRUE        FALSE NA         NA        NA    NA NA        NA    NA NA        NA    NA NA    
4 Co Q9    Co Q9      FALSE TRUE        FALSE NA         NA        NA    NA NA        NA    NA NA        NA    NA NA    
# … with 5 more variables: l_4 <int>, s_4 <int>, total_cl <int>, total_cs <int>, Class <chr>

OK, so I removed those rows to see whether that helps, but it still fails:

> some_df <- exp_df %>% filter(!grepl("Co Q",lip_name))

> some_exp <- as_lipidomics_experiment(some_df, logged = FALSE, normalized = TRUE)
Error in as_lipidomics_experiment(some_df, logged = FALSE, normalized = TRUE) : 
  Data frame does not contain valid lipid names. Lipids features should be in rownames or the first column.

> lipidr:::.have_lipids_molecules(some_df[[1]])
[1] FALSE

> annot2 <- lipidr:::annotate_lipids(some_df[[1]])
> sample_n(annot2,10)
# A tibble: 10 x 21
   Molecule clean_name ambig not_matched istd  class_stub chain1   l_1   s_1 chain2   l_2   s_2 chain3   l_3   s_3 chain4
   <chr>    <fct>      <lgl> <lgl>       <lgl> <chr>      <chr>  <int> <int> <chr>  <int> <int> <chr>  <int> <int> <chr> 
 1 MePC 38… MePC 38:6  FALSE FALSE       FALSE MePC       38:6      38     6 ""        NA    NA ""        NA    NA ""    
 2 phSM 38… phSM 38:2  FALSE FALSE       FALSE phSM       38:2      38     2 ""        NA    NA ""        NA    NA ""    
 3 TG 11:0… TG 11:0/2… FALSE FALSE       FALSE TG         11:0      11     0 "24:2"    24     2 "24:2"    24     2 ""    
 4 TG 20:0… TG 20:0/1… FALSE FALSE       FALSE TG         20:0      20     0 "10:3"    10     3 "10:3"    10     3 ""    
 5 MePC 29… MePC 29:0  FALSE FALSE       FALSE MePC       29:0      29     0 ""        NA    NA ""        NA    NA ""    
 6 TG 20:5… TG 20:5/1… FALSE FALSE       FALSE TG         20:5      20     5 "14:3"    14     3 "18:2"    18     2 ""    
 7 dMePE 1… dMePE 16:… FALSE FALSE       FALSE dMePE      16:0      16     0 "18:2"    18     2 ""        NA    NA ""    
 8 SM 38:0  SM 38:0    FALSE FALSE       FALSE SM         38:0      38     0 ""        NA    NA ""        NA    NA ""    
 9 TG 18:1… TG 18:1/1… FALSE FALSE       FALSE TG         18:1      18     1 "18:1"    18     1 "22:0"    22     0 ""    
10 TG 16:0… TG 16:0/1… FALSE FALSE       FALSE TG         16:0      16     0 "18:1"    18     1 "20:4"    20     4 ""    
# … with 5 more variables: l_4 <int>, s_4 <int>, total_cl <int>, total_cs <int>, Class <chr>

So now all the lipid names parse just fine, the names are in the first column of the data frame, and it still doesn't recognize them?

Not sure what's wrong here?

ahmohamed commented 3 years ago

Thanks for the info, and sorry you're still having issues. This definitely looks like a bug in lipidr, but I can't reproduce it on my end. It's also different from #10: lipidr tolerates up to 50% non-parsed molecules, so you definitely don't need to remove these lipids for it to work.

A few options here:

1. Check that you are using the latest lipidr version (2.4 or later).
2. Try converting your data to a data.frame with as.data.frame. Tibbles work fine on my end, but just in case they are the cause of the problem.
3. The bug seems to originate from lipidr:::.have_lipids_molecules, which is surprising given that it's a very simple function (https://github.com/ahmohamed/lipidr/blob/master/R/check_files.R#L83). You can try:

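library(lipidr)
mols <- unlist(exp_df[[1]])  # lipid names from the first column
matched <- !annotate_lipids(mols, no_match = "ignore")$not_matched
print(sum(matched))     # number of names that parsed
print(length(matched))  # total number of names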

Put simply, sum(matched) should be at least half of length(matched).
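To see what that threshold means without calling lipidr, here is a rough base-R sketch. The regular expression is only an approximation of the 'CLS xx:x/yy:y' pattern quoted in the warning earlier in this thread; it is an assumption for illustration and not lipidr's actual parser. looks_like_lipid and passes_threshold are hypothetical helper names.

```r
# Approximate the 'CLS xx:x/yy:y' pattern: a class label, a space, then one
# or more chains of the form carbons:double_bonds separated by "/".
# This is an assumption for illustration, not lipidr's real parsing rule.
looks_like_lipid <- function(x) {
  grepl("^[A-Za-z0-9]+ [0-9]+:[0-9]+(/[0-9]+:[0-9]+)*$", x)
}

# TRUE when at least min_fraction of the names match the approximate pattern.
passes_threshold <- function(x, min_fraction = 0.5) {
  mean(looks_like_lipid(x)) >= min_fraction
}

mols <- c("TG 16:0/12:0/20:3", "MePC 36:3", "SM 40:1", "Co Q10", "Co Q9")
matched <- looks_like_lipid(mols)
print(sum(matched))            # names matching the approximate pattern
print(length(matched))         # total names
print(passes_threshold(mols))  # whether at least half matched
```

With a handful of coenzyme Q names among thousands of well-formed ones, a check like this would pass comfortably, which is why a FALSE result on such data points to something other than the names themselves.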

Alternatively, you can email me the molecule list and I'll look into it for you.

Thanks.

gdmcdonald commented 3 years ago

Thanks for your help. Even though I installed lipidr a few days ago, it turns out that BioC won't install the latest version of itself, and therefore of lipidr, unless I'm running R > 4.0. I have now upgraded everything and it finally works. Thanks again.
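For anyone hitting the same error, a small sketch for checking which versions are actually installed before digging into the data. installed_version is a hypothetical helper; the 2.4 cutoff is the one mentioned earlier in this thread.

```r
# Return the installed version of a package as a string, or NA if absent.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

r_version <- as.character(getRversion())
print(r_version)                          # running R version

lipidr_version <- installed_version("lipidr")
print(lipidr_version)                     # lipidr version, or NA if missing

# Compare against the version suggested in this thread (2.4 or later).
if (!is.na(lipidr_version)) {
  print(package_version(lipidr_version) >= "2.4.0")
}
```

If lipidr is older than expected, the resolution in this thread was to upgrade R and Bioconductor first and then reinstall lipidr.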

ahmohamed / lipidr

Can't make a lipidomics experiment due to lipid names #14