YuLab-SMU / MicrobiotaProcess

:microbe: A comprehensive R package for deep mining microbiome
https://www.sciencedirect.com/science/article/pii/S2666675823000164
177 stars 37 forks source link

About mp_rrarefy and mp_cal_rarecurve #90

Open lbwfff opened 1 year ago

lbwfff commented 1 year ago

Hi, I want to analyze the results of metaphlan using MicrobiotaProcess, below is my code:

mpse2 <- mp_import_metaphlan(profile='metaphlan/treated_array.txt', mapfilename='metaphlan/adjmeta.txt')
mpse2
mpse2 %<>% mp_rrarefy() 
mpse2 %<>% mp_cal_rarecurve(.abundance = RareAbundance,chunks = 400) 

Treated_array.txt is a merged metaphlan result. I converted the relative abundance to absolute abundance according to the number of reads. treated_array.txt

I found that the calculation of mp_rrarefy and mp_cal_rarecurve is very slow, mp_rrarefy took me about half an hour, as for mp_cal_rarecurve I waited for two hours and still didn't get the result (so I terminated the program). I don't quite understand why it's taking so much time, and how can I fix it?

Thanks, LeeLee

lbwfff commented 1 year ago

The treated_array.txt file does not seem to be uploaded well, I will upload it again treated_array.txt

xiangpin commented 1 year ago

It seems the absolute abundance is too big. The mp_rrarefy is a wrapped function of vegan::rrarefy to process MPSE object.
I suggest that you can refer to the method (converted the relative abundance to absolute abundance) of curatedMetagenomicData package. the method is round(relativate abundance * number of reads / 100)

> mpse <- mp_import_metaphlan('./treated_array.txt')
> mpse
# A MPSE-tibble (MPSE object) abstraction: 506 × 11
# OTU=46 | Samples=11 | Assays=Abundance | Taxonomy=Kingdom, Phylum, Class, Order, Family, Genus, Speies
   OTU     Sample Abundance taxid Kingdom Phylum Class Order Family Genus Speies
   <chr>   <chr>      <dbl> <chr> <chr>   <chr>  <chr> <chr> <chr>  <chr> <chr>
 1 t__SGB… 14792… 507514084 2|12… k__Bac… p__Fi… c__B… o__L… f__La… g__L… s__La…
 2 t__SGB… 14792… 356832609 2|20… k__Bac… p__Ac… c__A… o__B… f__Bi… g__B… s__Bi…
 3 t__SGB… 14792… 288544185 2|12… k__Bac… p__Fi… c__B… o__L… f__La… g__L… s__La…
 4 t__SGB… 14792… 240335596 2|12… k__Bac… p__Pr… c__B… o__N… f__Ne… g__S… s__Sn…
 5 t__SGB… 14792… 214801099 2|12… k__Bac… p__Fi… c__B… o__L… f__La… g__B… s__Bo…
 6 t__SGB… 14792…  93028717 2|20… k__Bac… p__Ac… c__A… o__B… f__Bi… g__B… s__Bi…
 7 t__SGB… 14792…  90766276 2|12… k__Bac… p__Pr… c__G… o__O… f__Or… g__F… s__Fr…
 8 t__SGB… 14792…  83527737 2|12… k__Bac… p__Fi… c__B… o__L… f__La… g__B… s__Bo…
 9 t__SGB… 14792…  32917664 2|12… k__Bac… p__Pr… c__A… o__H… f__Ba… g__B… s__Ba…
10 t__SGB… 14792…  13227998 2|20… k__Bac… p__Ac… c__A… o__B… f__Bi… g__B… s__Bi…
# ℹ 496 more rows
# ℹ Use `print(n = ...)` to see more rows
> mpse %>% mp_extract_assays(.abundance=Abundance) %>% colSums()
14792_metaphlan 14793_metaphlan 14794_metaphlan 14795_metaphlan 14796_metaphlan
     1925809008      1294574335      2326667790      1975230225      1343337208
14797_metaphlan 14798_metaphlan 14815_metaphlan 14816_metaphlan 14817_metaphlan
     1574659346      1625652607      1664513744      2181800294      1470224228
14818_metaphlan
     2115662871
> mpse %>% dplyr::mutate(Abundance=round(Abundance/100)) %>% mp_rrarefy()
# A MPSE-tibble (MPSE object) abstraction: 506 × 12
# OTU=46 | Samples=11 | Assays=Abundance, RareAbundance | Taxonomy=Kingdom, Phylum, Class, Order, Family, Genus, Speies
   OTU    Sample Abundance RareAbundance taxid Kingdom Phylum Class Order Family
   <chr>  <chr>      <dbl>         <dbl> <chr> <chr>   <chr>  <chr> <chr> <chr>
 1 t__SG… 14792…   5075141       3411974 2|12… k__Bac… p__Fi… c__B… o__L… f__La…
 2 t__SG… 14792…   3568326       2399304 2|20… k__Bac… p__Ac… c__A… o__B… f__Bi…
 3 t__SG… 14792…   2885442       1939746 2|12… k__Bac… p__Fi… c__B… o__L… f__La…
 4 t__SG… 14792…   2403356       1614275 2|12… k__Bac… p__Pr… c__B… o__N… f__Ne…
 5 t__SG… 14792…   2148011       1444179 2|12… k__Bac… p__Fi… c__B… o__L… f__La…
 6 t__SG… 14792…    930287        625491 2|20… k__Bac… p__Ac… c__A… o__B… f__Bi…
 7 t__SG… 14792…    907663        610169 2|12… k__Bac… p__Pr… c__G… o__O… f__Or…
 8 t__SG… 14792…    835277        561343 2|12… k__Bac… p__Fi… c__B… o__L… f__La…
 9 t__SG… 14792…    329177        221201 2|12… k__Bac… p__Pr… c__A… o__H… f__Ba…
10 t__SG… 14792…    132280         88995 2|20… k__Bac… p__Ac… c__A… o__B… f__Bi…
# ℹ 496 more rows
# ℹ 2 more variables: Genus <chr>, Speies <chr>
# ℹ Use `print(n = ...)` to see more rows
lbwfff commented 1 year ago

Yes, this greatly improves the speed of mp_rrarefy, but it seems to be still slow for mp_cal_rarecurve

MEladawi commented 1 year ago

mp_cal_rarecurve took around 8 hours for me. No multiple cores unfortunately.