cafferychen777 / MicrobiomeStat

Track, Analyze, Visualize: Unravel Your Microbiome's Temporal Pattern with MicrobiomeStat
https://www.microbiomestat.wiki/

microbiomestat ts.levels not plotting correctly in generate_taxa_areaplot_long #24

Open carmennns2 opened 6 months ago

carmennns2 commented 6 months ago

Hello,

Thank you for the great tool!

I am trying to visualise the change in taxa across multiple timepoints using generate_taxa_areaplot_long. However, the visualisation is missing (1) some of the timepoints listed in ts.levels and (2) some levels of subject.var (please see the screenshot included).

My initial timepoint is "0", and the following timepoints are "1", "2" , "3" , "5" ,"6" , "7", "8" , "10", "12", "18" ,"24". However not all timepoints are present in all samples, and I think this is where the problem lies. For example, one sample might have months 6, 10, 12, 18, 24 and another has 0, 6, 12, 18, etc.

Here is the binary breakdown of the presence/absence of timepoints in each sample_donation.

sample_donation - each sample which has repeated measurements
sample_id - each sample_donation comes from an individual; not balanced, some individuals may have more sample_donations

sample_id sample_donation 0 1 2 3 5 6 7 8 10 12 18 24
Individual_1 a 0 0 0 0 0 1 0 0 1 1 1 1
Individual_1 b 0 0 0 0 1 0 0 0 1 1 1 1
Individual_2 c 0 0 0 0 1 0 0 0 1 1 1 0
Individual_3 d 0 0 0 1 0 0 0 1 0 1 1 0
Individual_4 e 0 1 0 0 0 0 1 0 0 1 0 0
Individual_4 f 0 1 0 0 0 1 0 0 0 1 1 0
Individual_5 g 1 0 0 0 0 1 0 0 0 1 1 0
Individual_6 h 0 1 0 0 0 1 0 0 0 1 0 0
Individual_6 i 0 1 0 0 0 1 0 0 0 1 0 0
Individual_6 j 0 1 0 0 0 1 0 0 0 1 0 0
Individual_6 k 0 1 0 0 0 1 0 0 0 1 0 0
Individual_6 l 0 1 0 0 0 1 0 0 0 1 0 0
Individual_7 m 0 1 0 0 0 1 0 0 0 1 0 0
Individual_7 n 0 1 0 0 0 1 0 0 0 1 0 0
Individual_7 o 1 0 0 0 0 1 0 0 0 1 0 0
Individual_7 p 0 0 1 0 0 1 0 0 0 1 0 0
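
A presence/absence table like the one above can be rebuilt directly from the sample metadata, which also makes it easy to spot requested timepoints that never occur. This is a minimal base R sketch, assuming the metadata is a data frame with `sample_donation` and `time` columns; the `meta` data frame and its values are a made-up toy, not the real data.

```r
# Toy metadata (hypothetical values) standing in for the real sample metadata
meta <- data.frame(
  sample_donation = c("a", "a", "a", "e", "e", "e"),
  time            = c(6, 10, 12, 1, 7, 12)
)

# Cross-tabulate donation by timepoint; any count > 0 becomes 1 (present)
presence <- (table(meta$sample_donation, meta$time) > 0) * 1
print(presence)

# Timepoints requested in ts.levels that never occur in the data
ts.levels <- c("1", "2", "6", "7", "10", "12")
missing.levels <- setdiff(ts.levels, colnames(presence))
print(missing.levels)
```

Comparing `ts.levels` against the column names of the table is a string comparison, so it also flags labels that differ only in formatting.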

My code is as follows:


data.obj <- mStat_convert_phyloseq_to_data_obj(physeq)  #my microbiomestat data object converted from a phyloseq object

generate_taxa_areaplot_long(
  data.obj = data.obj,
  subject.var = "sample_donation",    #16 levels. each sample_donation has repeating measurements from different months
  time.var ="time",   #numeric, time in months
  group.var = "sample_donation",    #16 levels. each sample_donation has repeating measurements from different months
  strata.var = "sample_id",   #7 levels, each sample_id may have multiple sample_donations (basically sample_donations is nested within sample_id)
  feature.level = "Genus",
  feature.dat.type = "proportion",  
  feature.number = 20,
  t0.level = "0",
  ts.levels = c("1",  "2" , "3" , "5"  ,"6" , "7",  "8" , "10", "12", "18" ,"24"),
  base.size = 12,
  theme.choice = "bw",
  palette = NULL,
  pdf = TRUE,
  file.ann = NULL
)

I expected an areaplot with all the different time (months) on the x-axis for each subject.var

Instead, I see only times 10, 12, 18, or 24 shown. Additionally, only the first 6 sample_donations can be seen. The others are blank.

[Image: test]

Attempted Solutions

I have tried to enter the values into a vector and use ts.levels = later_tp, but it did not work.

later_tp <- c("1", "2", "3", "5", "6", "7", "8", "10", "12", "18", "24")

I have tried to convert the values to numeric, but that also did not work (c(1, 2, 3, 5, 6, 7, 8, 10, 12, 18, 24)).

I tried to save the image wider, as I thought there was not enough space.

If you have any suggestions, please let me know. Thank you so much (:

Wishing you a wonderful 2024 !

cafferychen777 commented 6 months ago

Hello @carmennns2,

Thank you for reaching out and providing detailed information about the issue you're facing with the generate_taxa_areaplot_long function. I have a few observations and suggestions that might help resolve the problem:

  1. Regarding subject.var and group.var Settings: You've set both subject.var and group.var to "sample_donation". Typically, subject.var should represent the individual or unit of observation, which might not necessarily be "sample_donation" unless each "sample_donation" represents a unique donor or subject. It might be worth revisiting this configuration to ensure it aligns with your data structure and analysis goals.

  2. Repeated Measurements of the Same Sample: From your table, it appears that the same sample has been measured at different time points. Could you clarify the rationale behind this? If it's indeed the same sample, why is there a need for sequencing it multiple times at different time points? Understanding this aspect might help in adjusting the analysis approach accordingly.

  3. Possible Issue with PDF Width Setting: One reason for not displaying all the data could be related to the settings for the PDF width in your visualization. You might not have set the pdf.wid parameter, which can affect the output display, especially when dealing with a large number of time points or categories. I recommend adjusting the pdf.wid parameter in the generate_taxa_areaplot_long function and then reattempting the visualization.
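
Whether a column is a suitable subject.var can be checked from the metadata before plotting. The following is a minimal base R sketch, assuming a metadata data frame with `sample_id`, `sample_donation`, and `time` columns; the `meta` values are invented for illustration and the helper names are hypothetical.

```r
# Toy metadata (hypothetical): donations nested within individuals
meta <- data.frame(
  sample_id       = c("Individual_1", "Individual_1", "Individual_2", "Individual_2"),
  sample_donation = c("a", "b", "c", "c"),
  time            = c(6, 5, 5, 10)
)

# Number of distinct individuals each donation maps to (1 means properly nested)
ids.per.donation <- tapply(meta$sample_id, meta$sample_donation,
                           function(x) length(unique(x)))
print(ids.per.donation)

# Number of repeated measurements (rows) per donation
n.per.donation <- table(meta$sample_donation)
print(n.per.donation)
```

If every donation maps to exactly one individual and has more than one row, the unit with repeated measurements is the donation, and the individual sits one level above it.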

Please try these suggestions and let me know if they help in resolving the issue. Your feedback is crucial for improving the tool, and I appreciate your contribution to making it better.

Wishing you success with your analysis and a great year ahead in 2024!

Best regards,

Chen YANG

cafferychen777 commented 6 months ago

Hello again,

Thank you for your response. To further assist you, I want to share an example from my work that might be similar to your situation. This example uses the generate_taxa_areaplot_long function with specific parameters set. Here's how it looks in my code:

# Load necessary libraries
library(MicrobiomeStat)
library(ggh4x)
library(vegan)

# Load example data
data(ecam.obj)

# Generate the taxa area plot
generate_taxa_areaplot_long(
  data.obj = ecam.obj,
  subject.var = "studyid",       # Using 'studyid' as the subject variable
  time.var = "month_num",        # Time variable set to 'month_num'
  group.var = "studyid",         # Grouping by 'studyid'
  strata.var = "antiexposedall", # Stratification variable
  feature.level = c("Class"),    # Feature level set to 'Class'
  feature.dat.type = "proportion", # Data type for features
  feature.number = 20,           # Number of features to display
  t0.level = NULL,               # Initial time level
  ts.levels = NULL,              # Time levels
  base.size = 10,                # Base size for the plot
  theme.choice = "bw",           # Theme choice for the plot
  palette = NULL,                # Palette settings
  pdf = TRUE,                    # Output as PDF
  pdf.wid = 40,                  # PDF width set to 40
  file.ann = NULL                # File annotation
)

In this example, the time points are displayed correctly in the resulting plot. The key aspects to note here are the settings for subject.var, time.var, group.var, and particularly pdf.wid. Adjusting these parameters, especially pdf.wid, could be crucial for ensuring all time points are properly visualized in your plot.

Please try adjusting your parameters similar to this example and see if it resolves the issue with your visualization. If the problem persists, feel free to share more details, and I'll be happy to assist further.

Best regards,

Chen YANG

Attachment: taxa_areaplot_pair_subject_studyid_time_month_num_feature_level_Class_feature_number_20_group_studyid_strata_antiexposedall_avergae.pdf

carmennns2 commented 6 months ago

Thank you @cafferychen777 for your quick response!

I have changed the code to the following but still have some issues.

generate_taxa_areaplot_long(
  data.obj = data.obj,
  subject.var = "sample_id",
  time.var = "Time",
  group.var = "sample_donation",
  strata.var = "sample_id",
  feature.level = "Genus",
  feature.dat.type = "proportion",
  feature.number = 20,
  t0.level = c("0"),
  ts.levels = c("1", "2", "3", "5", "6", "7", "8", "10", "12", "18", "24"),
  base.size = 12,
  theme.choice = "bw",
  palette = NULL,
  pdf = TRUE,
  pdf.wid = 49,
  file.ann = NULL
)

In response to your suggestions/comments:

  1. Regarding subject.var and group.var Settings: I have changed subject.var from "sample_donation" to "sample_id" as "sample_id" is the true variable which represents the unique subject or donor. However, because I still want the plot to be split at the sample_donation level, I left group.var as "sample_donation". I kept strata.var = "sample_id" as I would like it to be split at the "sample_id" level also.
  2. Repeated Measurements of the Same Sample: We used a novel formulation to preserve these samples. We want to assess the stability of the formulation by assessing the change in composition over time. However, due to some issues, we were not able to assess the composition regularly in the beginning of the experiment (the timepoints become more stable later on). So, instead, I would like to see how the composition changes over time as compared to the "first profile" we have.
  3. Possible Issue with PDF Width Setting: I tried many combinations, including pdf.wid = 49. However, nothing changed. Could this be because I am saving with ggsave()?

After changing the subject.var to "sample_id" and pdf.wid = 49, nothing changed. My plot looks exactly the same as it did in the previous post.

Do you have any other suggestions? Thank you so much! Carmen

cafferychen777 commented 6 months ago

Hello Carmen,

Thank you for providing detailed information about your situation. I have two requests that could help me assist you better:

  1. Could you please share the complete metadata associated with your dataset? Having a full view of the metadata might provide more insights into the issue and allow me to understand the data structure and relationships better. This information is crucial for troubleshooting and offering more targeted suggestions.

  2. Regarding the setting of pdf.wid in the function: Normally, setting the pdf.wid parameter directly within the generate_taxa_areaplot_long function should prevent the issue you're experiencing in the visualization. If you're still encountering problems despite setting pdf.wid = 49, it might be related to something else in the function or the data. Could it be possible that the use of ggsave() afterwards is affecting the output? If you could provide more details about how you're using ggsave() and the settings you're applying, that might also help in diagnosing the issue.

Looking forward to your response and more details so that we can further investigate and resolve the visualization issue.

Best regards.

cafferychen777 commented 6 months ago

Hello @carmennns2,

I've thought of a potential solution to the issue you've been experiencing with visualizing the change in taxa across multiple timepoints using the generate_taxa_areaplot_long function. You can try pairing mStat_subset_data with generate_taxa_areaplot_long in a loop to generate a separate area plot for each individual. This approach could provide a more detailed visualization for each subject and might help with the missing timepoints and subject levels.

Best regards.

carmennns2 commented 6 months ago

Hi @cafferychen777,

Thank you for your suggestions. Would you be able to provide me an example of how to use mStat_subset_data in a loop iteration with generate_taxa_areaplot_long? Sorry for the inconvenience.

Thank you so much for your guidance!

All the best, Carmen

cafferychen777 commented 6 months ago

Hi @carmennns2,

Thank you for reaching out with your question. I'm glad to help you with an example of how to use mStat_subset_data in a loop with generate_taxa_areaplot_long. Below is a simple example for you to refer to:

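# Loading the package and the example data
library(MicrobiomeStat)
data(subset_T2D.obj)

# Extracting unique subject IDs (one entry per subject, not one per sample)
unique.subject.id <- unique(subset_T2D.obj$meta.dat$subject_id)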

# Looping over each subject ID
plot.list <- lapply(unique.subject.id, function(subject.id){

  # Identifying sample IDs for the current subject
  sample.ids <- rownames(subset_T2D.obj$meta.dat[subset_T2D.obj$meta.dat$subject_id == subject.id, ])

  # Subsetting data for the current subject
  sub_subset_T2D.obj <- mStat_subset_data(subset_T2D.obj, sample.ids)

  # Generating taxa area plot for the current subset
  generate_taxa_areaplot_long(sub_subset_T2D.obj,
                              subject.var = "subject_id",
                              time.var = "visit_number_num",
                              feature.level = c("Genus"),
                              feature.dat.type = "count",
                              file.ann = subject.id)

})

# Further processing or saving the plots can be done here

In this example, group.var is left at NULL (it is not passed in the call). If you want to group your data by a specific variable, pass the name of that variable as group.var, and the plots will be split by the groups it defines.

Please let me know if you have any further questions or need additional clarification.

All the best, Chen YANG

[Image: Screenshot 2024-01-12 at 15 58 06]

carmennns2 commented 6 months ago

Hi @cafferychen777,

Thank you for your suggestions. I tried mStat_subset_data; however, because I preferred the figures to all be in one plot, I chose to use generate_taxa_areaplot_long without it. I fixed my issue. It was my error. It turns out the factor levels under "Time" were "01", "02", "10", "11", and not "1", "2", "10", "11". The missing "0" was ordering it incorrectly.
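
A mismatch like this can be caught by comparing the requested levels with the labels actually stored in the metadata. Here is a minimal base R sketch with made-up vectors; the names `time.in.data` and `time.clean` are hypothetical and only illustrate the check.

```r
# Hypothetical labels: what the metadata holds vs. what was requested
time.in.data <- c("01", "02", "10", "11")
ts.levels    <- c("1", "2", "10", "11")

# Requested levels with no exact string match in the data
unmatched <- setdiff(ts.levels, time.in.data)
print(unmatched)

# Character sorting is lexicographic, so "10" sorts before "2"
print(sort(ts.levels))

# Converting through numeric drops the zero padding and keeps numeric order
time.clean <- as.character(as.numeric(time.in.data))
print(time.clean)
```

Because the matching is done on strings, "01" and "1" count as different levels even though they denote the same month.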

However, in the end, my other issue was that because they all had different starting times (some started at 0, some at 1, 2, etc), t0.level = c("0") did not work for me.

So in the end, I decided to completely remove the time in numbers and instead used a categorical variable ("Timepoint0", "Timepoint1", "Timepoint2" etc) so the initial timepoint for all of them was "Timepoint0", and that solved my issue.
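
The recoding described above can be sketched in base R as follows. This is only an illustration under assumptions: the `meta` data frame, its values, and the new `timepoint` column are hypothetical, and the ordering is taken per sample_donation from the numeric time.

```r
# Toy metadata (hypothetical values): donations with different starting months
meta <- data.frame(
  sample_donation = c("a", "a", "a", "e", "e", "e"),
  time            = c(10, 6, 12, 1, 12, 7)
)

# Position of each measurement within its own donation, starting at 0
idx <- ave(meta$time, meta$sample_donation,
           FUN = function(x) match(x, sort(unique(x))) - 1)

# Categorical timepoint with explicitly ordered levels
meta$timepoint <- factor(
  paste0("Timepoint", idx),
  levels = paste0("Timepoint", 0:max(idx))
)
print(meta)
```

Setting the levels explicitly keeps "Timepoint10" after "Timepoint9" instead of relying on alphabetical order. Every donation then starts at "Timepoint0", which can be used as t0.level.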

Thank you so much for all of your kind suggestions and great tool. I wish you all the best (:

Carmen

cafferychen777 commented 6 months ago

Dear Carmen,

I'm glad to hear you were able to resolve the issue with the timepoints in generate_taxa_areaplot_long! Converting the numeric timepoints to categorical factors makes sense as a workaround given the inconsistencies in start times across samples.

Thank you for reporting back on the solution. I appreciate you taking the time to provide those details - it's helpful for me to learn where users are running into problems or limitations using MicrobiomeStat.

I'm happy I could provide some initial troubleshooting suggestions. Please feel free to reach out if any other questions come up as you continue your analysis.

Best regards, Chen YANG