Question about filtering features and groups

pig-raffles commented 5 months ago

Hi,

Thanks for creating this excellent package for analysing PICRUSt2 output data. I have been looking for something like this for a while now.

I have run the main analysis pipeline (ggpicrust2()) on a couple of different data sets and have run into some messages that I have questions about.

The first question is about filtering the feature data. I generally get the following warning when running the pipeline:

"In MicrobiomeStat::linda(abundance, LinDA_metadata_df, formula = "~Group_groupnonsense", : Some features have less than 3 nonzero values! They have virtually no statistical power. You may consider filtering them in the analysis!"

Do you have any advice on how best to filter the feature data? Is it just a case of opening the abundance file (TSV format) and editing it to remove these features/functions? For a data set of 10 individuals split into two treatment groups, would I filter out functions that have nonzero values for only 3 or fewer individuals out of the 10?

My second issue is about the number of groups I want to compare. The PICRUSt2 output data I wish to use currently covers 4 different groups, and I would like to filter this down to pairwise comparisons between groups. Would the simplest way of doing this be, again, to edit the abundance file, filtering out any individuals not from the pairwise comparison I wish to make? Would I also need to alter the metadata file?

Finally, I also get the warning:

"In cbind(sample = colnames(sub_relative_abundance_mat), group = Group, : number of rows of result is not a multiple of vector length (arg 2)"

What could be causing this?

Thank you for your time and any help you can offer,

Best wishes,

Alan

cafferychen777 commented 5 months ago

Dear Alan,

Thank you for reaching out and for using ggpicrust2 to analyze your PICRUSt2 output data. I appreciate your detailed questions.

Regarding the warnings you encountered, I'd like to assure you that these are typical in the analysis process and generally do not have a significant impact on the overall results.

For the first warning about feature data filtering, it's common in bioinformatics pipelines to encounter features with only a few nonzero values. While these features have limited statistical power, their presence is a normal part of diverse datasets and doesn't necessarily compromise the analysis. If you wish to filter them, doing so directly in the abundance file (TSV format) is a standard approach. However, it's not always necessary unless they significantly skew your results or you have specific reasons for stringent data curation.
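If editing the TSV by hand is impractical, the filtering can be scripted. The following is a minimal sketch in Python (standard library only), not part of ggpicrust2. It assumes the first column holds the feature ID and every other column is one sample; the function name `filter_sparse_features` and the example data are hypothetical. The default threshold mirrors the wording of the LinDA warning ("less than 3 nonzero values"), so features with at least 3 nonzero values are kept.

```python
import csv
import io


def filter_sparse_features(tsv_text, min_nonzero=3):
    """Return TSV text keeping only features with >= min_nonzero nonzero values.

    Assumption: column 0 is the feature ID, remaining columns are samples.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    kept = [header]
    for row in reader:
        if not row:  # skip blank lines
            continue
        values = [float(v) for v in row[1:]]
        nonzero = sum(1 for v in values if v != 0)
        if nonzero >= min_nonzero:
            kept.append(row)
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(kept)
    return out.getvalue()


# Hypothetical toy table: only K00001 has three nonzero values.
example = (
    "function\tS1\tS2\tS3\tS4\n"
    "K00001\t5\t0\t3\t2\n"
    "K00002\t0\t0\t7\t0\n"
    "K00003\t1\t1\t0\t0\n"
)
filtered = filter_sparse_features(example, min_nonzero=3)
print(filtered)
```

To use it on a real file, read the file contents into a string, pass it through the function, and write the result to a new file so the original PICRUSt2 output stays untouched.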

For the second point regarding group comparisons, editing the abundance file for pairwise comparisons is indeed a straightforward method. It allows you to focus on specific groups of interest. Remember to adjust the metadata file accordingly to ensure consistency between your data and metadata.
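As a rough illustration of keeping the two files consistent, here is a small Python sketch (standard library only, not part of ggpicrust2). It assumes the first metadata column holds the sample IDs and that those IDs match the abundance column names; `subset_to_groups`, the column name `group`, and the group labels A to D are hypothetical.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def subset_to_groups(abundance_tsv, metadata_tsv, group_column, groups):
    """Keep only the samples whose group is in `groups`, in both tables.

    Assumption: metadata column 0 is the sample ID and matches the
    abundance header names.
    """
    metadata = read_tsv(metadata_tsv)
    meta_header, meta_body = metadata[0], metadata[1:]
    group_idx = meta_header.index(group_column)
    kept_meta = [row for row in meta_body if row[group_idx] in groups]
    kept_samples = {row[0] for row in kept_meta}

    abundance = read_tsv(abundance_tsv)
    header = abundance[0]
    # Always keep column 0 (feature ID), then only the selected samples.
    keep_cols = [0] + [i for i in range(1, len(header)) if header[i] in kept_samples]
    kept_abundance = [[row[i] for i in keep_cols] for row in abundance]
    return kept_abundance, [meta_header] + kept_meta


# Hypothetical data with four groups; compare A against B only.
abundance_tsv = (
    "function\tS1\tS2\tS3\tS4\n"
    "K00001\t5\t0\t3\t2\n"
    "K00002\t1\t4\t0\t6\n"
)
metadata_tsv = (
    "sample\tgroup\n"
    "S1\tA\n"
    "S2\tB\n"
    "S3\tC\n"
    "S4\tD\n"
)
pair_abundance, pair_metadata = subset_to_groups(
    abundance_tsv, metadata_tsv, "group", {"A", "B"}
)
print(pair_abundance)
print(pair_metadata)
```

Because both tables are filtered from the same set of sample IDs, the abundance columns and metadata rows cannot drift apart.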

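Lastly, the warning about the number of rows not being a multiple of the vector length arises when cbind() combines vectors of different lengths: here, the group vector does not have the same length as the number of sample columns, so R recycles the shorter one. It's a common warning in data processing and, in most cases, doesn't critically affect the analysis outcome, though it is worth checking that the samples in the abundance table and the metadata match.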
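One quick way to look for such a dimension mismatch is to compare the sample names in the abundance header with the sample IDs in the metadata. The sketch below is in Python (standard library only) and is only a diagnostic aid under the assumption that the first abundance column is the feature ID; `compare_samples` and the example IDs are hypothetical, and a mismatch is one plausible cause of the warning rather than a confirmed one.

```python
def compare_samples(abundance_header, metadata_sample_ids):
    """Return (samples only in abundance, samples only in metadata).

    Assumption: abundance_header[0] is the feature ID column name.
    """
    abundance_samples = set(abundance_header[1:])
    metadata_samples = set(metadata_sample_ids)
    only_in_abundance = sorted(abundance_samples - metadata_samples)
    only_in_metadata = sorted(metadata_samples - abundance_samples)
    return only_in_abundance, only_in_metadata


# Hypothetical example: S4 has abundance data but no metadata row.
header = ["function", "S1", "S2", "S3", "S4"]
meta_ids = ["S1", "S2", "S3"]
only_in_abundance, only_in_metadata = compare_samples(header, meta_ids)
print(only_in_abundance, only_in_metadata)
```

If either list is non-empty, the group vector and the sample columns have different lengths, which is the situation the warning describes.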

In summary, these warnings are part of routine data analysis and do not necessarily indicate a major problem with your analysis or data. Feel free to proceed with your analysis, keeping these points in mind.

Best wishes in your research, and don't hesitate to reach out if you have further questions.

Kind regards,

Chen YANG

pig-raffles commented 3 months ago

Hi Chen,

Thanks for your help. The suggestions you gave worked and now the analysis runs.

Do you have any recommendations for a DA method suitable for smaller data sets (<10 individuals)?

Best wishes,

Alan

cafferychen777 commented 3 months ago

Hi Alan,

I'm glad to hear the suggestions worked and you were able to run the analysis successfully.

For smaller microbiome datasets with fewer than 10 individuals, DESeq2 could be a good differential abundance method to try. It uses shrinkage estimation for dispersion and fold change to improve results for experiments with small numbers of replicates. This helps avoid the high variability or false positives sometimes seen with small sample sizes.

Other options are meta-analysis methods like Fisher's method, which combines P values across studies to gain power. But with very limited samples per group (<5), all methods will struggle. Adding more biological replicates per group is best if feasible.
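For reference, Fisher's method combines k independent P values through the statistic X² = -2 Σ ln(p_i), which follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis. The Python sketch below (standard library only) is an illustration, not something ggpicrust2 provides; the function name `fisher_combined_p` is hypothetical, and the method assumes the input P values come from independent tests.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method.

    Returns (statistic, combined_p). Uses the closed-form survival
    function of the chi-squared distribution with an even number of
    degrees of freedom (2k), so no external library is needed.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one P value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P values must lie in (0, 1]")

    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # P(chi2_{2k} > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Two hypothetical studies, each with p = 0.05 for the same feature.
stat, combined = fisher_combined_p([0.05, 0.05])
print(stat, combined)
```

With a single P value the combined result equals the input, which is a convenient sanity check.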

Let me know if you have any other questions!

Best, Chen

pig-raffles commented 3 months ago

Sorry, one further question.

When using DESeq2, I get the following error message.

"Error in if (num_significant_biomarkers == 0) { : missing value where TRUE/FALSE needed"

As I understand it, this refers to NAs being present in the dataset. What is causing the NAs and how would I best remove them?

Thanks in advance,

Alan

cafferychen777 commented 3 months ago

Hi Alan,

Thank you for your interest in the ggpicrust2 package and for your thoughtful questions.

Regarding the error you encountered with DESeq2 and the missing values, it would be very helpful if you could share your dataset with me. Having access to the actual data you are working with would allow me to investigate the source of the NAs and determine the best approach for handling them. I would be happy to take a look and provide more specific guidance on preprocessing your data to avoid this error.

Feel free to send over your abundance and metadata files, or a representative subset of your data. I will do my best to reproduce the issue and suggest a solution. You can attach the files here on GitHub or send them to my email at cafferychen7850@gmail.com.

Please let me know if you have any other questions! I appreciate you taking the time to report this error and am committed to helping you resolve it.

Best regards, Chen

pig-raffles commented 3 months ago

Hi Chen,

Sorry for the delay. Please find the metadata file (SW_FW_ANT_Tilapia_metadata.txt) and abundance file (SW_FW_Ant_KO_pred_metagenome_unstrat.txt) attached.

Best wishes,

Alan

SW_FW_Ant_KO_pred_metagenome_unstrat.txt

SW_FW_ANT_Tilapia_metadata.txt

pig-raffles commented 2 months ago

Sorry Chen, did you get a chance to look at the files?
