david-barnett / microViz

R package for microbiome data visualization and statistics. Uses phyloseq, vegan and the tidyverse. Docker image available.
https://david-barnett.github.io/microViz/
GNU General Public License v3.0

ps_dedupe removes all rows when using multiple variables #45

Closed avancise closed 2 years ago

avancise commented 2 years ago

Hi David,

Thank you for sharing this package, it has a lot of really helpful functions for working with phyloseq objects! It's made my life a lot easier in the past couple of weeks.

I've come across an issue while using ps_dedupe to prune my dataset. When I dedup by Lab ID (single variable), it seems to work as expected, leaving me with one entry per Lab ID. I then try to dedup by multiple variables (Individual ID and date, i.e. vars = c("Individual.ID", "year", "month", "day")), and it seems to be removing ALL rows from groups with >1 row, rather than leaving me just one row per Individual/date. I'm not sure if this is user error or something in the code, but I wondered if anyone else has had the same issue in the past.

# dedup LabID
ps.sp <- ps_dedupe(ps.sp, "LabID", method = "readcount")
nsamples(ps.sp)
test1 <- sample_data(ps.sp)

# dedup by Ind and date
ps.sp2 <- ps_dedupe(ps.sp, vars = c("Individual.ID", "year", "month", "day"), method = "readcount")
nsamples(ps.sp2)
test2 <- sample_data(ps.sp2)

I can send an .Rdata file with ps.sp separately, in case reproducing the issue will help. Any thoughts on what might be happening would be very appreciated!

Thank you, Amy

david-barnett commented 2 years ago

Hi Amy, I'm glad you're finding microViz helpful 😄

Sending the data would be interesting if you can. My guess is that the problem might involve NAs (which ps_dedupe currently doesn't warn about, oops... I'll fix that regardless).

Are there any NAs in any of these variables in your data? c("Individual.ID", "year", "month", "day")
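One quick way to answer this is to count the NAs per grouping variable. The sketch below uses a toy data frame as a stand-in for the sample data: the column names follow this thread, but the values are invented. For a real phyloseq object, the table could probably be extracted with something like data.frame(sample_data(ps.sp)), which is an assumption about the setup here rather than a tested recipe.

```r
# Toy stand-in for the sample data; values are invented for illustration.
samdat <- data.frame(
  Individual.ID = c("A1", "A1", NA, "B2"),
  year  = c(2019, 2019, 2020, 2020),
  month = c(6, 6, 7, 7),
  day   = c(1, 1, 15, 15)
)

vars <- c("Individual.ID", "year", "month", "day")

# Number of NAs in each grouping variable
na_counts <- colSums(is.na(samdat[, vars]))
print(na_counts)

# Row indices with an NA in at least one grouping variable
rows_with_na <- which(!complete.cases(samdat[, vars]))
print(rows_with_na)
```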

A workaround for now could be to paste together those variables into one, before using ps_dedupe with that one new variable.

test <- ps.sp %>% 
   ps_mutate(uniqueString = paste(Individual.ID, year, month, day)) %>% 
   ps_dedupe(vars = "uniqueString", method = "readcount")

Does something like that give the expected result?
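One caveat worth keeping in mind with the pasting approach, shown here on invented toy values: base R's paste() converts NA to the literal string "NA", so rows with a missing ID but the same date end up sharing a key and would be treated as duplicates of each other. Whether that matters depends on the data. This is only an illustration of paste() behaviour, not of what ps_dedupe does internally.

```r
# Toy values, invented for illustration
Individual.ID <- c("A1", NA, NA)
year <- c(2019, 2020, 2020)

# paste() turns NA into the string "NA"
uniqueString <- paste(Individual.ID, year)
print(uniqueString)

# The two NA rows now share the key "NA 2020", so deduplicating on
# uniqueString would keep only one of them.
print(duplicated(uniqueString))
```

If the NA rows should all be kept, filtering them out before deduplicating, or pasting in a per-row identifier for missing IDs, would be possible ways around this.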

avancise commented 2 years ago

Hi David,

Thank you for getting back to me so quickly! I've attached ps.sp as an .Rdata file if you'd like to play with it. ps.sp is a phyloseq object with a couple manipulations: I've merged all ASVs by species, converted raw read counts to proportional data, and removed any species that did not comprise at least 1% of the proportional read count in at least 5 samples in the dataset.

That's interesting about the NAs, and good to know! There are about 5 NAs in the dataset, in the Individual ID column. They aren't involved in any of the duplicated samples that need to be removed, but I guess could affect the whole dataset? There aren't any NAs in the year, month, or day columns.

After writing you I read through your code for ps_dedupe (which I admit I should have done first), and after understanding your workflow I ended up recreating it for myself. I've copied it below in case it's in any way useful to you, since it's essentially your idea:

# dedup by Ind and date, keep sample with highest read count
Ind.date_keep <- @.***_data)) %>%
  rownames_to_column("SampleID") %>%
  mutate(readcount = phyloseq::sample_sums(ps.raw)) %>%
  group_by(Individual.ID, year, month, day) %>%
  slice_max(readcount) %>%
  ungroup() %>%
  pull(SampleID)

ps.raw <- prune_samples(ps.raw, samples = Ind.date_keep)
nsamples(ps.raw)

I wasn't able to figure out exactly what the issue was that was causing all samples in a group with >1 row to be removed. I did realize that one thing affecting it was that I had previously converted all of my raw read count data to proportions by species, so that for many samples the total "readcount" was the same (i.e. 1). Going back to earlier in my workflow and deduping using a phyloseq object with raw read count data helped to a certain extent, which led me to think that there might be something going awry when multiple samples have the exact same read count. But that's as far as I made it before changing tacks to generate my own code based on your workflow - not sure if that's helpful at all or just a red herring.
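The effect of identical read counts can be seen in a small dplyr sketch. It uses an invented toy table rather than the real data, and it only shows how slice_max() treats ties. It is not a diagnosis of what ps_dedupe does internally. By default slice_max() uses with_ties = TRUE, so tied rows are all kept; with_ties = FALSE returns exactly one row per group.

```r
library(dplyr)

# Toy sample table, invented for illustration. Samples s1 and s2 are
# duplicates of one individual/date and have identical read counts, as
# happens after converting counts to proportions (every total is 1).
samdat <- tibble(
  SampleID      = c("s1", "s2", "s3"),
  Individual.ID = c("A1", "A1", "B2"),
  year          = c(2019, 2019, 2020),
  readcount     = c(1, 1, 1)
)

# Default with_ties = TRUE: tied rows are all kept, so nothing is removed
keep_ties <- samdat %>%
  group_by(Individual.ID, year) %>%
  slice_max(readcount) %>%
  ungroup() %>%
  pull(SampleID)

# with_ties = FALSE: exactly one row per group
keep_one <- samdat %>%
  group_by(Individual.ID, year) %>%
  slice_max(readcount, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  pull(SampleID)

print(keep_ties)
print(keep_one)
```

With tied read counts, the choice of which duplicate to keep is arbitrary either way, so deduplicating on raw counts is likely the more meaningful option.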

So for now I have a workaround, and if no one else is having this issue it might be something special about my data that's affecting the results. I appreciate you looking into it!

Cheers, Amy

<)))>< <)))>< <)))>< <)))>< <)))>< <)))>< <)))>< <)))>< Amy M. Van Cise, Ph.D. https://amyvancise.weebly.com/ (she/her/hers)

Research Associate, North Gulf Oceanic Society http://www.whalesalaska.org/ Visiting Scientist, Genetics and Evolution Program https://www.fisheries.noaa.gov/west-coast/science-data/genetics-and-evolution-pacific-northwest NOAA Northwest Fisheries Science Center 2725 Montlake Blvd E Seattle, WA


david-barnett commented 2 years ago

Hi Amy, thanks for the response and sharing your workaround, the use of slice_max gave me ideas for how to improve the new version of ps_dedupe, which will be available in the next microViz version 👍

I didn't get the data attachment; I think you have to attach it on GitHub itself, not by email.

avancise commented 2 years ago

Hi David,

GitHub wouldn't allow me to upload an Rdata file. I was hoping for better luck over email, but I guess it goes through the same system.
