Dealing with "blank" samples

robinsleith commented 1 year ago

As more folks get sequence data back, we have run into the issue of "blank" samples messing up the pipeline. These are samples that are run through the wet lab process with no DNA, to test for contamination at various steps of the sequencing process. We often find that these files have very few reads, and many have no reads after filterAndTrim. I think our method of collecting stats in a tibble causes a crash when there are no files or nothing in a set of files. Do you have an idea how to make the pipeline robust to these samples? Ideally, a sample with no reads after filterAndTrim would just get recorded in the final track.csv as having 0 reads. I don't think we can pass empty files to downstream functions, so we will have to figure out how to exclude those files without crashing the pipeline. Happy to chat more! @rfrancolini, do you mind putting together a few samples we can test this on? You had examples of blanks or failed samples crashing the pipeline, right?
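A rough sketch of the bookkeeping this would need, in base R. The sample names, counts, and variable names below are made up for illustration and are not taken from the pipeline; the idea is only that every input sample keeps a row, samples with no surviving file get 0, and only samples with reads are handed downstream.

```r
# Hypothetical inputs: every sample fed to the pipeline, and read counts
# only for the samples that still have a file after filtering.
all_samples <- c("s1", "s2", "blank1")
filtered_counts <- c(s1 = 1200L, s2 = 950L)  # blank1 produced no output file

# Look up each sample by name; samples without a file come back as NA,
# so record them as 0 reads instead.
filtered <- unname(filtered_counts[all_samples])
filtered[is.na(filtered)] <- 0L

# One row per input sample, including the empty ones.
track <- data.frame(name = all_samples, filtered = filtered,
                    stringsAsFactors = FALSE)

# Only samples that still have reads are passed to downstream steps.
keep <- track$name[track$filtered > 0]
```

Here `track` still has a row for `blank1` with 0 reads, while `keep` holds only `s1` and `s2`.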

btupper commented 1 year ago

Hmmm. Empty tibbles by themselves are not a problem (and can be quite handy.) Do you have 5 minutes to point your screen at me? (I'm headed out for a week of vacation in about 2h).

robinsleith commented 1 year ago

Let's table this until you're back. I will have to rendezvous with Rene to get an example dataset together. Have a nice vacation!

btupper commented 1 year ago

Should I be putting my dada2 boots on?

rfrancolini commented 1 year ago

Ha! Thanks for this reminder email. Yeah... we should all probably put our heads together for this one. We just got results back from Andre, who said he wasn't concerned about our data even though it looked questionable to us. I gotta do a little more digging to compare our processing pipelines.

btupper commented 1 year ago

When works for you two?

robinsleith commented 1 year ago

We can do anytime this afternoon (before 4).

btupper commented 1 year ago

Bah - a lost day. Sorry about that. What are the prospects for today?

rfrancolini commented 1 year ago

I'm around 11-4 today

robinsleith commented 1 year ago

1-4 ideal for me!

robinsleith commented 1 year ago

This may be unrelated, but since we were recently messing with track.csv for this issue, I am posting here. I tried to run the PacBio pipeline, which does not have reverse reads, and got the following error from the track function.

Error:
! Tibble columns must have compatible sizes.
• Size 390: Existing data.
• Size 0: Column `denoised_reverse`.
ℹ Only values of size one are recycled.
Backtrace:
     ▆
  1. └─global main(CFG)
  2.   ├─readr::write_csv(make_track(), file.path(CFG$output_path, "track.csv"))
  3.   │ └─readr::write_delim(...)
  4.   │   ├─base::stopifnot(is.data.frame(x))
  5.   │   └─base::is.data.frame(x)
  6.   └─make_track()
  7.     └─dplyr::tibble(...)
  8.       └─tibble:::tibble_quos(xs, .rows, .name_repair)
  9.         └─tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])
 10.           └─rlang::cnd_signal(error_incompatible_size(n, name, size, "Existing data"))
Execution halted
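
The error says one column (`denoised_reverse`) has length 0 while the others have length 390, and only length-1 values are recycled. One possible workaround, sketched here in base R, is to pad an empty column with NA before building the table. The helper name `na_if_empty` is hypothetical and not part of the pipeline; `data.frame()` is used to keep the sketch dependency-free, and since `tibble()` accepts columns of equal length, the same padding should carry over to `make_track()`.

```r
# Hypothetical helper: return x unchanged, or n NAs when x is empty
# (for example, when a run has no reverse reads).
na_if_empty <- function(x, n) {
  if (length(x) == 0) rep(NA_integer_, n) else x
}

input <- c(63282L, 92287L, 54462L)
denoised_reverse <- integer(0)  # no reverse reads in this run

# Every column now has the same length, so construction succeeds.
track <- data.frame(
  input = input,
  denoised_reverse = na_if_empty(denoised_reverse, length(input))
)
```

With this, a run without reverse reads would show NA in that column rather than halting.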

robinsleith commented 1 year ago

OK, now I am getting this as output. I understand why NA shows up for things like merged or others that depend on reverse reads, but what's up with nonchim and final_prr? The dataset is here: /mnt/storage/data/edna/dada/projects/shane/april_pacbio/process

name,input,filtered,denoised_forward,denoised_reverse,nonchim,final_prr
000,63282,59626,NA,NA,NA,NA
000,92287,87552,NA,NA,NA,NA
000,54462,51809,NA,NA,NA,NA
002,53805,51177,NA,NA,NA,NA
002,37988,36206,NA,NA,NA,NA
002,37779,36012,NA,NA,NA,NA
003,22202,21168,NA,NA,NA,NA
003,16733,15900,NA,NA,NA,NA
003,14836,14165,NA,NA,NA,NA
004,30789,29405,NA,NA,NA,NA
004,43792,41718,NA,NA,NA,NA
004,46460,44160,NA,NA,NA,NA

robinsleith commented 7 months ago

I feel like we solved this but I think we (correctly for triage) focused on making sure these samples didn't crash the pipeline. Did we ever get those empty samples captured in track.csv? I think folks would like to have an accounting of what happened to every file that they feed into the pipeline, which currently does not happen for files that have no reads after cutadapt...

BigelowLab / edna-dada2

Dealing with "blank" samples #28