benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

The sequences being tabled vary in length. #624

Closed rahulnccs closed 5 years ago

rahulnccs commented 5 years ago

Hi, I'm using the DADA2 "Big Data: Paired-end" workflow with version 1.8. It worked fine, but after constructing the sequence table I got the message "The sequences being tabled vary in length." Here is a summary of the sequence table:

dim(seqtab)
[1]  63 101
table(nchar(getSequences(seqtab)))

240 245 247 248 252 254 255 258 260 265 268 270 275 276 278 282 283 284 285 287
 13   1   1   2   2   1   1   1   1   1   3   4   1   1   1   2   5   3   1   1
308 311 313 314 315 319 326 330 333 334 335 336 342 351 353 355 358 359 363 364
  1   1   1   1   1   1   1   1   2   3   6   1   4   1   2   1   2   3   1   2
366 367 375 380 382 383 384 385 386
  1   1   1   3   1   1   7   3   1

Should I ignore this message, or do I need to change some parameters?

Thanks

benjjneb commented 5 years ago

Usually you can ignore this message, but it offers a chance to check on whether the length variation you are seeing is within your expectations. Do those lengths seem (mostly) reasonable for the amplicon you've sequenced?
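One way to run that check is to tabulate the sequence lengths and see how many reads fall outside the range you expect. The sketch below uses base R only and a made-up toy matrix in place of a real sequence table. It assumes, as in a DADA2 sequence table, that the columns are named by the sequences themselves. The `expected_lengths` range is a hypothetical placeholder, not a recommendation for this dataset.

```r
# Toy stand-in for a sequence table: rows are samples, columns are named by
# the sequences. All values here are made up for illustration.
seqs <- strrep("A", c(240, 252, 253, 384))
seqtab_toy <- matrix(c(10L, 0L, 5L, 7L, 3L, 2L, 0L, 1L), nrow = 2,
                     dimnames = list(c("sampleA", "sampleB"), seqs))

# Length of every sequence, and how many sequences have each length
seq_lengths <- nchar(colnames(seqtab_toy))
print(table(seq_lengths))

# Hypothetical expected range: replace with what your amplicon should produce
expected_lengths <- 240:290

# Keep only the columns whose sequence length is in the expected range
in_range <- seq_lengths %in% expected_lengths
seqtab_in_range <- seqtab_toy[, in_range, drop = FALSE]

# Fraction of reads that survive the length filter
reads_kept <- sum(seqtab_in_range) / sum(seqtab_toy)
print(reads_kept)
```

If nearly all reads sit inside the expected range, the off-length sequences are a minor component and the message can usually be ignored. Whether to drop them at all depends on the amplicon, since some markers vary in length naturally.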

rahulnccs commented 5 years ago

Yes. Still, to be sure, I am running it one more time. After filtering I get the output below. For most samples more than 60% of the reads pass the filter, which is good, but two samples retain far fewer (39.2% and 24.6%).

Read in 235520 paired-sequences, output 149037 (63.3%) filtered paired-sequences.
Read in 76016 paired-sequences, output 50344 (66.2%) filtered paired-sequences.
Read in 71133 paired-sequences, output 44552 (62.6%) filtered paired-sequences.
Read in 270069 paired-sequences, output 166612 (61.7%) filtered paired-sequences.
Read in 354636 paired-sequences, output 241816 (68.2%) filtered paired-sequences.
Read in 339676 paired-sequences, output 225829 (66.5%) filtered paired-sequences.
Read in 104596 paired-sequences, output 40981 (39.2%) filtered paired-sequences.
Read in 141726 paired-sequences, output 34902 (24.6%) filtered paired-sequences.
Read in 154177 paired-sequences, output 110072 (71.4%) filtered paired-sequences.
Read in 102638 paired-sequences, output 69282 (67.5%) filtered paired-sequences.
Read in 77665 paired-sequences, output 53562 (69%) filtered paired-sequences.

I am using the following parameters:
OUT <- filterAndTrim(fwd=file.path(pathF, fastqFs), filt=file.path(filtpathF, fastqFs),

Thanks for your prompt help.

benjjneb commented 5 years ago

You are right to pay attention to the difference in filtering percentages between samples, and you are asking the right questions. Based on my experience, I would be comfortable proceeding given the results you've shown. Two samples do have a considerably lower fraction of reads passing the filter, but I have seen variation at roughly that scale fairly often, and in the subset of cases I investigated the results were still reasonable.
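For anyone who wants to flag such samples programmatically, here is a minimal base R sketch using the read counts reported earlier in this thread. The sample labels and the 50% cutoff are hypothetical, chosen only for illustration. The column names `reads.in` and `reads.out` are meant to mirror the matrix that `filterAndTrim` returns, but that is an assumption worth checking against your own `OUT` object.

```r
# Read counts copied from the filtering output posted above
reads_in  <- c(235520, 76016, 71133, 270069, 354636, 339676,
               104596, 141726, 154177, 102638, 77665)
reads_out <- c(149037, 50344, 44552, 166612, 241816, 225829,
               40981, 34902, 110072, 69282, 53562)

filter_summary <- data.frame(
  sample    = sprintf("sample%02d", seq_along(reads_in)),  # hypothetical labels
  reads.in  = reads_in,
  reads.out = reads_out,
  stringsAsFactors = FALSE
)

# Percentage of read pairs that passed filtering, per sample
filter_summary$pct_retained <-
  round(100 * filter_summary$reads.out / filter_summary$reads.in, 1)

# Hypothetical cutoff for "worth a second look"; not a DADA2 default
min_pct <- 50
low_samples <- filter_summary[filter_summary$pct_retained < min_pct, ]
print(low_samples)
```

A sample that lands in `low_samples` is not necessarily unusable. As noted above, the point is to decide which samples deserve a closer look (for example at their quality profiles) before proceeding.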

rahulnccs commented 5 years ago

Thanks a lot! It helped.