Zymo-Research / figaro

An efficient and objective tool for optimizing microbiome rRNA gene trimming parameters
GNU General Public License v3.0

highest score is negative #34

Open adamsorbie opened 3 years ago

adamsorbie commented 3 years ago

Hi,

Firstly, thanks for the great tool, it's very helpful in choosing the cutoff parameters for DADA2.

I have a dataset which was very deeply sequenced and, as far as I can tell, the reads are not of ideal quality. I ran some other QC checks alongside FIGARO and the output was a little strange.

{"trimPosition": [287, 270], "maxExpectedError": [51, 54], "readRetentionPercent": 76.1, "score": -5232.900874495484}
{"trimPosition": [286, 271], "maxExpectedError": [50, 55], "readRetentionPercent": 76.12, "score": -5240.876849894292}
{"trimPosition": [285, 272], "maxExpectedError": [49, 56], "readRetentionPercent": 76.14, "score": -5252.855227753219}
{"trimPosition": [284, 273], "maxExpectedError": [48, 57], "readRetentionPercent": 76.18, "score": -5268.8155871612535}

The maxExpectedError values are very high and obviously the negative scores are also very strange. I'm guessing I must be doing something wrong here, but can't quite figure out what. Do you have any idea what would cause an output like this?
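For readers who want to inspect output like this themselves, here is a minimal sketch that parses the four result lines quoted above and ranks them by score. It assumes only that each line is a standalone JSON object with the keys shown in the post; the variable names are illustrative and not part of FIGARO.

```python
import json

# The four FIGARO result lines quoted in the post, copied verbatim.
raw_lines = [
    '{"trimPosition": [287, 270], "maxExpectedError": [51, 54], "readRetentionPercent": 76.1, "score": -5232.900874495484}',
    '{"trimPosition": [286, 271], "maxExpectedError": [50, 55], "readRetentionPercent": 76.12, "score": -5240.876849894292}',
    '{"trimPosition": [285, 272], "maxExpectedError": [49, 56], "readRetentionPercent": 76.14, "score": -5252.855227753219}',
    '{"trimPosition": [284, 273], "maxExpectedError": [48, 57], "readRetentionPercent": 76.18, "score": -5268.8155871612535}',
]

records = [json.loads(line) for line in raw_lines]

# Highest score first; with all-negative scores that is the one closest to zero.
ranked = sorted(records, key=lambda r: r["score"], reverse=True)

for r in ranked:
    fwd_trim, rev_trim = r["trimPosition"]
    fwd_ee, rev_ee = r["maxExpectedError"]
    print(
        f"trim {fwd_trim}+{rev_trim}={fwd_trim + rev_trim}  "
        f"maxEE {fwd_ee}+{rev_ee}={fwd_ee + rev_ee}  "
        f"retained {r['readRetentionPercent']}%  score {r['score']:.1f}"
    )
```

In these four lines the combined trim length is the same for every candidate (557) and so is the combined expected-error allowance (105); only the split between forward and reverse reads changes.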

michael-weinstein commented 3 years ago

Ouch... that's one I haven't seen before. The simplest explanation for this would actually be poor read quality. Are you familiar with FastQC? If so, would you be able to give your reads a run through there and tell me what you see? If not, I can set up a Zoom call and walk you through it.

adamsorbie commented 3 years ago

Yeah, I actually ran FastQC/MultiQC before running FIGARO. I'm already aware the quality is far from ideal, but unfortunately I don't have much experience with poor-quality reads, so I don't have much intuition to go on for handling this.

This is the per-base sequence quality from MultiQC: (image: fastqc_per_base_sequence_quality_plot)

Edit: FYI, the amplicon is V1-V3 (507 bp), sequenced paired-end 2 x 300.
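As a quick sanity check on those numbers, the overlap left for merging can be estimated from the amplicon length and the trim positions. This is a minimal sketch under two assumptions: the 507 bp figure is the full span the two reads must cover between them, and each read starts at its end of the amplicon. The helper name `merged_overlap` is hypothetical.

```python
def merged_overlap(trim_fwd: int, trim_rev: int, amplicon_len: int) -> int:
    """Bases shared by the two trimmed reads when together they span the amplicon.

    A negative result would mean the reads no longer meet and cannot be merged.
    """
    return trim_fwd + trim_rev - amplicon_len


AMPLICON_LEN = 507  # V1-V3 amplicon length stated in the thread
READ_LEN = 300      # 2 x 300 paired-end run

# Untrimmed reads: the most overlap this run can offer.
print(merged_overlap(READ_LEN, READ_LEN, AMPLICON_LEN))  # 93

# Top-ranked trim positions from the output quoted earlier in the thread.
print(merged_overlap(287, 270, AMPLICON_LEN))  # 50
```

Under these assumptions there are at most 93 bases of slack across both reads, so only a limited amount of the low-quality tails can be trimmed before the pairs stop overlapping.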

michael-weinstein commented 3 years ago

That looks pretty bad. It would appear from this that by base 250, you're already looking at somewhere between 1% and 10% base call error. One question: are you including PhiX in this run? Would you be able to share the per-position base frequency graph for this run? I think FastQC and MultiQC produce that. Would you also be able to share the graphs generated by FIGARO?
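The 1% to 10% range corresponds to Phred quality scores between 20 and 10, using the standard relation p = 10^(-Q/10). The sketch below shows that conversion and how per-base error probabilities add up to a read's expected error count. The function names are illustrative, and reading maxExpectedError 51 at trim position 287 as a per-read error budget is an assumption about how to interpret the output, not a statement of FIGARO's internals.

```python
import math


def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score corresponding to a base call error probability p."""
    return -10 * math.log10(p)


def expected_errors(quals) -> float:
    """Expected number of wrong bases in a read: the sum of per-base error probabilities."""
    return sum(phred_to_error(q) for q in quals)


# Q20 corresponds to 1% error and Q10 to 10% error.
print(phred_to_error(20), phred_to_error(10))

# A 100-base read at a uniform Q20 carries about one expected error.
print(expected_errors([20] * 100))

# Assumption: treat 51 expected errors over 287 bases as an average per-base rate.
mean_error = 51 / 287
print(mean_error, error_to_phred(mean_error))
```

Read this way, a forward read at the reported threshold averages well above 10% error per base, which fits the poor quality profile described in this thread.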

adamsorbie commented 3 years ago

It's actually published data that I'm re-analysing with ASVs instead of OTUs, so I don't have all the information about the sequencing, but I will ask around and see if anyone knows. From what I know, the sequencing was performed by Eurofins or GATC, but unfortunately they don't give much information on their website about what calibrations they include.

Sure, the other plots are attached. (images: forwardExpectedError, reverseExpectedError)

MultiQC unfortunately doesn't export that plot, so I just attached a few examples from FastQC. (two attached images)