benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Error Estimation Issues #1990

Open adomenig opened 1 month ago

adomenig commented 1 month ago

Hello,

We are having issues with the error estimation function in DADA2, specifically on our reverse reads. Whenever we run the code to learn errors (i.e. errR <- learnErrors(filtRs, multithread=TRUE, verbose = TRUE)), we get the following error message:

Error rates could not be estimated (this is usually because of very few reads).
Error in getErrors(err, enforce = TRUE) : Error matrix is NULL.
Calls: learnErrors -> dada -> getErrors
Execution halted

In response, we plotted the number of reads for each file and found that they all should have more than enough reads to run (the smallest number of reads was in the thousands). We're working with binned quality scores, which is not ideal but didn't pose a problem for the forward reads as they plotted without issue.
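A minimal sketch of how per-file read counts could be checked, assuming standard four-line FASTQ records. The helper name count_fastq_reads and the temporary demo file are illustrative only and are not part of DADA2 or of our actual pipeline:

```r
# Count reads in a FASTQ file, assuming exactly 4 lines per record.
# gzfile() reads both gzip-compressed and uncompressed files.
count_fastq_reads <- function(path) {
  con <- gzfile(path, "r")
  on.exit(close(con))
  length(readLines(con)) / 4
}

# Demo on a tiny, made-up FASTQ file with two records.
tmp <- tempfile(fileext = ".fastq.gz")
con <- gzfile(tmp, "w")
writeLines(c("@r1", "ACGT", "+", "FFFF",
             "@r2", "TTGA", "+", "F:FF"), con)
close(con)

n_reads <- count_fastq_reads(tmp)
```

On real data this would be applied to each filtered file, for example with sapply(filtRs, count_fastq_reads). Reading every line into memory is slow for large files, so a dedicated FASTQ reader may be preferable there.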

My initial thought was that a handful of files might be causing this issue, so I ran the code on small subsets to narrow down where the problem is, but the code ran perfectly fine in this case. More specifically, I went by the last digit in the ID of our filenames and ran it for all files ending in 0, then in 1, then 2, and so on up to 9, and saw no issues. However, when I flipped this method and left out one group at a time (for example, running everything except the files ending in 0), I got the same error again.
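The grouping described above can be sketched roughly as follows. The file names are hypothetical stand-ins (the real dataset has far more files), and last_digit is an illustrative helper, not a DADA2 function:

```r
# Hypothetical filtered reverse-read file names.
filtRs <- c("SRR24062130_R_filt.fastq.gz", "SRR24062131_R_filt.fastq.gz",
            "SRR24062132_R_filt.fastq.gz", "SRR24062140_R_filt.fastq.gz")

# Extract the last digit of the numeric ID that follows the "SRR" prefix.
last_digit <- function(paths) {
  ids <- sub("^SRR([0-9]+).*$", "\\1", basename(paths))
  substr(ids, nchar(ids), nchar(ids))
}

# One group per last digit: each element holds the files ending in that digit.
groups <- split(filtRs, last_digit(filtRs))

# Leave-one-group-out: every file except those in the named group.
leave_group_out <- lapply(names(groups), function(g) {
  filtRs[last_digit(filtRs) != g]
})
names(leave_group_out) <- names(groups)

# Each subset would then be passed to learnErrors(), e.g.
# errR <- dada2::learnErrors(groups[["0"]], multithread = TRUE, verbose = TRUE)
```

With the real file list, groups corresponds to the runs that succeeded and leave_group_out to the runs that failed.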

For reference, here is an example of the forward and reverse files we're using in this code (converted to csv and truncated to only 5000 lines):

SRR24062132_forward.csv SRR24062132_reverse.csv

We're a bit stumped as to where we should go from here, since the files look similar enough from our perspective and the code we use is identical to the tutorial (except for a change in the truncLen parameter). We haven't been able to narrow down where the issue might be, so any help would be appreciated! Thank you!

benjjneb commented 1 month ago

Could you share the reverse file that is throwing this error from learnErrors (a subsample as per above is fine) but in the original format? You can email me at benjamin.j.callahan@gmail.com

adomenig commented 1 month ago

Thank you for getting back to me so quickly! Unfortunately, the issue we're facing is that there doesn't seem to be a specific reverse file that's throwing the error. When I subset the dataset by filename into 10 subcategories and run those through the code, it always reaches the selfConsist step 1, which I assumed suggests that it's working since it never reaches that step when I run it with all the data. However, I can try letting the code finish running for the subcategories to see if it crashes somewhere midway through if you think that might be helpful?

benjjneb commented 1 month ago

My goal is to have a small dataset that reproduces the error you are seeing. That helps identify where the error is arising on my side, since I'm not familiar with the error you are reporting. So any single sample that throws this error is great.

adomenig commented 1 month ago

I apologize for the long wait! I've been trying to identify specific files that might be causing the error to send you a sample, but I couldn't pinpoint any problematic files. The issue seems to arise only when I process the entire dataset.

In my manual debugging efforts, I attempted leave-one-out testing with the files. Interestingly, excluding any single file made the process run smoothly. So, for our dataset of 881 samples, running the process with any random 880 files resolved the issue. This suggests the problem is likely due to the number of files rather than the content of any specific files.
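A rough sketch of this kind of leave-one-out check is shown here. To keep it runnable without the real data, learn_fn is a placeholder argument and mock_learn is a made-up stand-in that fails only above a certain number of files; in practice learn_fn would wrap the actual learnErrors call:

```r
# For each file, run learn_fn on all files except that one and record
# whether the call completed without an error.
leave_one_out <- function(files, learn_fn) {
  vapply(seq_along(files), function(i) {
    tryCatch({
      learn_fn(files[-i])
      TRUE
    }, error = function(e) FALSE)
  }, logical(1))
}

# Illustrative stand-in for learnErrors: fails when given more than 3 files.
mock_learn <- function(fls) {
  if (length(fls) > 3) stop("Error matrix is NULL.")
  invisible(NULL)
}

files <- sprintf("sample%d_R_filt.fastq.gz", 1:4)  # hypothetical names

ok <- leave_one_out(files, mock_learn)   # one result per omitted file
full_ok <- tryCatch({ mock_learn(files); TRUE }, error = function(e) FALSE)
```

In this toy setup every leave-one-out run succeeds while the full run fails, which mirrors the pattern seen with the 881 samples. With the real data, learn_fn could be something like function(fls) learnErrors(fls, multithread = TRUE), though repeating that for every file would be slow.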

For now, I've decided to proceed with 880 files. The error learning gave me these plots (using option 4 from this GitHub thread about binned quality scores), and they look good to me. However, I would love to hear your input on this matter. Given this fix, is there anything you can think of that was causing the issue in error learning?

Thank you for all your help!

[Figure: Reverse Errors (errR_plot_FINAL)]
[Figure: Forward Errors (errF_plot_FINAL)]

benjjneb commented 1 month ago

The error model plots look good to me.

> Given this fix, is there anything you can think of that was causing the issue in error learning?

No idea at all. It is baffling that removing one (random) sample would fix the learnErrors step. If anyone out there has seen something similar, please chime in.