MBoemo / DNAscent

Software for detecting regions of BrdU and EdU incorporation in Oxford Nanopore reads.
https://www.boemogroup.org/
GNU General Public License v3.0
22 stars 12 forks

False positives despite >Q filtering? #22

Closed GeoMicroSoares closed 2 years ago

GeoMicroSoares commented 2 years ago

Hi again @MBoemo ,

I've been processing an R9.4.1 run I did with 4 barcodes - barcodes 1, 3 and 4 came from a BrdU-fed experiment and barcode 2 is a negative control. However, I'm getting quite strange results. All of this was run by strictly following the recommendations shown here. As you can see, there's a drop-off after increasing Q (-q) to 30 and through 40. However, after that, -q 45 and -q 50 see an increase in %BrdU incorporation (left plot). This causes quite a change in total BrdU incorporation after subtracting the percentages found in barcode 2 from the others (right plot).

[image: %BrdU incorporation per barcode at increasing -q thresholds (left) and after subtracting barcode 2 (right)]

To generate the input for these plots, I simply ran the script below, adapted from here, filtering out any BrdU calls below 90% confidence and summarising those results. The resulting table was plotted in R after calculating the number of total BrdU calls per barcode and dividing those by the total number of Ts in the reference genome. The plot on the right subtracts the % BrdU incorporation found in barcode 2 from the remaining barcodes, since the BrdU calls were found across the whole genome and thus the error was assumed to be random. Am I calculating something wrong here, or is there just an inherent false positive rate for DNAscent?

All we're interested in at this point is to obtain the % BrdU incorporation per barcode. Thanks in advance for any advice you can give.
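For clarity, the per-barcode calculation described above can be sketched as follows (a toy sketch; the helper names and barcode labels are mine, not part of DNAscent):

```python
# Hypothetical helpers illustrating the calculation described above:
# positive calls divided by reference Ts, then background-subtracted
# against the negative-control barcode.

def pct_incorporation(n_brdu_calls, n_ref_thymidines):
    """% of reference T positions called as BrdU."""
    return 100.0 * n_brdu_calls / n_ref_thymidines

def background_subtracted(pct_by_barcode, control):
    """Subtract the negative control's % from every other barcode."""
    bg = pct_by_barcode[control]
    return {bc: pct - bg for bc, pct in pct_by_barcode.items() if bc != control}
```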

#!/home/andre/anaconda3/envs/tf_gpu_env/bin/python

import sys
import os
import pandas as pd

data = []
directory = os.fsencode('.')
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".output.detect"):
        new_file = filename + ".summary"
        if os.path.isfile(new_file):
            print("File is there, not doing anything")
            continue
        else:
            print("Doing ", filename)
            f = open(filename, 'r')
            for line in f:
                # ignore the header lines
                if line[0] == '#':
                    continue

                # split the line into a list by whitespace
                splitLine = line.rstrip().split()

                if line[0] == '>':
                    readID = splitLine[0][1:]
                    chromosome = splitLine[1]
                    refStart = int(splitLine[2])
                    refEnd = int(splitLine[3])
                    strand = splitLine[4]
                else:
                    posOnRef = int(splitLine[0])
                    probBrdU = float(splitLine[1])
                    sixMerOnRef = splitLine[2]
                    # add these values to a container or do some processing here
                    data.append([chromosome, readID, posOnRef, strand, probBrdU])

            f.close()
            df = pd.DataFrame(data, columns=['Scaffold', 'Read_ID', 'Ref_Location', 'Strand', 'Brdu_Conf'])
            filt = df['Brdu_Conf'] >= 0.9
            df_filt = df[filt]
            df_filt.to_csv(new_file, index=False)
MBoemo commented 2 years ago

A few thoughts:

GeoMicroSoares commented 2 years ago

Hi again,

  • How many reads are you testing - definitely enough that we can assume the SEM in these plots are quite low?

Barcode 01: 303179
Barcode 02: 346841
Barcode 03: 263636
Barcode 04: 303981

  • A threshold of > 0.9 for a positive BrdU call is quite high. I typically use a threshold of 0.5 for things like this. If you look at the ROC curves in Fig 2 of the most recent paper the model was tuned so that the actual BrdU incorporation is called using a threshold of about 0.5. So based on the benchmarks in the figure, I would expect a very high rate of false negatives for a 0.9 threshold which might be causing what you're seeing.

Thanks for the tip - I'll get back to you with the results at a call threshold of >0.5. Just a quick question - why would you expect more false positives at higher Q?

  • I've run quite a few tests in the past that show an increase in BrdU incorporation affects the quality of the basecalling, which could in turn impact mapping quality.

I've also been worried about this - we've used 10mM BrdU for this experiment but might try other concentrations too in the future.

MBoemo commented 2 years ago

Okay so plenty of reads going in and I assume you've checked that plenty of these reads are passing each of the mapping quality thresholds above. I wouldn't expect more false positives at a higher mapping quality necessarily - the relationship between mapping quality and false positives isn't always clear cut and is probably going to depend on what you're trying to do. My best bet is this is some kind of sampling issue caused by the probability threshold, but let me know if that's not the case.

MBoemo commented 2 years ago

Also, just looked at your plotting script more carefully - it looks like the data list is being declared as empty on Line 7, and then it's being continuously appended to as you iterate through each detect file (and not being emptied between files). That's going to write data to the output csv file for the detect file you're on plus all the other files with the .output.detect extension that it previously iterated through. Was that intentional?
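A toy sketch of the fix (my illustration, not the actual script): declare the container inside the per-file loop, so each summary holds only that file's calls:

```python
def summarise_per_file(lines_by_file):
    """Toy model of the loop above: collect calls separately per input file."""
    summaries = {}
    for filename, lines in lines_by_file.items():
        data = []  # declared inside the loop, so it is reset for every file
        for line in lines:
            data.append(line)
        summaries[filename] = data
    return summaries
```

With `data = []` at module level instead, the second file's summary would also contain the first file's rows.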

GeoMicroSoares commented 2 years ago

It wasn't intentional at all - not much of a python person... I'll fix that too and check on the results. Thank you again.

MBoemo commented 2 years ago

Assuming that fixed it and closing for now, but reopen if there are any further issues.

GeoMicroSoares commented 2 years ago

Hi again Michael,

So I've re-run things as you suggested (thanks again for the python tips) and I'm still getting some signal from the unlabelled barcode albeit at a smaller scale (below). I've run DNAscent with -q 30 and varied the -l parameter to check if that could improve results and it did. I also took into account BrdU call confidence >50% instead of >90% and as before counted all the unique locations called as BrdU, comparing their number with the Ts in the reference.

The results improved quite a lot so thanks again, but I'm still getting a baseline of ca. 11-12% BrdU incorporation for the unlabelled barcode. Do you think this could be improved further?

Thanks again for all the help.

[image: updated %BrdU incorporation per barcode after the fixes, varying -l]

MBoemo commented 2 years ago

Definitely looking better, but I'm a little suspicious of the 10% false positive rate - anything's possible, but the highest I've seen is ~1.5% and we've tested quite a few different substrates now. Can I just confirm that this is definitely DNAscent v2 (the one released with the 2020 BMC Genomics paper) and not the old HMM one? HMM-based detection had a much higher false positive rate than the current neural network approach.

The following python script is how I would calculate positive calls from the output file of DNAscent detect. Can you run this on one of your files to make sure we're getting the same answer?

import sys

#Usage: python countCalls_cnn.py <prefix>.detect

threshold = 0.50

f = open(sys.argv[1],'r')
numCalls = 0
numAttempts = 0

for line in f:

    if line[0] in ['#','%','>']:
        continue

    else:
        splitLine = line.rstrip().split()
        BrdUprob = float(splitLine[1])

        if BrdUprob > threshold:
            numCalls += 1
        numAttempts += 1
f.close()
print("Positive BrdU Calls: ",float(numCalls)/float(numAttempts) * 100, '%')
GeoMicroSoares commented 2 years ago

So, definitely version 2:

$ DNAscent --version
Version: 2.0.2
Written by Michael Boemo, Department of Pathology, University of Cambridge.
Please submit bug reports to GitHub Issues.

This made me realise we're calculating different things, though - while you're calculating per-sample hits (BrdU calls / "attempts"), I'm interested in knowing the % BrdU incorporation in each barcode/sample against a reference. For this, I'm taking only the >0.5 BrdU calls, but filtering them by genomic position (if >1 are present for a locus, only 1 is kept) across all reads. In the table below I summarise the data from the two approaches (your script at >0.5, your script at >0.9, my approach at >0.5 and my approach at >0.9).

| Barcode | numCalls >50BC | numCalls/numAttempts >50BC | numCalls >90BC | numCalls/numAttempts >90BC | uniqueCalls >50BC | uniqueCalls/Ref_T (%) >50BC | uniqueCalls >90BC | uniqueCalls/Ref_T (%) >90BC |
|---|---|---|---|---|---|---|---|---|
| 01 | 2287426 | 1.287 % | 351023 | 0.197 % | 1045428 | 91.593 | 287031 | 25.148 |
| 02 | 2098763 | 0.935 % | 169589 | 0.075 % | 856025 | 74.999 | 123343 | 10.806 |
| 03 | 1952099 | 1.185 % | 259492 | 0.158 % | 926677 | 81.189 | 213846 | 18.736 |
| 04 | 2392607 | 1.225 % | 336161 | 0.172 % | 1049330 | 91.935 | 271277 | 23.767 |

It really seems that requiring >90% BrdU call confidence makes more sense, as even by your method that's where you see a difference I'd call significant. It also seems the two approaches are comparable; it's just that your counts include calls for the same genomic position across reads.

I think for now I'll stick to my approach and include only those calls made at >90% confidence, but I'd love to hear your thoughts after seeing the data above. Thanks again for taking the time for this!
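In case it helps to compare, the unique-position counting described above could be sketched like this (a sketch only, assuming a pandas DataFrame with the column names from my earlier summary script; `n_ref_thymidines` would come from the reference):

```python
import pandas as pd

def pct_unique_brdu(df, n_ref_thymidines, threshold=0.9):
    """% of reference Ts with at least one BrdU call above threshold,
    counting each genomic position only once across all reads."""
    calls = df[df['Brdu_Conf'] > threshold]
    unique_calls = calls.drop_duplicates(subset=['Scaffold', 'Ref_Location'])
    return 100.0 * len(unique_calls) / n_ref_thymidines
```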

MBoemo commented 2 years ago

I see - so if I understand correctly, you're taking the reference and looking at the fraction of positions that have at least one positive BrdU call from one of the reads in the sample? If so, then I'm not surprised by the high % from barcode02 - if you have enough coverage, most positions are going to have a false positive eventually.
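The coverage effect is easy to quantify: under the simplifying assumption that errors on overlapping reads are independent, the chance that a given position accumulates at least one false positive is 1 - (1 - p)^c for per-call false positive rate p and coverage c:

```python
def p_any_false_positive(per_call_fpr, coverage):
    """Probability that a reference position gets >=1 false positive call,
    assuming calls on overlapping reads err independently (a simplification)."""
    return 1.0 - (1.0 - per_call_fpr) ** coverage
```

For example, even a per-call rate of 0.075% (barcode 02 at >0.9 in the table above) gives roughly a 7% chance per position at 100x coverage, so unique-position percentages climb with sequencing depth.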

GeoMicroSoares commented 2 years ago

Exactly what I'm doing, yes - but then I'd say increasing the minimum confidence threshold for a call to, say, 0.9 or 0.95 would be a good way to avoid those, right? I wonder how this will compare using R10 flowcells and Q20+ chemistry...