Closed pontushojer closed 4 years ago
Do these reads seem to have a high mapping quality?
That is a good point, I have not checked that yet.
Here is an image of the chr1 coverage for different MAPQ thresholds (filtered from 0 to 30)
This improved things quite a bit, though there are still a large number of reads with high mapq in those regions. We should definitely think about including a threshold for this in the pipe.
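To make the thresholding idea concrete, here is a minimal sketch of MAPQ filtering over SAM-formatted records in pure Python (in practice `samtools view -q 30` does the same job in the pipeline; the threshold of 30 is just the value tried above):

```python
# Sketch: keep only alignments at or above a MAPQ cutoff.
# Per the SAM spec, MAPQ is column 5 (index 4) of an alignment line.

MAPQ_THRESHOLD = 30  # illustrative cutoff, matching the filtering above

def filter_by_mapq(sam_lines, threshold=MAPQ_THRESHOLD):
    """Yield header lines and alignments with MAPQ >= threshold."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass through untouched
            yield line
            continue
        fields = line.split("\t")
        if int(fields[4]) >= threshold:
            yield line

sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",  # MAPQ 60, kept
    "r2\t0\tchr1\t200\t5\t50M\t*\t0\t0\t*\t*",   # MAPQ 5, dropped
]
kept = list(filter_by_mapq(sam))
print(len(kept))  # 2 (header + r1)
```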
Definitely. But since, as you pointed out, decoy sequences are used by 10x, adding them might also be a way forward.
I ran some tests using BWA, with and without the decoy.
Looking at the coverage, BWA actually had higher coverage in the regions than bowtie2 (top two tracks in the image below). Comparing runs with and without the decoy, there was almost no noticeable difference in coverage (bottom two tracks below).
Looking at the top barcode merges for chr1, there is a very slight improvement in how many are assigned to the AAAACATGTGCATATGTTCA
barcode: 30435 without the decoy and 30402 with it. This is however still more than mapping with bowtie2, which had "only" 28819 (see first comment in issue).
No decoy:
30435 AAAACATGTGCATATGTTCA
41 CAACCTAGGTTGGTTCCATG
39 AGTGCAACTTTCCATCGTTC
31 CAAAGGCCGATATATGGTAG
29 ATTCCTCGGTAGTACGTAAG
26 CACAAACCCGCAGAACCAAG
26 CAAATTCCGGCATTACCGCG
26 AAGGTTGTGTGGGTGTGCG
24 CGTGCTCCCGTCGTAACTAG
21 CAAGCTTCGGCATTCATACG
With decoy:
30402 AAAACATGTGCATATGTTCA
41 CAACCTAGGTTGGTTCCATG
39 AGTGCAACTTTCCATCGTTC
31 CAAAGGCCGATATATGGTAG
29 ATTCCTCGGTAGTACGTAAG
26 CACAAACCCGCAGAACCAAG
26 AAGGTTGTGTGGGTGTGCG
25 CAAATTCCGGCATTACCGCG
24 CGTGCTCCCGTCGTAACTAG
21 CAAGCTTCGGCATTCATACG
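For reference, tallies like the ones above can be produced from a barcode-merges CSV with a few lines of Python. The column layout here (old barcode first, merge target second) is an assumption about the file format, not taken from the pipeline:

```python
# Sketch: count how many barcodes were merged into each target barcode.
# Assumed CSV layout: one merge per row, "old_barcode,new_barcode".
import csv
from collections import Counter
from io import StringIO

def top_merge_targets(csv_text, n=10):
    """Return the n most common merge targets as (barcode, count) pairs."""
    counts = Counter()
    for row in csv.reader(StringIO(csv_text)):
        old_barcode, new_barcode = row[0], row[1]  # assumed column order
        counts[new_barcode] += 1
    return counts.most_common(n)

example = "AAA,TTT\nCCC,TTT\nGGG,ACG\n"
print(top_merge_targets(example, n=2))  # [('TTT', 2), ('ACG', 1)]
```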
I looked a bit into how 10x handles their BAM files and found that they have a step called FILTER_BARCODES
(see their pipe) which "Removes barcodes not associated with single-occupancy GEM partitions", i.e. removes cluster duplicates.
Their solution is based on this Python script. As far as I can understand, they identify these by creating 50 kbp bins for the entire genome and counting the number of unique barcodes in each, so they get a matrix of barcodes vs bins. Then they compare rows (barcodes) against each other to measure the amount of overlap. Interestingly, I found that in one place they remove the bins with the highest coverage from this operation, so seemingly they ran into the same problem as us.
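My reading of that approach can be sketched like this (this is an illustration of the binning idea, not the actual 10x code; bin size follows the 50 kbp described above, the number of dropped bins is a made-up parameter):

```python
# Sketch: barcode-vs-bin overlap, excluding the highest-coverage bins.
from collections import Counter, defaultdict

BIN_SIZE = 50_000  # 50 kbp bins, as in the 10x script

def barcode_bin_sets(reads, drop_top_bins=2):
    """reads: iterable of (barcode, chrom, pos) tuples.
    Returns barcode -> frozenset of bins, with the `drop_top_bins`
    most read-dense bins excluded from every barcode."""
    bin_reads = Counter()
    per_barcode = defaultdict(set)
    for barcode, chrom, pos in reads:
        b = (chrom, pos // BIN_SIZE)
        bin_reads[b] += 1
        per_barcode[barcode].add(b)
    excluded = {b for b, _ in bin_reads.most_common(drop_top_bins)}
    return {bc: frozenset(bins - excluded) for bc, bins in per_barcode.items()}

def overlap(bins_a, bins_b):
    """Shared bins as a fraction of the smaller barcode's bin set."""
    if not bins_a or not bins_b:
        return 0.0
    return len(bins_a & bins_b) / min(len(bins_a), len(bins_b))

reads = [
    ("BC1", "chr1", 10_000), ("BC1", "chr1", 120_000),
    ("BC2", "chr1", 20_000), ("BC2", "chr1", 130_000),
    ("BC3", "chr2", 5_000),
]
sets_ = barcode_bin_sets(reads, drop_top_bins=0)
print(overlap(sets_["BC1"], sets_["BC2"]))  # 1.0, same two bins
```

Two barcodes sharing all their bins (overlap near 1) would be flagged as cluster duplicates; dropping the densest bins keeps the high-coverage artefact regions from inflating that score.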
After merging https://github.com/NBISweden/BLR/pull/30 it is clear that this has become a big issue. In a test run on an SM10 dataset, about 40% of the reads were filtered out, which is a substantial loss.
One issue that I have been noticing is that `clusterrmdup` seems to merge together a large number of barcodes for most datasets.
Example 1: test run, chr1.
Data can be found at
/proj/uppstore2018173/private/pontus/runs/200707_synchronise-molecules_test
Looking first at the barcode-merges CSV we can see that the top merge cluster has been assigned 28,819 new barcodes while the second top only got 38.
Example 2: run on the full dataset.
Data can be found at
/proj/uppstore2018173/private/analysis/200609.P14314_1006.pontus.rerun
Here 113,055 barcodes are assigned to the AAAACAACGATCGTTCTGCG barcode while the next top ones have about 20-30 each.
Possible cause
I looked more into this and it seems to be related to regions with very high read coverage; see the figure below for chr1. The top track shows the log-scaled coverage. The lower track shows the positions where most of the barcodes were merged (>10 barcodes). As you can see, most of the merges occurred in these high-coverage regions.
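One simple way to locate such regions automatically is to bin read positions and flag bins far above the chromosome median. This is a sketch under assumed parameters (bin size and fold cutoff are illustrative, not values from the pipeline):

```python
# Sketch: flag candidate high-coverage bins from read start positions.
import statistics
from collections import Counter

def high_coverage_bins(positions, bin_size=1_000, fold=10):
    """Return start coordinates of bins whose read count exceeds
    `fold` times the median bin count."""
    counts = Counter(p // bin_size for p in positions)
    median = statistics.median(counts.values())
    return sorted(b * bin_size for b, c in counts.items() if c > fold * median)

positions = list(range(0, 10_000, 100))  # uniform background, 10 reads/bin
positions += [5_500] * 500               # one artificial pile-up in bin 5
print(high_coverage_bins(positions))     # [5000]
```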
Solution?
The high coverage is likely a mapping artefact, so we need to either remove these regions from the `clusterrmdup` analysis or change how reads are mapped.
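The first option amounts to a blacklist lookup: before `clusterrmdup` considers a read, check whether its position falls in a masked interval. A minimal sketch, assuming the blacklist is a sorted list of non-overlapping per-chromosome intervals (how those intervals are produced is left open):

```python
# Sketch: O(log n) blacklist lookup over sorted, non-overlapping intervals.
import bisect

def make_mask(intervals):
    """intervals: sorted, non-overlapping (start, end) tuples for one chromosome."""
    starts = [s for s, _ in intervals]
    return starts, intervals

def in_blacklist(pos, mask):
    """True if pos falls inside any masked interval (half-open [start, end))."""
    starts, intervals = mask
    i = bisect.bisect_right(starts, pos) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

mask = make_mask([(1_000, 2_000), (5_000, 6_000)])
print(in_blacklist(1_500, mask), in_blacklist(3_000, mask))  # True False
```

Reads for which `in_blacklist` returns True would simply be skipped when collecting barcode evidence, so the high-coverage artefact regions can no longer drive the runaway merges described above.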