FrickTobias / BLR


Overclustering in clusterrmdup #218

Closed pontushojer closed 4 years ago

pontushojer commented 4 years ago

One issue that I have been noticing is that clusterrmdup seems to merge together a large number of barcodes for most datasets.

Example 1: from a test run, for chr1.

Data can be found at /proj/uppstore2018173/private/pontus/runs/200707_synchronise-molecules_test

Looking first at the barcode-merges CSV we can see that the top merge cluster has been assigned 28,819 new barcodes while the second top only got 38.

(blr) [phojer@rackham2 chunks]$ tail -n +2 chr1.barcode-merges.csv | cut -d, -f2 | sort | uniq -c | sort -nr | head
  28819 AAAACATGTGCATATGTTCA
     38 CAAATACGGTTGCGCCGGCC
     27 AAGGTTGTGTGGGTGTGCG
     24 CAAATGCCGACATACACACG
     23 CATGCAAAGGTATGAGCTTG
     21 CAAATTCCGGAATGCCTTAA
     19 CAAACAAGCAACGGAGATCC
     17 CAACGATAGGCCGTCAGTAA
     16 CATGGTCCTAAAGAAGCACA
     16 CAAATGTATTCGCAAACTTA

Example 2: run on the full dataset.

Data can be found at /proj/uppstore2018173/private/analysis/200609.P14314_1006.pontus.rerun

Here 113,055 barcodes are assigned to the AAAACAACGATCGTTCTGCG barcode while the next top ones have about 20-30 each.

(blr) [phojer@rackham2 200609.P14314_1006.pontus.rerun]$ tail -n +2 mapped.barcode-merges.csv | cut -d, -f2 | sort | uniq -c | sort -nr | head
 113055 AAAACAACGATCGTTCTGCG
     31 CAAAGTCGCAAATACCTTTC
     28 CACATTTGGGCATATATGTC
     26 CACGCTCCGATATGCGGAAC
     25 AATGGTTACGCCGGAATATA
     24 CACAGGAGTAACTTTCGAAA
     24 ATTATTACTATACTACGTTG
     23 CAACGTTCTATGCTAGGTTA
     23 ATCCGTAATACCGTTCTTAA
     22 CATCTTCGCGTGTGCCGTAA

Possible cause

I looked into this some more and it seems to be related to regions with very high read coverage; see the figure below for chr1. The top track shows the log-scaled coverage. The lower track shows the positions where most of the barcodes were merged (>10 barcodes). As you can see, most of the merges occurred in these high-coverage regions.

Screenshot 2020-07-14 at 11 24 41

Solution?

The high coverage is likely a mapping artefact, so we need to either remove these regions from the clusterrmdup analysis or change how reads are mapped.

  1. I found that 10x uses a decoy in their reference. Possibly this could remove some of the reads mapped here.
  2. Test different mappers; all examples above are from bowtie2 mappings.
  3. Create or find a blacklist with regions to avoid in clusterrmdup.
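As a rough sketch of option 3, blacklisted regions could be skipped during the duplicate search with a simple interval lookup. This is a minimal sketch in Python; the interval format and function names are hypothetical, not part of clusterrmdup:

```python
from bisect import bisect_right

def build_blacklist(intervals):
    """Sort intervals per chromosome for fast lookup.

    `intervals` is an iterable of (chrom, start, end) tuples,
    e.g. parsed from a BED file of high-coverage regions.
    """
    by_chrom = {}
    for chrom, start, end in intervals:
        by_chrom.setdefault(chrom, []).append((start, end))
    for chrom in by_chrom:
        by_chrom[chrom].sort()
    return by_chrom

def in_blacklist(blacklist, chrom, pos):
    """Return True if `pos` falls inside a blacklisted interval on `chrom`."""
    regions = blacklist.get(chrom, [])
    # Find the last interval starting at or before pos.
    i = bisect_right(regions, (pos, float("inf"))) - 1
    return i >= 0 and regions[i][0] <= pos < regions[i][1]

# Hypothetical example region; real coordinates would come from a curated blacklist.
blacklist = build_blacklist([("chr1", 121_000_000, 125_000_000)])
```

Positions flagged by `in_blacklist` could then simply be excluded before barcodes are compared for merging.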
FrickTobias commented 4 years ago

Do these reads seem to have a high mapping quality?

pontushojer commented 4 years ago

That is a good point, I have not checked that yet.

Here is an image of the chr1 coverage for different MAPQ thresholds (filtered from 0 to 30)

Screenshot 2020-07-15 at 10 50 33

This improved things quite a bit, though there are still a large number of reads with high mapq in those regions. We should definitely think about including a threshold for this in the pipe.
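Such a threshold could look something like the sketch below. The read representation and cutoff are placeholders, not the actual pipeline interface, and the right value would need benchmarking:

```python
# Placeholder cutoff; 30 matches the filtering tried above but is not tuned.
MIN_MAPQ = 30

def filter_by_mapq(reads, min_mapq=MIN_MAPQ):
    """Drop alignments below the MAPQ cutoff before they reach
    the barcode-merging step.

    `reads` is an iterable of (barcode, chrom, pos, mapq) tuples,
    a stand-in for however the pipeline actually represents reads.
    """
    return [r for r in reads if r[3] >= min_mapq]

reads = [
    ("AAAACATGTGCATATGTTCA", "chr1", 100, 60),  # confident alignment, kept
    ("CAAATACGGTTGCGCCGGCC", "chr1", 200, 0),   # multi-mapper, dropped
]
kept = filter_by_mapq(reads)
```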

FrickTobias commented 4 years ago

Definitely. But as you pointed out, 10x uses a decoy, so adding decoy sequences might also be a way forward.

pontushojer commented 4 years ago

I ran some tests using BWA, with and without the decoy.

Looking at the coverage, BWA (top two tracks in the image below) actually had higher coverage in these regions than bowtie2. As for the decoy, there was almost no noticeable difference in coverage between using and not using it (bottom two tracks below).

Screenshot 2020-07-21 at 10 51 48

Looking at the top barcode merges for chr1, there is a very slight improvement in how many are assigned to the AAAACATGTGCATATGTTCA barcode: 30,435 without the decoy and 30,402 with it. This is however still more than mapping with bowtie2, which had "only" 28,819 (see the first comment in this issue).

Top barcode merges - Without decoy

  30435 AAAACATGTGCATATGTTCA
     41 CAACCTAGGTTGGTTCCATG
     39 AGTGCAACTTTCCATCGTTC
     31 CAAAGGCCGATATATGGTAG
     29 ATTCCTCGGTAGTACGTAAG
     26 CACAAACCCGCAGAACCAAG
     26 CAAATTCCGGCATTACCGCG
     26 AAGGTTGTGTGGGTGTGCG
     24 CGTGCTCCCGTCGTAACTAG
     21 CAAGCTTCGGCATTCATACG

Top barcode merges - With decoy

  30402 AAAACATGTGCATATGTTCA
     41 CAACCTAGGTTGGTTCCATG
     39 AGTGCAACTTTCCATCGTTC
     31 CAAAGGCCGATATATGGTAG
     29 ATTCCTCGGTAGTACGTAAG
     26 CACAAACCCGCAGAACCAAG
     26 AAGGTTGTGTGGGTGTGCG
     25 CAAATTCCGGCATTACCGCG
     24 CGTGCTCCCGTCGTAACTAG
     21 CAAGCTTCGGCATTCATACG
pontushojer commented 4 years ago

I looked a bit into how 10x handles their BAM files and found that they have a step called FILTER_BARCODES (see their pipe) which "Removes barcodes not associated with single-occupancy GEM partitions", i.e. removes cluster duplicates.

Their solution is based on this Python script. As far as I can understand, they identify these by creating 50 kbp bins for the entire genome and counting the number of unique barcodes in each, so they get a matrix of barcodes vs bins. They then compare rows (barcodes) against each other to measure the amount of overlap. Interestingly, I found that in one place they remove the bins with the highest coverage from this operation; seemingly they ran into the same problem as us.
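My reading of that binning logic, as a rough Python sketch. The bin size matches their script, but the coverage cutoff and overlap threshold are my own assumptions, not values from the 10x code:

```python
from collections import defaultdict

BIN_SIZE = 50_000        # 50 kbp bins, as in the 10x script
TOP_BIN_FRACTION = 0.01  # assumed: fraction of highest-coverage bins to drop
MIN_OVERLAP = 3          # assumed: shared bins needed to call two barcodes duplicates

def barcode_bin_sets(reads):
    """Map each barcode to the set of genome bins its reads fall in.

    `reads` is an iterable of (barcode, chrom, pos) tuples.
    """
    bins = defaultdict(set)
    for barcode, chrom, pos in reads:
        bins[barcode].add((chrom, pos // BIN_SIZE))
    return bins

def drop_top_bins(bins):
    """Remove the bins hit by the most barcodes (high-coverage artefacts)."""
    counts = defaultdict(int)
    for bin_set in bins.values():
        for b in bin_set:
            counts[b] += 1
    n_drop = max(1, int(len(counts) * TOP_BIN_FRACTION))
    hot = set(sorted(counts, key=counts.get, reverse=True)[:n_drop])
    return {bc: s - hot for bc, s in bins.items()}

def candidate_merges(bins):
    """Yield barcode pairs sharing at least MIN_OVERLAP bins."""
    barcodes = sorted(bins)
    for i, a in enumerate(barcodes):
        for b in barcodes[i + 1:]:
            if len(bins[a] & bins[b]) >= MIN_OVERLAP:
                yield a, b
```

Dropping the hottest bins before `candidate_merges` is exactly what would stop a single high-coverage region from chaining tens of thousands of barcodes into one merge target.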

pontushojer commented 4 years ago

After merging https://github.com/NBISweden/BLR/pull/30 it is clear that this has become a big issue. From a test run on a SM10 dataset about 40% of the reads were filtered out, which is a substantial loss.