databio / pepatac

A modular, containerized pipeline for ATAC-seq data processing
http://pepatac.databio.org
BSD 2-Clause "Simplified" License

Remove duplicates first or filter low mapping quality first? #148

Closed yangmqglobe closed 3 years ago

yangmqglobe commented 4 years ago

So I checked this by switching the order of these two steps. The result is:

$ ls *_*_*.bam|xargs -t -I {} samtools view -c {}
samtools view -c SRR7866359_dedup_filter.bam
19161434
samtools view -c SRR7866359_filter_dedup.bam
19179420

Obviously the results differ. But why? Is it because a read with low mapping quality can be paired with a read with high mapping quality? Should we treat such reads as unpaired and filter them out before calling peaks?
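To put the two counts in perspective, here is a quick calculation of the gap between them. It uses only the numbers reported above; the variable names are illustrative.

```python
# Read counts reported by `samtools view -c` in the comment above.
dedup_then_filter = 19161434  # SRR7866359_dedup_filter.bam
filter_then_dedup = 19179420  # SRR7866359_filter_dedup.bam

# Absolute gap between the two orderings.
difference = filter_then_dedup - dedup_then_filter

# Gap relative to the larger of the two files.
fraction = difference / filter_then_dedup

print(f"difference: {difference} reads")
print(f"share of the larger file: {fraction:.4%}")
```

The gap is 17986 reads, so the two orderings disagree on well under 1% of the retained reads.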

jpsmith5 commented 4 years ago

Okay. So to confirm I follow, here's what you did:

Version1 (same as PEPATAC): For SRR7866359_filter_dedup.bam you 1) aligned, 2) removed low quality reads, 3) removed duplicates

Version2: For SRR7866359_dedup_filter.bam you 1) aligned, 2) removed duplicates, 3) removed low quality reads

--

I grabbed SRR7866359 and performed the same. I then intersected the two bam files with bedtools intersect.

Just like in your example, filtering first leaves more reads than the reverse. Based on the intersect, there are no reads in the "Version2" approach that are unique. Meaning, while the Version1 approach retains extra reads, doing it with Version2 doesn't retain a different set of unique reads, just fewer total reads. So that's interesting.
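One way such a subset relationship could arise is sketched below as a toy model. This is a hypothesis, not something confirmed in this thread: it assumes the deduplication step picks which copy to keep by summed base quality, without looking at mapping quality. The reads, the `MIN_MAPQ` threshold, and the helper names are all made up for illustration.

```python
MIN_MAPQ = 10  # hypothetical mapping-quality threshold

# (name, position, MAPQ, summed base quality): made-up reads for illustration
READS = [
    ("A", 100, 40, 30),
    ("B", 100, 5, 38),   # duplicate of A: better base quality, poor MAPQ
    ("C", 200, 40, 35),
    ("D", 200, 40, 33),  # ordinary duplicate of C
    ("E", 300, 3, 36),   # low MAPQ, no duplicate
]

def filter_mapq(reads):
    """Drop reads whose MAPQ is below the threshold."""
    return [r for r in reads if r[2] >= MIN_MAPQ]

def dedup(reads):
    """Keep one read per position: the one with the highest summed base quality.

    ASSUMPTION: the choice ignores MAPQ entirely.
    """
    best = {}
    for r in reads:
        pos = r[1]
        if pos not in best or r[3] > best[pos][3]:
            best[pos] = r
    return list(best.values())

version1 = {r[0] for r in dedup(filter_mapq(READS))}  # filter, then dedup
version2 = {r[0] for r in filter_mapq(dedup(READS))}  # dedup, then filter

print("version1:", sorted(version1))
print("version2:", sorted(version2))
print("unique to version2:", sorted(version2 - version1))
print("unique to version1:", sorted(version1 - version2))
```

In this toy model, deduplicating first keeps the low-MAPQ copy `B` at position 100 and the filter then discards it, so nothing survives there. Filtering first removes `B` and lets `A` through. Version2 ends up a strict subset of Version1, which is at least consistent with the intersect result.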

Then, I looked at where those reads unique to the Version1 approach mapped (checking the result of the bedtools intersect call). Here's that result:

samtools idxstats SRR7866359_reads_unique_to_version1_approach.bam
chr1    248956422       2187    0
chr2    242193529       2358    0
chr3    198295559       1438    0
chr4    190214555       1625    0
chr5    181538259       1530    0
chr6    170805979       1259    0
chr7    159345973       1785    0
chr8    145138636       1109    0
chr9    138394717       1352    0
chr10   133797422       1369    0
chr11   135086622       964     0
chr12   133275309       1269    0
chr13   114364328       670     0
chr14   107043718       614     0
chr15   101991189       869     0
chr16   90338345        947     0
chr17   83257441        906     0
chr18   80373285        627     0
chr19   58617616        590     0
chr20   64444167        620     0
chr21   46709983        402     0
chr22   50818468        547     0
chrX    156040895       1332    0
chrY    57227415        76      0
chrM    16569   0       0
chr1_KI270706v1_random  175055  17      0
chr1_KI270707v1_random  32032   2       0
chr1_KI270708v1_random  127682  0       0
chr1_KI270709v1_random  66860   23      0
chr1_KI270710v1_random  40176   0       0
chr1_KI270711v1_random  42210   2       0
chr1_KI270712v1_random  176043  2       0
chr1_KI270713v1_random  40745   0       0
chr1_KI270714v1_random  41717   6       0
chr2_KI270715v1_random  161471  2       0
chr2_KI270716v1_random  153799  4       0
chr3_GL000221v1_random  155397  2       0
chr4_GL000008v2_random  209709  22      0
chr5_GL000208v1_random  92689   5       0
chr9_KI270717v1_random  40062   4       0
chr9_KI270718v1_random  38054   1       0
chr9_KI270719v1_random  176845  4       0
chr9_KI270720v1_random  39050   4       0
chr11_KI270721v1_random 100316  0       0
chr14_GL000009v2_random 201709  27      0
chr14_GL000225v1_random 211173  56      0
chr14_KI270722v1_random 194050  0       0
chr14_GL000194v1_random 191469  25      0
chr14_KI270723v1_random 38115   2       0
chr14_KI270724v1_random 39555   0       0
chr14_KI270725v1_random 172810  14      0
chr14_KI270726v1_random 43739   0       0
chr15_KI270727v1_random 448248  3       0
chr16_KI270728v1_random 1872759 28      0
chr17_GL000205v2_random 185591  8       0
chr17_KI270729v1_random 280839  19      0
chr17_KI270730v1_random 112551  1       0
chr22_KI270731v1_random 150754  3       0
chr22_KI270732v1_random 41543   6       0
chr22_KI270733v1_random 179772  13      0
chr22_KI270734v1_random 165050  2       0
chr22_KI270735v1_random 42811   6       0
chr22_KI270736v1_random 181920  6       0
chr22_KI270737v1_random 103838  4       0
chr22_KI270738v1_random 99375   0       0
chr22_KI270739v1_random 73985   0       0
chrY_KI270740v1_random  37240   0       0
chrUn_KI270302v1        2274    0       0
chrUn_KI270304v1        2165    0       0
chrUn_KI270303v1        1942    2       0
chrUn_KI270305v1        1472    0       0
chrUn_KI270322v1        21476   0       0
chrUn_KI270320v1        4416    0       0
chrUn_KI270310v1        1201    0       0
chrUn_KI270316v1        1444    0       0
chrUn_KI270315v1        2276    0       0
chrUn_KI270312v1        998     0       0
chrUn_KI270311v1        12399   0       0
chrUn_KI270317v1        37690   0       0
chrUn_KI270412v1        1179    0       0
chrUn_KI270411v1        2646    0       0
chrUn_KI270414v1        2489    0       0
chrUn_KI270419v1        1029    0       0
chrUn_KI270418v1        2145    0       0
chrUn_KI270420v1        2321    0       0
chrUn_KI270424v1        2140    0       0
chrUn_KI270417v1        2043    0       0
chrUn_KI270422v1        1445    0       0
chrUn_KI270423v1        981     0       0
chrUn_KI270425v1        1884    0       0
chrUn_KI270429v1        1361    0       0
chrUn_KI270442v1        392061  27      0
chrUn_KI270466v1        1233    0       0
chrUn_KI270465v1        1774    0       0
chrUn_KI270467v1        3920    1       0
chrUn_KI270435v1        92983   5       0
chrUn_KI270438v1        112505  21      0
chrUn_KI270468v1        4055    0       0
chrUn_KI270510v1        2415    0       0
chrUn_KI270509v1        2318    0       0
chrUn_KI270518v1        2186    0       0
chrUn_KI270508v1        1951    0       0
chrUn_KI270516v1        1300    0       0
chrUn_KI270512v1        22689   0       0
chrUn_KI270519v1        138126  13      0
chrUn_KI270522v1        5674    0       0
chrUn_KI270511v1        8127    0       0
chrUn_KI270515v1        6361    3       0
chrUn_KI270507v1        5353    2       0
chrUn_KI270517v1        3253    0       0
chrUn_KI270529v1        1899    0       0
chrUn_KI270528v1        2983    0       0
chrUn_KI270530v1        2168    0       0
chrUn_KI270539v1        993     0       0
chrUn_KI270538v1        91309   7       0
chrUn_KI270544v1        1202    0       0
chrUn_KI270548v1        1599    0       0
chrUn_KI270583v1        1400    0       0
chrUn_KI270587v1        2969    0       0
chrUn_KI270580v1        1553    1       0
chrUn_KI270581v1        7046    0       0
chrUn_KI270579v1        31033   2       0
chrUn_KI270589v1        44474   0       0
chrUn_KI270590v1        4685    0       0
chrUn_KI270584v1        4513    1       0
chrUn_KI270582v1        6504    2       0
chrUn_KI270588v1        6158    0       0
chrUn_KI270593v1        3041    2       0
chrUn_KI270591v1        5796    0       0
chrUn_KI270330v1        1652    0       0
chrUn_KI270329v1        1040    2       0
chrUn_KI270334v1        1368    0       0
chrUn_KI270333v1        2699    0       0
chrUn_KI270335v1        1048    0       0
chrUn_KI270338v1        1428    0       0
chrUn_KI270340v1        1428    0       0
chrUn_KI270336v1        1026    0       0
chrUn_KI270337v1        1121    2       0
chrUn_KI270363v1        1803    0       0
chrUn_KI270364v1        2855    0       0
chrUn_KI270362v1        3530    0       0
chrUn_KI270366v1        8320    0       0
chrUn_KI270378v1        1048    0       0
chrUn_KI270379v1        1045    0       0
chrUn_KI270389v1        1298    0       0
chrUn_KI270390v1        2387    0       0
chrUn_KI270387v1        1537    0       0
chrUn_KI270395v1        1143    0       0
chrUn_KI270396v1        1880    0       0
chrUn_KI270388v1        1216    0       0
chrUn_KI270394v1        970     0       0
chrUn_KI270386v1        1788    0       0
chrUn_KI270391v1        1484    0       0
chrUn_KI270383v1        1750    0       0
chrUn_KI270393v1        1308    2       0
chrUn_KI270384v1        1658    0       0
chrUn_KI270392v1        971     0       0
chrUn_KI270381v1        1930    0       0
chrUn_KI270385v1        990     0       0
chrUn_KI270382v1        4215    0       0
chrUn_KI270376v1        1136    0       0
chrUn_KI270374v1        2656    0       0
chrUn_KI270372v1        1650    0       0
chrUn_KI270373v1        1451    0       0
chrUn_KI270375v1        2378    0       0
chrUn_KI270371v1        2805    0       0
chrUn_KI270448v1        7992    1       0
chrUn_KI270521v1        7642    2       0
chrUn_GL000195v1        182896  15      0
chrUn_GL000219v1        179198  13      0
chrUn_GL000220v1        161802  17      0
chrUn_GL000224v1        179693  21      0
chrUn_KI270741v1        157432  0       0
chrUn_GL000226v1        15008   0       0
chrUn_GL000213v1        164239  2       0
chrUn_KI270743v1        210658  16      0
chrUn_KI270744v1        168472  22      0
chrUn_KI270745v1        41891   2       0
chrUn_KI270746v1        66486   4       0
chrUn_KI270747v1        198735  7       0
chrUn_KI270748v1        93321   0       0
chrUn_KI270749v1        158759  8       0
chrUn_KI270750v1        148850  8       0
chrUn_KI270751v1        150742  11      0
chrUn_KI270752v1        27745   2       0
chrUn_KI270753v1        62944   0       0
chrUn_KI270754v1        40191   2       0
chrUn_KI270755v1        36723   0       0
chrUn_KI270756v1        79590   4       0
chrUn_KI270757v1        71251   13      0
chrUn_GL000214v1        137718  7       0
chrUn_KI270742v1        186739  10      0
chrUn_GL000216v2        176608  31      0
chrUn_GL000218v1        161147  14      0
chrEBV  171823  0       0
*       0       0       0

Okay, so we learn there that they do map across the genome.
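A listing that long is easier to judge once summarized. Below is a minimal sketch that parses `samtools idxstats` text (columns: reference name, length, mapped reads, unmapped reads) and reports totals and per-megabase density. The sample holds a handful of lines copied from the output above; the function name is illustrative.

```python
# A few lines copied from the idxstats output above (whitespace-separated).
IDXSTATS_SAMPLE = """\
chr1    248956422       2187    0
chr2    242193529       2358    0
chrY    57227415        76      0
chrM    16569   0       0
chr14_GL000225v1_random 211173  56      0
*       0       0       0
"""

def parse_idxstats(text):
    """Return {reference: (length, mapped_reads)} from idxstats text."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, _unmapped = line.split()
        stats[name] = (int(length), int(mapped))
    return stats

stats = parse_idxstats(IDXSTATS_SAMPLE)

# Total reads across the sampled references.
total = sum(mapped for _, mapped in stats.values())

# References that received at least one read.
with_reads = [name for name, (_, mapped) in stats.items() if mapped > 0]

print("total mapped reads in sample:", total)
for name in with_reads:
    length, mapped = stats[name]
    print(f"{name}: {mapped / (length / 1e6):.2f} reads per Mb")
```

Run on the full output rather than this sample, the same summary would show whether the extra reads are spread roughly in proportion to chromosome length or concentrated in a few contigs.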

If we inspect the SAM flags, these reads carry only four values: 99/147 and 83/163. Those are the flags of properly paired reads: in each pair, one mate maps to the forward strand (99 or 163) and the other to the reverse strand (147 or 83). Not sure yet what it is about these pairs that lets them be retained when QC filtering is performed before deduplication... Will keep investigating.
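Decoding the four flag values bit by bit makes it easy to check which properties they actually share. Here is a minimal sketch using the FLAG bit definitions from the SAM specification; the helper name and the label strings are illustrative.

```python
# Bit meanings of the SAM FLAG field (from the SAM format specification).
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM flag (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]

flags = [99, 147, 83, 163]
decoded = {flag: decode_flag(flag) for flag in flags}

for flag, names in decoded.items():
    print(flag, names)

# Properties common to all four flags.
shared = set.intersection(*(set(names) for names in decoded.values()))
print("shared:", sorted(shared))
```

The only bits common to all four values are `paired` and `proper_pair`. The `reverse_strand` bit is set for 147 and 83 but not for 99 and 163, whose mates are the reversed ones.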