cerebis / qc3C

Reference-free quality assessment for Hi-C sequencing data
GNU Affero General Public License v3.0

Validate on less commonly used enzymes #19

Closed: cerebis closed this issue 4 years ago

cerebis commented 5 years ago

Though I believe the code is correct, my first test with different enzymes (the traditionally used 6-cutters in this case) produces very low estimates of Hi-C signal. Experimentally, however, bin3C genome binning of this dataset was fine, so there is real signal here.

  1. For either approach, predicting the junction sequence is key and should be validated first.
  2. Verify that no hard-coded assumptions specific to 4-cutters remain in the code.
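
As a starting point for the first item, here is a minimal sketch of how the expected junction can be derived for a 5'-overhang enzyme after end fill-in and blunt-end ligation. It is not taken from the qc3C source; the `ENZYMES` table and `junction_sequence` helper are hypothetical names used only for illustration, and the recognition sites are the standard ones for these enzymes.

```python
# Illustrative only: not taken from the qc3C source.
# Each entry is (recognition site, cut offset on the top strand).
ENZYMES = {
    "DpnII": ("GATC", 0),      # ^GATC, a 4-cutter
    "HindIII": ("AAGCTT", 1),  # A^AGCTT, a 6-cutter
    "NcoI": ("CCATGG", 1),     # C^CATGG, a 6-cutter
}


def junction_sequence(site: str, cut: int) -> str:
    """Junction left by 5'-overhang fill-in followed by blunt-end ligation."""
    if not 0 <= cut <= len(site) // 2:
        raise ValueError("this sketch only handles 5' overhangs and blunt cuts")
    # The filled-in left end keeps the site up to the mirrored cut position;
    # the filled-in right end keeps the site from the cut position onward.
    return site[: len(site) - cut] + site[cut:]


for name, (site, cut) in ENZYMES.items():
    print(f"{name}: {junction_sequence(site, cut)}")
```

Comparing these strings against whatever junction qc3C derives for the same enzyme names would show quickly whether the 6-cutter case is handled the same way as the 4-cutter case.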
cerebis commented 5 years ago

The BAM-based approach predicts signal as follows.

Here, the Hi-C fraction is measured only as the adjusted number of read-thru events, which will be an underestimate.

Summing both enzymes would give 0.72-1.27%.
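
The combined range is just the sum of the two per-enzyme intervals reported in this run's log. A small sketch of that arithmetic, assuming each read-thru event is attributed to only one enzyme so the fractions can be added (the `estimates` dictionary is an illustrative name, not a qc3C structure):

```python
# Per-enzyme (low, high) Hi-C fraction estimates in percent, copied from the log.
estimates = {"HindIII": (0.44, 0.77), "NcoI": (0.28, 0.50)}

# Add the lower bounds and the upper bounds separately; round to the
# two decimal places used in the log to hide floating-point noise.
low = round(sum(lo for lo, _ in estimates.values()), 2)
high = round(sum(hi for _, hi in estimates.values()), 2)

print(f"Combined Hi-C fraction: {low:.2f}-{high:.2f}%")
```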

For HindIII using observed data, adjusted estimation of Hi-C fraction: (0.44-0.77%)
For NcoI using observed data, adjusted estimation of Hi-C fraction: (0.28-0.50%)
Long-range distance intervals:    1000nt,    5000nt,   10000nt
Number of cis-mapping pairs:     67,126,    31,001,    17,533
Relative fraction of all cis:         0.09139,   0.04221,   0.02387
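
As a sanity check on the last two rows, the relative fractions can be reproduced by dividing each long-range count by the total number of cis-mapping pairs reported in the full log (734,515). This is a sketch of the arithmetic only; the variable names are illustrative and how qc3C defines each distance interval is not assumed here.

```python
# Values copied from the qc3C log for this run.
cis_pairs = 734_515  # "Number of pairs cis-mapping"
long_range = {1000: 67_126, 5000: 31_001, 10000: 17_533}

# Fraction of all cis-mapping pairs counted at each distance interval,
# rounded to the five decimal places shown in the log.
fractions = {dist: round(n / cis_pairs, 5) for dist, n in long_range.items()}

for dist, frac in fractions.items():
    print(f"{dist:>6}nt: {frac:.5f}")
```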

Full qc3C.log

DEBUG    | 2019-05-27 15:06:51,856 |    main | 3.7.1 | packaged by conda-forge | (default, Feb 26 2019, 04:48:14)  [GCC 7.3.0]
DEBUG    | 2019-05-27 15:06:51,856 |    main | Command line: /shared/homes/120274/miniconda3/envs/qc3C/bin/qc3C bam -m 300 -t 4 -p 0.1 -s 1234 -e HindIII -e NcoI -b data_sets/ferment/SRR5890764_clean_rename.bam
INFO     | 2019-05-27 15:06:51,856 | qc3C.bam_based | Acceptance threshold: 0.100
INFO     | 2019-05-27 15:06:51,856 | qc3C.utils | Random seed was 1234
INFO     | 2019-05-27 15:06:51,856 | qc3C.bam_based | Counting alignments in data_sets/ferment/SRR5890764_clean_rename.bam
INFO     | 2019-05-27 15:06:56,166 | qc3C.bam_based | Found 40,321,543 alignments to analyse
INFO     | 2019-05-27 15:06:56,321 | qc3C.bam_based | Beginning analysis...
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of parsed reads: 40,321,543
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of analysed reads: 12,751,826 (31.63% of all)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of reads filtered [unmapped]: 0 (0.00% of analyzed)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of reads filtered [low mapq]: 965,849 (7.57% of analyzed)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of reads filtered [ref length]: 668,945 (5.25% of analyzed)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of reads filtered [secondary]: 0 (0.00% of analyzed)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of reads filtered [supplementary]: 0 (0.00% of analyzed)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of reads filtered [weak mapping]: 496,700 (3.90% of analyzed)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of reads filtered [ref terminated]: 192,579 (1.51% of analyzed)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of accepted reads: 11,162,777 (87.54% of analyzed)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of pairs resulting from accepted read pool: 1,603,037
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of pairs trans-mapping: 868,522 (54.18% of pairs)
INFO     | 2019-05-27 15:09:20,633 | qc3C.bam_based | Number of pairs cis-mapping: 734,515 (45.82% of pairs)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | Number of paired reads that fully align: 1,339,902 (91.21% of paired)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | Number of paired reads whose alignment terminates early: 129,128 (8.79% of paired)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | Number of paired reads not ending in a cut-site: 1,447,846 (98.56% of paired)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | Number of short-range cis-mapping pairs: 572,654 (77.96% of cis)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | Observed short-range mean pair separation differs from supplied insert length by -6.1%
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | Observed short-range mean and median of pair separation: 282nt 234nt
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | Observed mean read length for paired reads: 96nt
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For supplied insert length of 300nt, estimated unobserved fraction: 0.3622
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For observed insert length of 282nt, estimated unobserved fraction: 0.3205
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For HindIII, the expected fraction by random chance at 50% GC: 0.10%
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For HindIII, number of paired reads whose alignment ends at cut-site: 12,509 (0.85%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For HindIII, number of paired reads that fully aligned and end with cut-site: 3,911 (0.27%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For HindIII, upper bound of read-thru events: 8,598 (0.59%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For HindIII, number of paired reads with observable read-thru: 4,881 (0.33%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For HindIII, number of paired reads with read-thru and split alignment: 2,210 (0.15%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For HindIII using observed data, adjusted estimation of Hi-C fraction: (0.44-0.77%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For NcoI, the expected fraction by random chance at 50% GC: 0.10%
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For NcoI, number of paired reads whose alignment ends at cut-site: 8,675 (0.59%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For NcoI, number of paired reads that fully aligned and end with cut-site: 3,167 (0.22%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For NcoI, upper bound of read-thru events: 5,508 (0.37%)
INFO     | 2019-05-27 15:09:20,634 | qc3C.bam_based | For NcoI, number of paired reads with observable read-thru: 3,137 (0.21%)
INFO     | 2019-05-27 15:09:20,635 | qc3C.bam_based | For NcoI, number of paired reads with read-thru and split alignment: 1,508 (0.10%)
INFO     | 2019-05-27 15:09:20,635 | qc3C.bam_based | For NcoI using observed data, adjusted estimation of Hi-C fraction: (0.28-0.50%)
INFO     | 2019-05-27 15:09:20,635 | qc3C.bam_based | Number of paired reads with insufficient flank to test for junction: 1,217 (0.08%)
INFO     | 2019-05-27 15:09:20,635 | qc3C.bam_based | Long-range distance intervals:     1000nt,    5000nt,   10000nt
INFO     | 2019-05-27 15:09:20,635 | qc3C.bam_based | Number of cis-mapping pairs:     67,126,    31,001,    17,533
INFO     | 2019-05-27 15:09:20,635 | qc3C.bam_based | Relative fraction of all cis:  0.09139,   0.04221,   0.02387
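
The per-enzyme counts above appear to be internally consistent: for both enzymes, the logged upper bound of read-thru events equals the number of paired reads whose alignment ends at a cut-site minus the number that fully aligned and end with a cut-site. The sketch below only checks that arithmetic against the logged values; the dictionary keys are illustrative names, and the relationship is inferred from the numbers rather than from the qc3C source.

```python
# Counts copied from the log lines above.
counts = {
    "HindIII": {"ends_at_site": 12_509, "full_and_site": 3_911, "logged_bound": 8_598},
    "NcoI": {"ends_at_site": 8_675, "full_and_site": 3_167, "logged_bound": 5_508},
}

# Reads ending at a cut-site, excluding those that aligned along their full
# length, compared with the upper bound that the log reports.
derived = {
    enzyme: c["ends_at_site"] - c["full_and_site"] for enzyme, c in counts.items()
}

for enzyme, bound in derived.items():
    status = "matches" if bound == counts[enzyme]["logged_bound"] else "differs from"
    print(f"{enzyme}: derived upper bound {bound:,} {status} the logged value")
```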
cerebis commented 4 years ago

SRA link to this dataset

https://www.ncbi.nlm.nih.gov/sra/?term=SRR5890764