cerebis / bin3C

Extract metagenome-assembled genomes (MAGs) from metagenomic data using Hi-C.
GNU Affero General Public License v3.0
23 stars 7 forks source link

One huge cluster #39

Open orctyr opened 2 years ago

orctyr commented 2 years ago

Hi cerebis,

I used bin3c to build mags from hi-c (50G WGS and 30G Hi-C). bin3c two steps ran successfully. But the mags status was a little strange. There are more than 200 mags (total 230Mb) and the first MAG contained 220Mb. I had annotated each contig by CAT and obviously there were many different species within it. I had try to change --min-signal from 5 to 10. But it did not work. Is there any suggestion about this situation?

Best, Orctyr

cerebis commented 2 years ago

Hi Orctyr, although I have seen large clusters myself, they've been on the order of a few times the expected size of a bacterial genome, rather than 100s of times.

Can you post the bin3C log from mkmap and cluster here?

orctyr commented 2 years ago

Hi cerebis,

  1. BAM file bwa mem -5SP contigs.fasta hic_paired.fastq.gz | samtools view -F 0x904 -bS - > hic2ctg_unsorted.bam samtools sort -@ 8 -n hic2ctg_unsorted.bam hic2ctg #final got hic2ctg.bam file for bin3c
  2. QC step Trimmomatic was used to remove low-quality reads and adaptors

bin3c_mkmap.log DEBUG | 2021-11-22 09:00:21,456 | main | bin3C v0.1.1 DEBUG | 2021-11-22 09:00:21,456 | main | 2.7.15 | packaged by conda-forge | (default, Mar 5 2020, 14:56:06) [GCC 7.3.0] DEBUG | 2021-11-22 09:00:21,456 | main | Command line: bin3C.py mkmap -e Sau3AI --eta -v H54_mNGS_megahit.fa hic2ctg-3.bam bin3c_mkmap INFO | 2021-11-22 09:00:55,002 | mzd.contact_map | Reading sequences... INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | Accepted 70244 sequences covering 426495174 bp INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | References excluded: {'seq_missing': 0, 'too_short': 0} INFO | 2021-11-22 09:00:55,197 | mzd.contact_map | Counting reads in bam file... INFO | 2021-11-22 09:04:11,240 | mzd.contact_map | BAM file contains 175038961 alignments INFO | 2021-11-22 09:25:55,360 | mzd.contact_map | Pair accounting: OrderedDict([('poor_match', 23821762), ('not_tip', 0), ('short_insert', 0), ('ref_excluded', 0), ('accepted', 62833944), ('median_excluded', 0), ('end_buffered', 0)]) INFO | 2021-11-22 09:25:55,481 | mzd.contact_map | Total extent map weight 96513729 DEBUG | 2021-11-22 09:25:55,483 | mzd.contact_map | Setting primary acceptance mask with filtering criterion min_len: 1000 min_sig: 5 DEBUG | 2021-11-22 09:25:55,484 | mzd.contact_map | Minimum length threshold removing: 0 DEBUG | 2021-11-22 09:26:01,399 | mzd.contact_map | Minimum signal threshold removing: 35404 DEBUG | 2021-11-22 09:26:01,409 | mzd.contact_map | Accepted sequences: 34840 INFO | 2021-11-22 09:26:01,613 | main | Saving contact map instance

bin3c_clust.log DEBUG | 2021-11-22 09:36:41,887 | main | bin3C v0.1.1 DEBUG | 2021-11-22 09:36:41,887 | main | 2.7.15 | packaged by conda-forge | (default, Mar 5 2020, 14:56:06) [GCC 7.3.0] DEBUG | 2021-11-22 09:36:41,887 | main | Command line: bin3C.py cluster --no-spades -v bin3c_mkmap/contact_map.p.gz bin3c_clust2 INFO | 2021-11-22 09:36:41,887 | main | Generated random seed: 6675652 INFO | 2021-11-22 09:36:41,887 | main | Loading existing contact map from: bin3c_mkmap/contact_map.p.gz DEBUG | 2021-11-22 09:36:51,569 | mzd.contact_map | Setting primary acceptance mask with filtering criterion min_len: 1000 min_sig: 5 DEBUG | 2021-11-22 09:36:51,569 | mzd.contact_map | Using existing mask INFO | 2021-11-22 09:36:51,569 | mzd.contact_map | Preparing sequence map with full dimensions: (70244, 70244) DEBUG | 2021-11-22 09:36:54,391 | mzd.contact_map | Doing site based normalisation DEBUG | 2021-11-22 09:36:54,776 | mzd.contact_map | Map normalized DEBUG | 2021-11-22 09:36:54,777 | mzd.contact_map | Balancing contact map WARNING | 2021-11-22 09:36:59,343 | mzd.sparse_utils | treating 18534 zeros on diagonal as ones DEBUG | 2021-11-22 09:40:47,934 | mzd.sparse_utils | It took 431 iterations to achieve bistochasticity DEBUG | 2021-11-22 09:40:52,393 | mzd.contact_map | Map balanced INFO | 2021-11-22 09:43:57,837 | mzd.contact_map | After removing filtered sequences map dimensions: (34840, 34840) INFO | 2021-11-22 09:43:57,841 | mzd.cluster | Graph will have 34840 nodes DEBUG | 2021-11-22 09:44:03,767 | mzd.cluster | Building graph from edges INFO | 2021-11-22 09:45:24,617 | mzd.cluster | Finished: Name: contact_graph Type: Graph Number of nodes: 34840 Number of edges: 6085103 Average degree: 349.3170 INFO | 2021-11-22 09:45:24,677 | mzd.cluster | Clustering contact graph using method: infomap INFO | 2021-11-22 10:17:01,627 | mzd.cluster | Clustering using infomap resulted in 3810 clusters INFO | 2021-11-22 10:17:12,404 | mzd.cluster | Analyzing the contents of each cluster INFO | 2021-11-22 10:17:12,404 | mzd.cluster | Building random access index for input FASTA sequences INFO | 2021-11-22 10:17:38,473 | mzd.cluster | Writing output to the path: bin3c_clust2 DEBUG | 2021-11-22 10:17:40,724 | mzd.cluster | Writing full unordered FASTA for cluster 0 to bin3c_clust2/fasta/CL0001.fna ...... DEBUG | 2021-11-22 10:18:07,656 | mzd.cluster | Writing full unordered FASTA for cluster 3809 to bin3c_clust2/fasta/CL3810.fna INFO | 2021-11-22 10:18:07,659 | mzd.cluster | Plotting heatmap of complete solution INFO | 2021-11-22 10:18:07,663 | mzd.cluster | Clusters passing minimum extent criterion: 857 INFO | 2021-11-22 10:18:07,698 | mzd.cluster | Total number of sequences in the clustering: 18851 INFO | 2021-11-22 10:18:07,699 | mzd.cluster | After joining with active sequence mask map: 18851 INFO | 2021-11-22 10:19:51,509 | mzd.contact_map | After removing filtered sequences map dimensions: (18851, 18851) DEBUG | 2021-11-22 10:19:53,439 | mzd.contact_map | Map reordered WARNING | 2021-11-22 10:19:53,520 | py.warnings | compressed.py:708: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. self[i, j] = values

DEBUG | 2021-11-22 10:20:14,344 | mzd.contact_map | Making raster image DEBUG | 2021-11-22 10:21:06,982 | matplotlib.font_manager | findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans (u'*/miniconda3/envs/py2.7/lib/python2.7/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000 DEBUG | 2021-11-22 10:25:57,023 | mzd.contact_map | Saving plot

cerebis commented 2 years ago

Thanks @orctyr.

None of that log stands out to me as problematic.

Switching to a development branch

I have overlooked that the bin3C master branch has fallen quite far behind my development work. To begin with, can you please checkout the Python 3 branch and see what you get as a result without any of the work below.

pip install git+https://github.com/cerebis/bin3C@py3

The use of bin3C should be the same as before.

Using this codebase will ensure that my suggestions below will be possible.

QA testing with qc3C.

Would you happen to have any quality metrics for your Hi-C library? Ideally, it would be great if you could run qc3C over your data. Since you have a BAM file already, I would suggest using:

qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out.

You can get qc3C from DockerHub: docker pull cerebis/qc3C and the command would then require mapping your data folder, something like:

docker run -v $PWD/mydata:/app cerebis/qc3c bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out

qc3C is also installable via conda on Linux.

qc3C produces a log but the output to the console is just as informative. If you could post that here too.

It will help on two fronts. 1) It will establish the quality of your Hi-C library and 2) it will give us an estimate of library fragment size. We can then use this information to restrict contact map generation to include only pairs with sufficiently large separation.

More stringent contact map generation

Taking the estimate for mean insert/fragment size from qc3C, we can restrict contact map generation to only including pairs with larger separation distance. This helps to exclude noise (spurious pairs) from incorperation. I would normally take 3x the estimated fragment size from qc3C -- or in a pinch (qc3C isn't working for you) try 1000 bp.

bin3C mkmap -e Sau3AI --min-insert 1000 --eta -v H54_mNGS_megahit.fa hic2ctg-3.bam bin3c_mkmap

orctyr commented 2 years ago

Hi Cerebis,

Your cmd: qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg_unsorted.bam --output-path qc3c_out. "hic2ctg_unsorted.bam" or "hic2ctg.bam" ? In your qc3C github website, it need sorted bam file "bwa index ref.fa bwa mem -5SP ref.fna.gz hic_reads.fq.gz | samtools view -bS - | samtools sort -n -o hic_to_ref.bam - "

cerebis commented 2 years ago

Yes, sorry.

In reality, both bin3C and qc3C expect a name sorted BAM file. I think the older release (master branch) of bin3C may not make an explicit check for ordering, whereas the newer py3 branch will object if your BAM does not declare a sorting order in the header.

Why this worked without explicitly sorting is because the default out of bwa is effectively name sorted. However, the header is not set to declare it. I am not sure if this ws a conscious developer decision or not -- possibly because bwa does not want to guarantee the order.

orctyr commented 2 years ago

Hi Cerebis,

qc3c bam module log file: DEBUG | 2021-11-23 21:23:39,461 | main | qc3C 0.5 DEBUG | 2021-11-23 21:23:39,463 | main | 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) [GCC 9.4.0] DEBUG | 2021-11-23 21:23:39,463 | main | Command line: qc3C bam --enzyme Sau3AI --fasta H54_mNGS_megahit.fa --bam hic2ctg-3.bam --output-path qc3c_out INFO | 2021-11-23 21:23:39,469 | qc3C.ligation | No cut-site database cache found for digest involving Sau3AI INFO | 2021-11-23 21:23:39,808 | qc3C.ligation | Building cut-site database against reference sequence and Sau3AI INFO | 2021-11-23 21:24:32,369 | qc3C.ligation | Saving cut-site database for reuse INFO | 2021-11-23 21:24:37,782 | qc3C.bam_based | Accepting all usable reads INFO | 2021-11-23 21:24:37,783 | qc3C.utils | Random seed was not set, using 20098262 INFO | 2021-11-23 21:24:37,783 | qc3C.bam_based | Beginning analysis... INFO | 2021-11-23 21:24:37,783 | qc3C.bam_based | Counting alignments in hic2ctg-3.bam INFO | 2021-11-23 21:27:30,532 | qc3C.bam_based | Found 175,038,961 alignments to analyse INFO | 2021-11-23 22:06:08,170 | qc3C.bam_based | Number of parsed reads: 175,038,961 INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of analysed reads: 175,038,961 (100.0% of all) INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [unmapped]: 0 (0.000% of analysed) INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [low mapq]: 16,349,134 (9.340% of analysed) INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [ref length]: 0 (0.000% of analysed) INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [secondary]: 0 (0.000% of analysed) INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [supplementary]: 0 (0.000% of analysed) INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [weak mapping]: 11,017,368 (6.294% of analysed) INFO | 2021-11-23 22:06:08,171 | qc3C.bam_based | Number of reads filtered [ref terminated]: 198,634 (0.1135% of analysed) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of reads filtered [duplicate]: 0 (0.000% of analysed) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of accepted reads: 147,473,825 (84.25% of analysed) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs analysed from accepted read pool: 95,256,963 INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs accepted: 95,256,963 INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs trans-mapping: 65,765,001 (69.04% of pairs) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs cis-mapping: 29,491,962 (30.96% of pairs) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of paired reads that fully align: 130,544,122 (68.52% of paired) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of paired reads whose alignment terminates early: 59,969,804 (31.48% of paired) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of paired reads not ending in cut-site remnant: 154,932,329 (81.32% of paired) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of short-range cis-mapping pairs: 28,798,625 (97.65% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [dangling end]: 24,972,678 (84.68% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [self-circle]: 2,940,463 (9.970% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [invalid FF/RR]: 84,408 (0.2862% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [religation]: 340,469 (1.154% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [valid FF/RR]: 214,956 (0.7289% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [valid RF]: 804,755 (2.729% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of pairs [valid FR]: 474,702 (1.610% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of uninformative pairs: 28,338,018 (96.09% of cis) INFO | 2021-11-23 22:06:08,172 | qc3C.bam_based | Number of informative pairs: 1,494,413 (5.067% of cis) INFO | 2021-11-23 22:06:39,712 | qc3C.bam_based | Estimated insert size mean and median: 282nt 281nt INFO | 2021-11-23 22:06:39,712 | qc3C.bam_based | Observed mean read length for paired reads: 147nt INFO | 2021-11-23 22:06:40,098 | qc3C.bam_based | For observed insert size of 281nt, estimated unobserved fraction: 0.01068 INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The enzyme Sau3AI has cut-site GATC INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The enzymatic combination Sau3AI/Sau3AI produces the 8nt ligation junction GATCGATC INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The number of paired reads which began with complete cut-site: 60,145,525 (31.57%) INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The digest contains the following possible remnants ['GATC'] INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The number of paired reads whose alignment ends with cut-site remnant: 35,581,597 (18.68%) INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The number of paired reads with observable read-thru: 30,970,540 (16.26%) INFO | 2021-11-23 22:06:40,099 | qc3C.bam_based | The number of paired reads with read-thru and split alignment: 23,696,167 (12.44%) INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | Long-range intervals: 1000nt, 5000nt, 10000nt INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | Number of pairs: 693,337, 369,568, 341,458 INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | Relative to all cis: 2.351%, 1.253%, 1.158% INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | Relative to accepted: 0.7279%, 0.388%, 0.3585% INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | For Sau3AI/Sau3AI junction sequence GATCGATC found: 30970540 INFO | 2021-11-23 22:06:40,239 | qc3C.bam_based | For Sau3AI cutsite remnant GATC found: 35581597

cerebis commented 2 years ago

Metagenomic Hi-C libraries tend to have much less signal that clonal eukaryotic experiments, so low values (<10%) are not uncommon. The signal strength (proportion of informative Hi-C pairs) of your library, however, is a but lower than we'd like to see.

Because of this, I would not raise the minimum signal threshold, as there is not enough signal to spare.

A little breakdown of the qc3C log.

  1. Of 95M accepted pairs, 65M mapped across contigs.

This is good, since you need pairs spanning contigs to bin, however this value is also biased by the degree of assembly fragmentation.

  1. Of the 29M cis-mapping (same contig) pairs, 98% were short-range (that is, pair separation < 1000 bp ).

These basic statistics indicate that the majority of pairs are either not a product of Hi-C proximity ligation or are from very closely occuring loci. There is still utility in Hi-C pairs with small separation, but they lack the same power as long-range pairs in drawing contigs across a chromosome together. We are instead relying more on contacts between tips of adjacent contigs.

  1. The fraction of pairs with observable read-thru is high.

Normally read-thru is a strong indicator of Hi-C products in a library, where the join between loci is traversed by one of the reads (or both) in a pair. The join contains an artefact -- which although can occur naturally -- is also a hint for Hi-C.

The size of the insert relative to read length obviously affects how often this occurs. Since you have 150bp reads and 280 bp mean insert size, we'd expect to see it sometimes. But usually more pronounced overlap is required for frequent observation. Further, seeing read-thru and a secondary split mapping of the remain part of the read is less common. Together, I would suspect you do have a good proportion of Hi-C pairs, only the majority of them short-range.

I see you are using Sau3AI, so I was curious if this library was produced using the Phase kit?

Going forward

I would suggest you use the py3 branch and try creating two contact maps with minimum insert sizes of 700 and then 840 (2.5x and 3x insert mean).

Hopefully you will eliminate the giant cluster with one of these runs.

orctyr commented 2 years ago

Hi Cerebis,

bin3C-python3 version could not be download successfully here. Could I use bin3C-py2 to go forward?

install log: Collecting git+https://github.com/cerebis/bin3C@py3 Cloning https://github.com/cerebis/bin3C (to revision py3) to /tmp/pip-req-build-1vyhcl8u Running command git clone --filter=blob:none -q https://github.com/cerebis/bin3C /tmp/pip-req-build-1vyhcl8u Running command git checkout -b py3 --track origin/py3 fatal: unable to access 'https://github.com/cerebis/bin3C/': Empty reply from server WARNING: Discarding git+https://github.com/cerebis/bin3C@py3. Command errored out with exit status 128: git checkout -b py3 --track origin/py3 Check the logs for full command output. ERROR: Command errored out with exit status 128: git checkout -b py3 --track origin/py3 Check the logs for full command output.