c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License
122 stars 18 forks source link

Same dataset works as .bam file but not .bed file #52

Open conchoecia opened 1 year ago

conchoecia commented 1 year ago

yahs version: v1.2a.2 chromap version: v0.2.4-r467

hello, if I use chromap to generate both a .sam and the .bed file from the same Hi-C dataset, then yahs only works with the .bam file.

yahs calculates the coverage of the .bam file as 6x, and it calculates the coverage of the .bed file is 2x. The actual coverage of the reads is 25x (Genome is ~500Mbp, there are ~50 million 2x150 read pairs).

The crash in yahs with the .bed file occurs at this coverage-calculating stage, and the log is pasted below. The command was yahs -o test --no-contig-ec --no-scaffold-ec --no-mem-check myassembly.fasta my.bed. Just to reiterate, when I ran this on a .bam dataset everything worked and the run resulted in a scaffolded .fasta.

[I::main] dump hic links (BED) to binary file test.bin
[I::dump_links_from_bed_file] 1 million records processed, 499999 read pairs
[I::dump_links_from_bed_file] 2 million records processed, 999999 read pairs
[I::dump_links_from_bed_file] 3 million records processed, 1499999 read pairs
[I::dump_links_from_bed_file] 4 million records processed, 1999999 read pairs
.
.
.
[I::bed_cstats] 24 million records processed
[I::bed_cstats] 25 million records processed
[I::bed_cstats] 26 million records processed
[I::bed_cstats] 27 million records processed
[I::bed_cstats] 28 million records processed
[I::bed_cstats] 29 million records processed
[I::bed_cstats] 30 million records processed
[I::bed_cstats] position compression n = 60327756, m = 41329348, max_m = 134217727
[I::bed_cstats] total number BED records processed 30163878
[I::calc_avg_cov] sequence coverage stats:
[I::calc_avg_cov] sequence bases: 573897946
[I::calc_avg_cov] read bases: 1508193900
[I::calc_avg_cov] q drop: 0.100
[I::calc_avg_cov] average read coverage: 2.183
yahs: cov.c:474: calc_cov_norms: Assertion `k == div_ceil(p1, window)' failed.
Aborted

Please let me know if you would like more information!

c-zhou commented 1 year ago

Hello @conchoecia,

Was your BAM file sorted by read names? The difference is expected if the BAM file was unsorted or sorted by coordinates, in which case the alignment quality scores were not considered. The default value 10 (-q option) is applied for BED input and BAM input sorted by read names. See Note 1.

It seems the majority (20X) of your HiC reads was not mapped. You might double-check to see if this is the case and probably good to figure out the reason.

The error seems to be a bug triggered by a corner case. Unfortunately, I do not have enough information to locate it.

Best, Chenxi

conchoecia commented 1 year ago

Hi @c-zhou, the bam was sorted by coordinates. I will change my pipeline to sort by reads next time (I used samtools sort without -n).

These aren't my reads or my assembly, so there isn't much I can do. I expect that it is due to the assembly quality being poor, and based on CLR rather than HiFi reads. I consistently notice that Hi-C mapping is poorer in CLR-based assemblies of the highly heterozygous species I work on. The Hi-C map looks copacetic though, so the Hi-C library construction seems fine.

Please let me know if you would like any files to suss out the bug. Thank you for maintaining this software! :)

andreaschavez commented 1 year ago

Hi conchoecia: I'm not sure if this helps, but I had the same error when I used a BAM file that was made using a slightly different reference genome that the genome used in Yahs. My error went away once I used the correct reference genome in YAHS as the genome the BAM file was originally made from. Andreas

conchoecia commented 1 year ago

Hi @andreaschavez - Thanks for the suggestion. This wasn't the case, unfortunately, as I was using a pipeline that referenced the same fasta reference for both mapping and YAHS correction.

c-zhou commented 1 year ago

Hello @conchoecia,

Sorry for the delayed response. I was away in the past weeks.

I would appreciate it if you could share the data. The BIN file and the contig FASTA index file would be enough. I am not sure what the best way is for you to share them. My email address is cz370@cam.ac.uk if you want to discuss this by email.

Best, Chenxi