c-zhou / yahs

Yet another Hi-C scaffolding tool
MIT License
131 stars 19 forks source link

There is no longer a need to manually set the scale factor in Juicebox #94

Open zengxiaofei opened 2 months ago

zengxiaofei commented 2 months ago

Dear YaHS Developers and Users,

For large genomes, the .assembly and .hic files generated by YaHS are not fully compatible with Juicebox. Manually setting the scale factor in Juicebox may be necessary. Additionally, Juicebox cannot correctly import and parse a modified assembly (i.e., .review.assembly) in these scenarios. To address this, I made some modifications in juicer.c, asset.c and asset.h to eliminate the need for manually setting the scale factor in Juicebox for YaHS.

This issue arises because the method for calculating the scale factor in YaHS differs from that of Juicebox. Juicebox uses 1 + genome_size / 2,100,000,000 as the scale factor, while YaHS uses the smallest n that fulfills genome_size / 2^n < INT_MAX, resulting in Juicebox being unable to infer the correct scale factor. For example, a genome with a size of 9 Gb will have a scale factor of 5 in Juicebox, whereas YaHS will calculate a scale factor of 2^3 = 8.

I also made another modification for the MAPQ filtering function in juicer.c. In the original version, MAPQ filtering is enabled only when the BAM is queryname sorted. However, the filtering process should also support unsorted BAM files.

The modified version of juicer.c, asset.c and asset.h is available at: https://github.com/zengxiaofei/yahs. Please feel free to use it in your work. I can make a pull request if the developers think it appropriate to merge these changes.

Best regards,

Xiaofei

gunjanpandey commented 1 month ago

Thanks for the wonderful software @c-zhou, and this update @zengxiaofei

The .hic and .assembly files are opening normally (see the figure below), and there seems to be no need to manually set the scale factor. But after manual curation, juicer post is not generating correct sized fasta file - unless I have done something wrong.

Could you please have a look at my script and let me know what went wrong?

#Step 1. create symlink and index the genome
ln -sf "${draft_genome}" contigs.fa
samtools faidx contigs.fa

# Step 2. Perform alignment
bwa-mem2 index contigs.fa
bwa-mem2 mem -t 64 -5SP contigs.fa "${Read1}" "${Read2}" | ${samblaster}/samblaster | \
    samtools view -S -h -b -F 2316 -@4 -m 20G - | samtools sort -n -@4 -m 20G -o bwa_aligned.bam -

# Step 3. Filter alignment
${matlock}/matlock2 bamfilt -i bwa_aligned.bam -o bwa_aligned.filt.bam
ln -sf bwa_aligned.filt.bam hic-to-contigs.bam

# Step 4. Generate HiC file
yahs -e "${enzymes}" -l 10000 contigs.fa hic-to-contigs.bam

# Step 5. Finalize with Juicer
juicer_pre -a -o out_JBAT *.bin yahs.out_scaffolds_final.agp contigs.fa.fai > out_JBAT.log 2>&1
java -Xmx200g -jar /softwares/juicer/CPU/common/juicer_tools_1.22.01.jar pre \
    --threads 64 out_JBAT.txt out_JBAT.hic.part <(cat out_JBAT.log | grep PRE_C_SIZE | awk '{print $2" "$3}') && \
    mv out_JBAT.hic.part out_JBAT.hic

This generated a nice HiC map, which I had to modify a little.

image
#The review assembly file was exported as `out_JBAT.review.assembly` and following command was run.
juicer post -o out_JBAT2 out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa

While I was expecting a final fasta file of the genome is around 5.8G, the last command generated a file of 399G.

Below are the input contig, intermediate scaffold and final scaffold fasta files.

Could you please let me know what went wrong?

file                         format  type  num_seqs  sum_len          min_len     avg_len        max_len
contigs.fa                   FASTA   DNA   2,998     5,874,639,565    18,744      1,959,519.5    729,307,306
yahs.out_scaffolds_final.fa  FASTA   DNA   2,640     5,874,825,965    1,000       2,225,312.9    1,436,776,977
out_JBAT2.FINAL.fa           FASTA   DNA   2,641     399,381,338,732  24,642,871  151,223,528.5  249,526,050

Thanks in advance.

zengxiaofei commented 1 month ago

Hi @gunjanpandey,

You are currently using an outdated version of juicer_pre instead of my modified version of juicer. Please download and compile my updated version from this link: https://github.com/zengxiaofei/yahs. After that, follow the instructions provided here to generate the .hic and .assembly files.

Best wishes, Xiaofei

gunjanpandey commented 1 month ago

Hi @zengxiaofei,

Thanks for your prompt reply.

When I try to run the pipeline with the latest version (v1.2.1) released 3 days back, I am getting this error.

(java -jar -Xmx320G juicer_tools.1.9.9_jcuda.0.8.jar  pre alignments_sorted.txt out.hic.part scaffolds_final.chrom.sizes) \
&& (mv out.hic.part out.hic)                                         

Not including fragment map                                                                                                       
Start preprocess                                                                                                                 
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or beca
use all reads map to the same fragment.
        at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
        at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
        at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
        at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
        at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
        at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
        at juicebox.tools.HiCTools.main(HiCTools.java:96)
zengxiaofei commented 1 month ago

I am not the author of YaHS; you may need to ask @c-zhou by creating a new issue. My modified version of juicer is deposited in another repository based on YaHS version YaHS-1.2a.1.patch, which is several commits behind the latest v1.2.1. The author has not yet merged my changes into the latest version.

c-zhou commented 1 month ago

Hi @gunjanpandey,

Your commands seem all good to me. Something might be wrong with the juicer post input files. I can help have a look if you could share those files - out_JBAT.review.assembly, out_JBAT.liftover.agp and contigs.fa.fai (index file is OK, I do not need the sequence file). You can either put them here or email me at cz370@cam.ac.uk.

For the error you saw when running juicer_tools.1.9.9_jcuda.0.8.jar, have you checked if there are data in the alignments_sorted.txt file? If yes, it is probably a juicer_tools bug.

Best, Chenxi

c-zhou commented 1 month ago

Hi Xiaofei @zengxiaofei,

Thank you very much for the updates. They are very useful.

For scaling of the genome size, I have not got a chance to test it yet so did not include it in the new release. May I ask if this applies to all juicer_tools versions? I would certainly like to add this to YaHS at some point.

For the quality score filtering in coordinate-sorted BAMs, the problem is that we cannot know the mapping quality of the mate read when you process a SAM record. For example, if the mapping qualities of the two reads are 10 and 0, we will only see mapping quality 10 when reading the first SAM record. However, we would not want to include it, as the mapping quality of the second read is 0. Because of this, we decided not to allow quality filtering for coordinate-sorted BAMs.

Best, Chenxi

gunjanpandey commented 1 month ago

Hi @c-zhou,

Could you please help me understand the reason for the error below and how to fix it? My scaffolds_final.chrom.sizes file is tsv, and other two input files are also generated normally. And I am running your latest version. Even with @zengxiaofei juicer version the problem persists.

I did not have this problem with your earlier version but there was a different problem. Is there a fix for that?

(java -jar -Xmx320G juicer_tools.1.9.9_jcuda.0.8.jar  pre alignments_sorted.txt out.hic.part scaffolds_final.chrom.sizes) \
&& (mv out.hic.part out.hic)                                         

Not including fragment map                                                                                                       
Start preprocess                                                                                                                 
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or beca
use all reads map to the same fragment.
        at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
        at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
        at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
        at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
        at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
        at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
        at juicebox.tools.HiCTools.main(HiCTools.java:96)
c-zhou commented 1 month ago

Hi @gunjanpandey,

Have you checked your alignment_sorted.txt file to see if it contains any data?

When I refer to juicer_tools, I mean the Java program developed by the JuiceBox team, not the juicer program coming with YaHS, and unfortunately, I cannot offer much insight into the bug. It has been reported by several users and seems to occur quite unpredictably. You may have better luck using a different version of juicer_tools. You are using version 1.9.9, you could try switching to version 1.22.01, as you did when generating the curation file.

Best, Chenxi

c-zhou commented 1 month ago

Also @gunjanpandey, have you solved the unexpected file (out_JBAT2.FINAL.fa) size problem in juicer post? Chenxi

gunjanpandey commented 1 month ago

thanks @c-zhou, Both problems exist. I am playing with the pipeline now and will update. For the juicer tools (v 1.9.9), I will play with the latest version.

Meanwhile, if you could suggest anything for the file size, out_JBAT2.FINAL.fa / juicer post post would be great.

If I sort it out, I will update.

c-zhou commented 1 month ago

Your commands seem all good to me. Something might be wrong with the juicer post input files. I can help have a look if you could share those files - out_JBAT.review.assembly, out_JBAT.liftover.agp and contigs.fa.fai (index file is OK, I do not need the sequence file). You can either put them here or email me at cz370@cam.ac.uk.

Hi @gunjanpandey, if you could share these files, I can help have a look. Chenxi

c-zhou commented 1 month ago

Hello @gunjanpandey,

Which version of juicer post are you using? I discovered a bug in the latest version (v1.2.1), but it should be fine with version v1.2 and earlier.

Best, Chenxi

gunjanpandey commented 1 month ago

I was using V1.2.1. I will try with an earlier version. Thanks for your prompt reply.

I am trying to sort out the problem. If I am not able to, I will share the files. Otherwise post the fix.

I am also running AllHiC and 3Ddna, which are working fine, but this program gave me the best-looking map and I would like to proceed with it for finalizing the genome. Interestingly, It is aggressive in contig breaking undefault settings. Mabs has generate all 11 ungapped t2t chromosomes from hic, hifi and ul-ont data, but this program still breaks contigs. I am keen to get assembly from this, gapfill with hifi or ul-ont (and polish if need), and compare it with other programs.

c-zhou commented 1 month ago

For the file size problem, you do not need to redo everything from the scratch, only need to rerun the juicer post command (version 1.2 for example). The assembly results from v1.2 and v1.2.1 would be in general identical.

YaHS is sometimes quite sensitive to the HiC signal changes along the contigs which will be considered as assembly errors. This mostly depends on the HiC library you have. If you believe it is too aggressive in breaking contigs, you could try to run it with the --no-contig-ec option.

Best, Chenxi

zengxiaofei commented 1 month ago

Hi Chenxi,

May I ask if this applies to all juicer_tools versions?

I have not tested different versions of juicer_tools yet. But I think they should be compatible, the main compatibility issue comes from Juicebox. Juicebox will automatically infer the scale factor based on the genome size in the .assembly file and the genome size in the .hic file.

For the quality score filtering in coordinate-sorted BAMs, the problem is that we cannot know the mapping quality of the mate read when you process a SAM record. For example, if the mapping qualities of the two reads are 10 and 0, we will only see mapping quality 10 when reading the first SAM record. However, we would not want to include it, as the mapping quality of the second read is 0. Because of this, we decided not to allow quality filtering for coordinate-sorted BAMs.

Yes, I understand that MAPQ filtering is not compatible with coordinate-sorted BAM files. According to the SAM Format Specification, valid values of sorting order of alignments (SO) include unknown, unsorted, queryname, and coordinate. Here I have enabled MAPQ filtering for unsorted BAM. The SAM files produced by bwa mem are now marked as unsorted (https://github.com/lh3/bwa/pull/336), where paired reads appear in adjacent lines, which should also be compatible with MAPQ filtering. However, if you feel this is not appropriate, you can leave it unchanged.

Best regards, Xiaofei