Open zengxiaofei opened 2 months ago
Thanks for the wonderful software @c-zhou, and this update @zengxiaofei
The .hic and .assembly files are opening normally (see the figure below), and there seems to be no need to manually set the scale factor. But after manual curation, juicer post
is not generating correct sized fasta file - unless I have done something wrong.
Could you please have a look at my script and let me know what went wrong?
#Step 1. create symlink and index the genome
ln -sf "${draft_genome}" contigs.fa
samtools faidx contigs.fa
# Step 2. Perform alignment
bwa-mem2 index contigs.fa
bwa-mem2 mem -t 64 -5SP contigs.fa "${Read1}" "${Read2}" | ${samblaster}/samblaster | \
samtools view -S -h -b -F 2316 -@4 -m 20G - | samtools sort -n -@4 -m 20G -o bwa_aligned.bam -
# Step 3. Filter alignment
${matlock}/matlock2 bamfilt -i bwa_aligned.bam -o bwa_aligned.filt.bam
ln -sf bwa_aligned.filt.bam hic-to-contigs.bam
# Step 4. Generate HiC file
yahs -e "${enzymes}" -l 10000 contigs.fa hic-to-contigs.bam
# Step 5. Finalize with Juicer
juicer_pre -a -o out_JBAT *.bin yahs.out_scaffolds_final.agp contigs.fa.fai > out_JBAT.log 2>&1
java -Xmx200g -jar /softwares/juicer/CPU/common/juicer_tools_1.22.01.jar pre \
--threads 64 out_JBAT.txt out_JBAT.hic.part <(cat out_JBAT.log | grep PRE_C_SIZE | awk '{print $2" "$3}') && \
mv out_JBAT.hic.part out_JBAT.hic
This generated a nice HiC map, which I had to modify a little.
#The review assembly file was exported as `out_JBAT.review.assembly` and following command was run.
juicer post -o out_JBAT2 out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa
While I was expecting a final fasta file of the genome is around 5.8G, the last command generated a file of 399G.
Below are the input contig, intermediate scaffold and final scaffold fasta files.
Could you please let me know what went wrong?
file format type num_seqs sum_len min_len avg_len max_len
contigs.fa FASTA DNA 2,998 5,874,639,565 18,744 1,959,519.5 729,307,306
yahs.out_scaffolds_final.fa FASTA DNA 2,640 5,874,825,965 1,000 2,225,312.9 1,436,776,977
out_JBAT2.FINAL.fa FASTA DNA 2,641 399,381,338,732 24,642,871 151,223,528.5 249,526,050
Thanks in advance.
Hi @gunjanpandey,
You are currently using an outdated version of juicer_pre
instead of my modified version of juicer
. Please download and compile my updated version from this link: https://github.com/zengxiaofei/yahs. After that, follow the instructions provided here to generate the .hic and .assembly files.
Best wishes, Xiaofei
Hi @zengxiaofei,
Thanks for your prompt reply.
When I try to run the pipeline with the latest version (v1.2.1) released 3 days back, I am getting this error.
(java -jar -Xmx320G juicer_tools.1.9.9_jcuda.0.8.jar pre alignments_sorted.txt out.hic.part scaffolds_final.chrom.sizes) \
&& (mv out.hic.part out.hic)
Not including fragment map
Start preprocess
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or beca
use all reads map to the same fragment.
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
at juicebox.tools.HiCTools.main(HiCTools.java:96)
I am not the author of YaHS; you may need to ask @c-zhou by creating a new issue. My modified version of juicer
is deposited in another repository based on YaHS version YaHS-1.2a.1.patch, which is several commits behind the latest v1.2.1. The author has not yet merged my changes into the latest version.
Hi @gunjanpandey,
Your commands seem all good to me. Something might be wrong with the juicer post
input files. I can help have a look if you could share those files - out_JBAT.review.assembly
, out_JBAT.liftover.agp
and contigs.fa.fai
(index file is OK, I do not need the sequence file). You can either put them here or email me at cz370@cam.ac.uk.
For the error you saw when running juicer_tools.1.9.9_jcuda.0.8.jar
, have you checked if there are data in the alignments_sorted.txt
file? If yes, it is probably a juicer_tools bug.
Best, Chenxi
Hi Xiaofei @zengxiaofei,
Thank you very much for the updates. They are very useful.
For scaling of the genome size, I have not got a chance to test it yet so did not include it in the new release. May I ask if this applies to all juicer_tools versions? I would certainly like to add this to YaHS at some point.
For the quality score filtering in coordinate-sorted BAMs, the problem is that we cannot know the mapping quality of the mate read when you process a SAM record. For example, if the mapping qualities of the two reads are 10 and 0, we will only see mapping quality 10 when reading the first SAM record. However, we would not want to include it, as the mapping quality of the second read is 0. Because of this, we decided not to allow quality filtering for coordinate-sorted BAMs.
Best, Chenxi
Hi @c-zhou,
Could you please help me understand the reason for the error below and how to fix it? My scaffolds_final.chrom.sizes file is tsv, and other two input files are also generated normally. And I am running your latest version. Even with @zengxiaofei juicer version the problem persists.
I did not have this problem with your earlier version but there was a different problem. Is there a fix for that?
(java -jar -Xmx320G juicer_tools.1.9.9_jcuda.0.8.jar pre alignments_sorted.txt out.hic.part scaffolds_final.chrom.sizes) \
&& (mv out.hic.part out.hic)
Not including fragment map
Start preprocess
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or beca
use all reads map to the same fragment.
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
at juicebox.tools.HiCTools.main(HiCTools.java:96)
Hi @gunjanpandey,
Have you checked your alignment_sorted.txt
file to see if it contains any data?
When I refer to juicer_tools
, I mean the Java program developed by the JuiceBox team, not the juicer
program coming with YaHS, and unfortunately, I cannot offer much insight into the bug. It has been reported by several users and seems to occur quite unpredictably. You may have better luck using a different version of juicer_tools. You are using version 1.9.9, you could try switching to version 1.22.01, as you did when generating the curation file.
Best, Chenxi
Also @gunjanpandey, have you solved the unexpected file (out_JBAT2.FINAL.fa
) size problem in juicer post
? Chenxi
thanks @c-zhou, Both problems exist. I am playing with the pipeline now and will update. For the juicer tools (v 1.9.9), I will play with the latest version.
Meanwhile, if you could suggest anything for the file size, out_JBAT2.FINAL.fa / juicer post
post would be great.
If I sort it out, I will update.
Your commands seem all good to me. Something might be wrong with the juicer post input files. I can help have a look if you could share those files - out_JBAT.review.assembly, out_JBAT.liftover.agp and contigs.fa.fai (index file is OK, I do not need the sequence file). You can either put them here or email me at cz370@cam.ac.uk.
Hi @gunjanpandey, if you could share these files, I can help have a look. Chenxi
Hello @gunjanpandey,
Which version of juicer post are you using? I discovered a bug in the latest version (v1.2.1), but it should be fine with version v1.2 and earlier.
Best, Chenxi
I was using V1.2.1. I will try with an earlier version. Thanks for your prompt reply.
I am trying to sort out the problem. If I am not able to, I will share the files. Otherwise post the fix.
I am also running AllHiC and 3Ddna, which are working fine, but this program gave me the best-looking map and I would like to proceed with it for finalizing the genome. Interestingly, It is aggressive in contig breaking undefault settings. Mabs has generate all 11 ungapped t2t chromosomes from hic, hifi and ul-ont data, but this program still breaks contigs. I am keen to get assembly from this, gapfill with hifi or ul-ont (and polish if need), and compare it with other programs.
For the file size problem, you do not need to redo everything from the scratch, only need to rerun the juicer post
command (version 1.2 for example). The assembly results from v1.2 and v1.2.1 would be in general identical.
YaHS is sometimes quite sensitive to the HiC signal changes along the contigs which will be considered as assembly errors. This mostly depends on the HiC library you have. If you believe it is too aggressive in breaking contigs, you could try to run it with the --no-contig-ec
option.
Best, Chenxi
Hi Chenxi,
May I ask if this applies to all juicer_tools versions?
I have not tested different versions of juicer_tools yet. But I think they should be compatible, the main compatibility issue comes from Juicebox. Juicebox will automatically infer the scale factor based on the genome size in the .assembly file and the genome size in the .hic file.
For the quality score filtering in coordinate-sorted BAMs, the problem is that we cannot know the mapping quality of the mate read when you process a SAM record. For example, if the mapping qualities of the two reads are 10 and 0, we will only see mapping quality 10 when reading the first SAM record. However, we would not want to include it, as the mapping quality of the second read is 0. Because of this, we decided not to allow quality filtering for coordinate-sorted BAMs.
Yes, I understand that MAPQ filtering is not compatible with coordinate-sorted BAM files. According to the SAM Format Specification, valid values of sorting order of alignments (SO) include unknown, unsorted, queryname, and coordinate. Here I have enabled MAPQ filtering for unsorted BAM. The SAM files produced by bwa mem are now marked as unsorted (https://github.com/lh3/bwa/pull/336), where paired reads appear in adjacent lines, which should also be compatible with MAPQ filtering. However, if you feel this is not appropriate, you can leave it unchanged.
Best regards, Xiaofei
Dear YaHS Developers and Users,
For large genomes, the .assembly and .hic files generated by YaHS are not fully compatible with Juicebox. Manually setting the scale factor in Juicebox may be necessary. Additionally, Juicebox cannot correctly import and parse a modified assembly (i.e.,
.review.assembly
) in these scenarios. To address this, I made some modifications injuicer.c
,asset.c
andasset.h
to eliminate the need for manually setting the scale factor in Juicebox for YaHS.This issue arises because the method for calculating the scale factor in YaHS differs from that of Juicebox. Juicebox uses
1 + genome_size / 2,100,000,000
as the scale factor, while YaHS uses the smallestn
that fulfillsgenome_size / 2^n < INT_MAX
, resulting in Juicebox being unable to infer the correct scale factor. For example, a genome with a size of 9 Gb will have a scale factor of5
in Juicebox, whereas YaHS will calculate a scale factor of2^3 = 8
.I also made another modification for the MAPQ filtering function in
juicer.c
. In the original version, MAPQ filtering is enabled only when the BAM is queryname sorted. However, the filtering process should also support unsorted BAM files.The modified version of
juicer.c
,asset.c
andasset.h
is available at: https://github.com/zengxiaofei/yahs. Please feel free to use it in your work. I can make a pull request if the developers think it appropriate to merge these changes.Best regards,
Xiaofei