Closed Astahlke closed 2 years ago
Hi Amanda,
Now juicer_pre
supports Juicebox assembly mode with -a
parameter. For example, ./juicer_pre -a -o test out.bin scaffolds_final.agp contig.fa.fai
, where scaffolds_final.agp
is the output of YaHS (it is fine to run without --no-contig-ec
). This results in four output files,
test.assembly
. Not a real AGP file as there are no gaps.You can run juicer_tools pre
with test.txt
file - no need to run run-assembly-visualizer.sh
. There is a line like [I::main] JUICER_PRE CMD: java -Xmx36G -jar ${juicer_tools} pre test.txt test.hic <(echo "assembly 183277074")
in the log telling you how to run juicer_tools
. The <(echo "assembly 183277074")
part can be replaced by a chromosome size file containing only one line - assembly 183277074
. Generally, .hic
file in assembly mode should contain only one scaffold, here we call it assembly
. There is another line like PRE_C_SIZE: assembly 183277074
in the log telling you the size of the scaffold assembly
. The scripts/run_yahs.sh
was updated to include an example to run the assembly mode at the end.
Best, Chenxi
I should have also mentioned that,
juicer_tools pre
, it looks to me the latest version 1.22.01 is way slower than the older versions. I changed to v1.9.9 in the example script.run-assembly-visualizer.sh
script. This will lead to a genome assembly smaller than 2G. This is reflected in the .assembly
file as well. But the .assembly.agp
file contains the original size. Please remember to take it into account when you try to recover the FASTA file after editing.Chenxi
Hi Chengxi and Amanda,
I have nicely produced an editable combination of .hic and .assembly files for two different vertebrate genomes (using the updated juicer_pre with -a option, etc.). They work pretty well with Juicebox, i can see the "contigs/scaffolds" within each super-scaffold.
I made some edits in one of them (fixing a few within-scaffold translocations and inversions). The I saved the .review.assembly file. However,I would like to generate a new fasta from it and this is hard without the mergenodups.txt file. Should I do something different to this? ./3d-dna/ run-asm-pipeline-post-review.sh –r draft.final.review.assembly draft.fasta mergednodups.txt
Ideally, i would like to generate .assembly, .agp and .hic of every round of curation documenting each decision taken.
Sorry, I am neophyte curator and I feel like I do have loads of related questions that i should post somewhere else. But here a few:
How to produce the new fasta after reviewing the assembly.
Remappings for curation against _yahs_scaffoldsfinal.fa should include supplemental and secondary alignments. Once they are done can i just get the binary file, .assembly and .hic? how?
Could we use yahs on a different assembly (e.g. salsa2) to simply generate a .assembly and .hic for juicebox?
Thanks for the help, Fernando
Hi Fernando,
I never used Juicebox curation mode, so I do not know the answers to those questions related. Sorry about that. But for the other questions, I will try to answer.
_2. Remappings for curation against yahs_scaffoldsfinal.fa should include supplemental and secondary alignments. Once they are done can i just get the binary file, .assembly and .hic? how?
From my understanding, I do not think you should map against the scaffolds if you need an editable .hic file. The Juicebox assembly mode always requires the hic reads mapped to contigs. I guess the best way to do this is to generate an AGP file for curated scaffolds and use it as input to juicer_pre
. I mean still use ./juicer_pre -a -o test out.bin scaffolds_final.agp contig.fa.fai
. The only difference is to change the AGP file. If you want to generate a non-editable .hic file, you can run juicer_pre
without -a
option. But in this case, the output text file needs to be manually sorted by the second and then sixth column (see the example in scripts/run_yahs.sh
). Besides, YaHS does accept BAM files as input. The supplementary and secondary alignments will be all ignored. As well as marked duplicates.
3. Could we use yahs on a different assembly (e.g. salsa2) to simply generate a .assembly and .hic for juicebox?
Yes and no. As long as you have an AGP file, you should be fine to run juicer_pre
. However, the AGP file from SALSA2 needs some preprocessing when there are contig breaks. For example, if you have a break ptg000087l_1 3548799
, where the first column is the contig id and the second column is the break position, then there will be two lines in the final AGP file. Something like this,
scaffold_5 34451147 37999945 7 W ptg000087l_1_1 1 3548799 + scaffold_16 19587586 21092835 5 W ptg000087l_1_2 1 1505250 -
You see the split contig ended up with two new contigs ptg000087l_1_1
and ptg000087l_1_2
.
There are two ways to fix this - either remap reads to a contig FASTA file with old contigs replaced by these new contigs or edit the AGP file to a standard format. In this case, it should be something like,
scaffold_5 34451147 37999945 7 W ptg000087l_1 1 3548799 + scaffold_16 19587586 21092835 5 W ptg000087l_1 3548800 5054049 -
Best, Chenxi
These directions to use the underlying contigs work great, Chenxi. Thanks for the enhancement and additional tips - I really appreciate it!
Hi,
I am consistently getting this error for a mammalian genome (2.5-3Gb) when I try to get the .hic file using juicer_tools. java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment. This is a juicer error, that at some point seemed to be solvable with an upgrade (https://github.com/aidenlab/juicer/issues/231#issuecomment-879424793)
I realized i was using the latest version (Juicer Tools Version 1.22.01). Just in case this morning i did a created a new Juicer install where I basically copied previous Juicer installation and replaced juicer_tools.jar in the juicer scripts folder with link to v1.9.9.
wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar ; ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicer_tools.jar
Same error. Although contacts should be there, because the resulting genome assembly is extremely good, by aligning it to a close species chrom-level assembly i see very few rearrangements and it has very similar chromosome sizes (super-scaffolds). I would like to inspect it and verify a couple of inversions. Do you have any idea on how to fix this problem for juicebox compatibility? I know it goes a bit beyond yahs support...but it would be necessary to curate and publish these genomes.
Perhaps is due to the way I am running yahs, not exactly using run_yahs.sh build my own .sh placing the commands step by step (see below). However for other two genomes 1-1.6Gb I got the .hic and .assembly editable files.
Thanks, Fernando
PS. Did not find time yet to get a .fasta from the .review. Want to check mapping all reads to the input assembly with juicer and see if the mnd file works. I'll update on this here as soon as i can
run_yash_fernando.sh _CWD=$PWD;
export PATH=/scratch/project/devel/aateam/src/yahs_2021_12_21:$PATH;
mkdir -p out;
yahs assembly.fa mapped.PT.name_sorted.bam -o out/yahs
cd out;
#Loading juicer to tolerate -a assembly option
#Example: juicer_pre -a -o test out.bin scaffolds_final.agp contig.fa.fai
#Load Juicer
#LOAD MODULES: COREUTILS essential for paralell sort. JAVA to run juicer_tools
module purge; module load gcc/6.3.0 PYTHON/2.7.11 java/1.8.0u31 COREUTILS/8.21;
source /home/devel/fcruz/bin/sourcefiles/initialize_conda.sh; conda activate /home/devel/talioto/miniconda3/envs/juicer
#also add juicer_tools to the PATH
export PATH=/scratch/project/devel/aateam/src/3d-dna/:/scratch/project/devel/aateam/src/juicer1.6/misc/:/scratch/project/devel/aateam/src/juicer1.6/SLURM/scripts/:$PATH
juicer_pre -a -o yahs_juicebox yahs.bin yahs_scaffolds_final.agp ../assembly.fa.fai
export PATH=/scratch/project/devel/aateam/bin/:$PATH;
asmlen=$(fastalength yahs_scaffolds_final.fa | gawk '{t+=$1}END{print t}')
echo $asmlen ;
java -jar -Xmx155G /scratch/project/devel/aateam/src/juicer1.6/SLURM/scripts/juicer_tools.jar pre --threads 24 yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly $asmlen")_
Hello Fernando,
Sorry for the late reply. I was not very active during the break.
I guess your problem here is $asmlen
. You cannot use the original assembly size. The maximum that Juicebox can handle is 2Gb, and apparently, your assembly exceeds this limit. What you should use is the scaled size which can be found in the juicer_pre
log file. As aforementioned should be something like PRE_C_SIZE: assembly 183277074
. Alternatively, you can calculate from the assembly
file where the contig sizes are already scaled.
Best, Chenxi
Thanks Chenxi,
I wasn't aware of the 2Gb limit for juicebox (I guess is just to allow visualization). Sorry, because you left all this fairly written in scripts/run_yash.sh
It is a bit annoying to follow up the original coordinates in the scaled view, but now it works and i can make edits on genomes larger than 2Gb!
In my case the scaling is about half the assembly length (2424933882 bp), thus PRE_C_SIZE: assembly 1212463441 - Do this makes sense to you?
I am a bit confused with a small detail. So far I was passing to juicer_pre the contigs.fai and then running juicer_tools as stated above (https://github.com/c-zhou/yahs/issues/4#issuecomment-997497823) without using chrom.sizes at all...
java -jar -Xmx155G juicer_tools.jar pre --threads 24 yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly 1212463441")
But I noticed you also pass a file with the scaled chrom sizes instead the total PRE_C_SIZE assembly.
Why is this? Do this two approaches have different implications?
approach 1 will be dierctly feeding the total scaled length (what i did now): asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3) (${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.chrom.sizes) && (mv ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.hic)
approach 2 what your wrote at the end of the run_yash.sh:
../juicer_pre -a -o ${outdir}/${out}_JBAT ${outdir}/${out}.bin ${outdir}/${out}_scaffolds_final.agp ${contigs}.fai 2>${outdir}/tmp_juicer_pre_JBAT.log cat ${outdir}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- >${outdir}/${out}_JBAT.chrom.sizes (${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.chrom.sizes) && (mv ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.hic)
Thanks again, Fernando
In my case the scaling is about half the assembly length (2424933882 bp), thus PRE_C_SIZE: assembly 1212463441 - Do this makes sense to you?
Yes, it makes sense. juicer_pre
does something slightly different from the JuiceBox run-assembly-visualizer.sh
script where it will always be scaled to 2.1Gb. So for your 2424933882bp assembly, the scale factor used to recover the original coordinates will be 2424933882/2100000000=1.155. juicer_pre
will use a number of a power of 2. In your case, it is 2. I do this mainly because it is easier to scale the coordinates back.
For the other question, cat ${outdir}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- >${outdir}/${out}_JBAT.chrom.sizes; ${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.chrom.sizes
and ${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part <(cat ${outdir}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2-)
are actually equvilent. It is a bash trick. In the first case, you write a chromosome size file and use as the last parameter for juicer_tools
. In the second case, you stream in the chromosome sizes to the command from STDIN (what <
does).
Best, Chenxi
I justed noticed the way you did it,
asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3) (${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.chrom.sizes) && (mv ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.hic)
Did you also generate the ${outdir}/${out}_JBAT.chrom.sizes
file as what run_yash.sh
does? Otherwise, it does not make sense to me. This will work though (with STDIN stream as aforementioned),
asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3) (${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part <(echo "assembly ${asmlen}")) && (mv ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.hic)
Chenxi
Hi Chenxi,
Thanks for your quick reply.
You're right for approach 1 does not makes sense the way i wrote it. I copied the lines wrong. What I exactly did was: java -jar -Xmx155G juicer_tools.jar pre --threads 24 yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly 1212463441")
So I am just passing the total scaled length to juicer_tools.jar. For another genome I previously stored the scaled size in $asmlen (see below) and pass it the same way with the echo. I was not using the .chrom.sizes at all.
To store asmlen: asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3)
My question is if i should pass all the scaled scaffold lengths (.chrom.sizes) or it works well using <(echo "assembly $asmlen") instead. That is what i understood here https://github.com/c-zhou/yahs/issues/4#issuecomment-997497823 when you wrote:
java -Xmx36G -jar ${juicer_tools} pre test.txt test.hic <(echo "assembly 183277074")
Does it work equally by passing the total scaled length?
Thanks, Fernando
Yes, the total scaled size should be used. java -jar -Xmx155G juicer_tools.jar pre --threads 24 yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly 1212463441")
is the right way to do it. In terms of run_yahs.sh
, cat ${outdir}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- >${outdir}/${out}_JBAT.chrom.sizes
will write a single line assembly 1212463441
to the .chrom.sizes file, so it is the same as <(echo "assembly $asmlen")
.
Chenxi
Great,
I wasn't understanding well the run_yahs.sh command.
Thanks for clarifying, Fernando
Hi @c-zhou,
thank you for this very nice tool and the implementation for editable hic files compatible with juicebox.
Right now, I am still struggling to create the fasta file after manual curation in juicebox.
Do you, @Astahlke or @gitcruz, have an idea how to achieve this?
The standard workflow I used before is to export the curated assembly file from juicebox and run run-asm-pipeline-post-review.sh
from 3d-dna. To do so, one needs (besides the fasta and the reviewed assembly file) the information of the Hi/Omni-C read mapping as merged_nodups.txt
file which is produced by juicer. After checking how this merged_nodups.txt
file is created in juicer, I tried to convert the bam file but run-asm-pipeline-post-review.sh
crashes and I can't figure out why.
I would be very thankful for any help.
Hi @schellt @Astahlke @gitcruz,
In the latest commit https://github.com/c-zhou/yahs/commit/63baff3f2c82bc4d44b38cf6730a7de352e66239 and https://github.com/c-zhou/yahs/commit/c854cb764cb3f50056d576baf4c208e4cb7d99fa, I tried to make YaHS more friendly to manual editing with Juicebox JBAT. Changes including,
juicer_pre
tool was renamed to juicer
. Now juicer pre
command is equivalent to juicer_pre
.juicer post
command to generate AGP and FASTA files with the review assembly file (generated by JBAT) Please check Manual curation with Juicebox (JBAT) section in README for detailed information.
I should also point out that, this version is not fully compatible with the previous versions. However, if your total genome assembly size is smaller than 2Gb (where the scale factor is 1), the review assembly file generated by JBAT is all good. You only need to rerun juicer pre
command to create those files and run juicer post
with the review assembly file. If your genome size is larger than 2Gb, you might need to redo the curation. Alternatively (in case it is way too much work to redo the curation), you can manually edit the review assembly file to change the size of contigs (the third column of those lines starting with >
) to match those in the liftover.agp
file generated by juicer pre
. The contig size in the liftover file should be scale factor
times the size in your review assembly (this is not always true due to integer division rounding, and in this case, do make sure the size matches). The contig breaks created in manual editing might be something you need to pay special attention to, where you will find debris lines.
I only did some simple tests, if you find any problems or have any further questions, please let me know.
Best, Chenxi
Hi @c-zhou , thank you very much for this implementation. I just curated a ~700Mb assembly with Juicebox from scratch and everything worked like a charm.
Hi, Thank you so much for this tool, it is really fast and gives very good results compared to the other tools available!
I am trying to generate the Hi-C contact map to be visualised in JBAT. The first step of converting the HiC alignment file (BIN) to a file required by juicer_tools worked well with the following command:
(juicer pre ../yahs.out.bin ../yahs.out_scaffolds_final.agp ../assembly.fasta.fai | sort -k2,2d -k6,6d -T ./ --parallel=32 -S100G | awk 'NF' > alignments_sorted.txt.part) && (mv alignments_sorted.txt.part alignments_sorted.txt)
However, when trying to go ahead with the following command:
(java -jar -Xmx100G ~/bin/juicer_tools.1.9.9_jcuda.0.8.jar pre alignments_sorted.txt out.hic.part ../yahs.out_scaffolds_final.fa.fai) && (mv out.hic.part out.hic)
I obtain this error:
Not including fragment map
Start preprocess
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
at juicebox.tools.HiCTools.main(HiCTools.java:96)
I tried to run the command both using juicer_tools.1.9.9_jcuda.0.8.jar and the last version of juicer_tools.jar, and the error is always the same. I am posting here because I see that @gitcruz got the same error, but I am not running the full pipeline (run_yahs.sh), I am instead proceeding step by step. I know that this error comes from juicer, but I do not really understand what I am doing wrong. I do not think it is a matter of genome size, since my genome should be around 135Mb.
I really appreciate any help! Thank you in advance.
Best regards,
Lia
Hello Lia,
Thanks for using YaHS.
In this command java -jar -Xmx100G ~/bin/juicer_tools.1.9.9_jcuda.0.8.jar pre alignments_sorted.txt out.hic.part ../yahs.out_scaffolds_final.fa.fai
, you can try to replace the last input ../yahs.out_scaffolds_final.fa.fai
with a chromosome size file, which can be generated by taking the first two columns of the index file, i.e, cut -f 1-2 ../yahs.out_scaffolds_final.fa.fai >../yahs.out_scaffolds_final.fa.chrom.sizes
. A safer way is to generate it from the juicer pre
log file as what the run_yahs.sh
script does (by grep "PRE_C_SIZE"
), which will deal with >2G chromosomes. Either way is find for your genome.
Best, Chenxi
Hi @c-zhou , thank you very much for this implementation. I just curated a ~700Mb assembly with Juicebox from scratch and everything worked like a charm.
Thanks @schellt for your feedback. Great to know it works. Chenxi
I will close this issue for now. Feel free to reopen it whenever needed.
Hi @LiaOb21,
I also remember having the error below
I obtain this error:
Not including fragment map Start preprocess Writing header Writing body java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment. at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650) at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419) at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832) at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582) at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346) at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116) at juicebox.tools.HiCTools.main(HiCTools.java:96)
This is more or less what worked for me to get rid of it:
Run Yahs again.
Hope it helps, Fernando
Hi @LiaOb21,
I also remember having the error below
I obtain this error:
Not including fragment map Start preprocess Writing header Writing body java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment. at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650) at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419) at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832) at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582) at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346) at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116) at juicebox.tools.HiCTools.main(HiCTools.java:96)
This is more or less what worked for me to get rid of it:
1. Make sure you install Juicer version 1.6 (or any other recommended buy chenxi). 2. Then inside this Juicer-for-Yahs/SLURM/scripts directory you should remove juicer_tools.jar 3. Install the recommended version inside Juicer-for-Yahs/SLURM/scripts by doing: _wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar_ 4. Then link it as juicer_tools.jar inside that directory doing _ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicer_tools.jar_ 5. Yahs will look for _juicer_tools.jar_ So before running Yahs export it to the PATH. export PATH=/path-to-Juicer-for-Yahs/SLURM/scripts/:$PATH
Run Yahs again.
Hope it helps, Fernando
Hi Fernando,
Thank you so much for your help. I resolved following the suggestion of @c-zhou about the chromosome size file. Thank you anyway.
Cheers,
Lia
Cool! Fernando
El mar., 21 jun. 2022 11:28, LiaOb21 @.***> escribió:
Hi @LiaOb21 https://github.com/LiaOb21,
I also remember having the error below
I obtain this error:
Not including fragment map Start preprocess Writing header Writing body java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment. at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650) at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419) at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832) at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582) at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346) at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116) at juicebox.tools.HiCTools.main(HiCTools.java:96)
This is more or less what worked for me to get rid of it:
Make sure you install Juicer version 1.6 (or any other recommended buy chenxi).
Then inside this Juicer-for-Yahs/SLURM/scripts directory you should remove juicer_tools.jar
Install the recommended version inside Juicer-for-Yahs/SLURM/scripts by doing: _wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar_
Then link it as juicer_tools.jar inside that directory doing _ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicertools.jar
Yahs will look for _juicertools.jar So before running Yahs export it to the PATH. export PATH=/path-to-Juicer-for-Yahs/SLURM/scripts/:$PATH
Run Yahs again.
Hope it helps, Fernando
Hi Fernando,
Thank you so much for your help. I resolved following the suggestion of @c-zhou https://github.com/c-zhou about the chromosome size file. Thank you anyway.
Cheers,
Lia
— Reply to this email directly, view it on GitHub https://github.com/c-zhou/yahs/issues/4#issuecomment-1161496288, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB34KVLGQSI5SZNHKFJVQFDVQGDLHANCNFSM5KKZ6YWA . You are receiving this because you were mentioned.Message ID: @.***>
Hi @c-zhou!!
This thread helped me a lot with creating the files for manual curation. However, after manual curation, I'm still having issues producing a final assembly for a reason that its not directly related to YaHS or the juicer version distributed with it, but I want to know if there's a way around with the post-curation steps.
Overall, the assembly looks good, but I have a lot of sequences that are outside of the contact map (see image). I ran the assembly with Hifiasm with the Hi-C integrated mode, and then ran purge_dups manually setting the cutoffs. After that I mapped the Hi-C reads to the assemblies and ran YaHS. I'm no expert in the matter, so I would like to know if there is a way to remove those sequences outside the contact map (sequences after the 1.5Gb position approx) after manual curation and produce a new Hi-C contact map without those sequences, without having to rerun the whole assembly pipeline again.
Thanks for your insights with this situation, and thanks for YaHS as well, awesome program!!!
Mateo
Is there any way to get the post review done without hic mapping like I use HapHic which is another scaffolder and so I don't have the merged_nodups.txt but I have reviewed my assembly and I have a Hi-C mapping file? or should I rerun juicer pre ?
Hi Mateo @malvaradol,
Sorry for the delayed reply. You can directly edit the AGP file if you want to remove some sequences from your assembly. For example, if you only want to keep the first 1000 scaffolds, you can find the last line of the scaffold_1000
in your AGP file and do a head
to only keep the rows up to that specific line, i.e., something like head -2000 OLD.agp >NEW.agp
. Once you get the new AGP file, you can generate a FASTA file with agp_to_fasta
and make new HiC maps as above.
A more general approach is to make a file for the list of scaffolds you want to keep, e.g., scaffod_1, scaffold_2,...., one scaffold per line, in the SCF.list
. Then you could run something like awk 'NR==FNR{s[$1]=1; next}{if(s[$1]) print}' SCF.list OLD.agp >NEW.agp
Regarding you HiC map, I am not sure what happened to those small fragments. They could be contaminations or could be very repetitive sequences that did not get aligned very well.
Best, Chenxi
Also for the people who have the error about that one particular fragment has a 0 value, you need to change the 0 to 1. So the contig/debris/fragment/entry in the AGP will be 1 base long and not 0 and this will make the script to get fasta from agp work well.
On Tue, Oct 22, 2024, 6:43 PM Chenxi Zhou @.***> wrote:
Hi Mateo @malvaradol https://github.com/malvaradol,
Sorry for the delayed reply. You can directly edit the AGP file if you want to remove some sequences from your assembly. For example, if you only want to keep the first 1000 scaffolds, you can find the last line of the scaffold_1000 in your AGP file and do a head to only keep the rows up to that specific line, i.e., something like head -2000 OLD.agp >NEW.agp. Once you get the new AGP file, you can generate a FASTA file with agp_to_fasta and make new HiC maps as above.
A more general approach is to make a file for the list of scaffolds you want to keep, e.g., scaffod_1, scaffold_2,...., one scaffold per line, in the SCF.list. Then you could run something like awk 'NR==FNR{s[$1]=1; next}{if(s[$1]) print}' SCF.list OLD.agp >NEW.agp
Regarding you HiC map, I am not sure what happened to those small fragments. They could be contaminations or could be very repetitive sequences that did not get aligned very well.
Best, Chenxi
— Reply to this email directly, view it on GitHub https://github.com/c-zhou/yahs/issues/4#issuecomment-2429053590, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TBF2O6TRAWOYQEVDS3Z4Y26PAVCNFSM6AAAAABQEKZBJOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMRZGA2TGNJZGA . You are receiving this because you commented.Message ID: @.***>
Thanks for the comment @Isoris.
Regarding your HiC map question above, you need to rerun juicer pre (with HiC alignment file and AGP file) and HiC map building (juicer_tools pre) whenever your AGP/FASTA file changed. I am not very familiar with HapHiC. What is the format of the HiC mapping file? If it is a BAM or BED, you should be able to seamlessly run juicer pre
with it.
Chenxi
Hello C-Zhou/Yahs, Yes finally I have got it to work. Thank you.
Le mar. 22 oct. 2024 à 19:59, Chenxi Zhou @.***> a écrit :
Thanks for the comment @Isoris https://github.com/Isoris.
Regarding your HiC map question above, you need to rerun juicer pre (with HiC alignment file and AGP file) and HiC map building (juicer_tools pre) whenever your AGP/FASTA file changed. I am not very familiar with HapHiC. What is the format of the HiC mapping file? If it is a BAM or BED, you should be able to seamlessly run juicer pre with it.
Chenxi
— Reply to this email directly, view it on GitHub https://github.com/c-zhou/yahs/issues/4#issuecomment-2429219043, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TCIGG4U6R62FV65E73Z4ZD2DAVCNFSM6AAAAABQEKZBJOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMRZGIYTSMBUGM . You are receiving this because you were mentioned.Message ID: @.***>
Hi Chenxi,
It looks like yahs will really speed up our scaffolding efforts - so far the scaffolded fastas are looking great. Awesome work! However, I'm having trouble creating the correct input files for our manual curation phase, editing the scaffolds in Juicebox. Our main goal is to relate the underlying contigs (especially when we use the yahs flag --no-contig-ec) to the assembled scaffolds and hic map.
Using your provided
juicebox_pre
program and the Juiceboxjuicer pre
, the resulting .hic and .assembly files are not correctly editable in Juicebox. I think SALSA users are having a similar issue https://github.com/marbl/SALSA/issues/154I think you can only create a draft assembly for editing with
run-assembly-visualizer.sh
(https://github.com/aidenlab/3d-dna/blob/master/visualize/run-asm-visualizer.sh). Normally our workflow looks like this. makeAgpFromFasta and agp2assembly.py are 3d-dna scripts. Matlock is provided by Phase Genomics - similar concept to your juicebox_pre to convert the alignments to alignments_sorted.txt.It seems like I should be able to substitute the alignments_sorted.txt for phasehic.sorted.links.txt in our workflow, but alignments_sorted.txt is missing some columns. Maybe we just need to figure out how to fill these columns?
Not sure if this is a very clear question. In short, can you provide any guidance on creating a .assembly and .hic file for the Juicebox run-assembly-visualizer.sh tool?
Thank you! Amanda