c-zhou / yahs

Yet another Hi-C scaffolding tool

MIT License

131 stars 19 forks source link

Creating .hic and .assembly for editing in juicebox #4

Closed Astahlke closed 2 years ago

Astahlke commented 2 years ago

Hi Chenxi,

It looks like yahs will really speed up our scaffolding efforts - so far the scaffolded fastas are looking great. Awesome work! However, I'm having trouble creating the correct input files for our manual curation phase, editing the scaffolds in Juicebox. Our main goal is to relate the underlying contigs (especially when we use the yahs flag --no-contig-ec) to the assembled scaffolds and hic map.

Using your provided juicebox_pre program and the Juicebox juicer pre, the resulting .hic and .assembly files are not correctly editable in Juicebox. I think SALSA users are having a similar issue https://github.com/marbl/SALSA/issues/154

I think you can only create a draft assembly for editing with run-assembly-visualizer.sh (https://github.com/aidenlab/3d-dna/blob/master/visualize/run-asm-visualizer.sh). Normally our workflow looks like this. makeAgpFromFasta and agp2assembly.py are 3d-dna scripts. Matlock is provided by Phase Genomics - similar concept to your juicebox_pre to convert the alignments to alignments_sorted.txt.

FA=yahs.out_scaffolds_final.fa
makeAgpFromFasta.py $FA genome.agp
agp2assembly.py genome.agp genome.assembly
bwa index $FA
bwa mem -5SP $FA *R1*fastq.gz *R2*fastq.gz | samblaster | samtools view -S -h -b -F 2316 > phasehic.aligned.bam
matlock bam2 juicer phasehic.aligned.bam phasehic.links.txt
sort -k2,2 -k6,6 phasehic.links.txt > phasehic.sorted.links.txt
run-assembly-visualizer.sh -p false genome.assembly phasehic.sorted.links.txt

It seems like I should be able to substitute the alignments_sorted.txt for phasehic.sorted.links.txt in our workflow, but alignments_sorted.txt is missing some columns. Maybe we just need to figure out how to fill these columns?

[amanda.stahlke@ceres yahs]$ head alignments_sorted.txt
0   scaffold_1  100002074   0   1   scaffold_1  143542799   1
0   scaffold_1  1000042 0   1   scaffold_1  1000223 1
0   scaffold_1  100004260   0   1   scaffold_1  100004229   1
0   scaffold_1  100004310   0   1   scaffold_1  100004310   1

[amanda.stahlke@ceres yahs]$ head phasehic.sorted.links
0 scaffold_1 100002220 0 16 scaffold_1 100002439 1 1 - - 1  - - -
0 scaffold_1 100002430 0 16 scaffold_1 100002441 1 1 - - 1  - - -
0 scaffold_1 100002827 0 16 scaffold_1 96197983 1 1 - - 1  - - -
0 scaffold_1 100002871 0 16 scaffold_1 100003104 1 1 - - 1  - - -

Not sure if this is a very clear question. In short, can you provide any guidance on creating a .assembly and .hic file for the Juicebox run-assembly-visualizer.sh tool?

Thank you! Amanda

c-zhou commented 2 years ago

Hi Amanda,

Now juicer_pre supports Juicebox assembly mode with -a parameter. For example, ./juicer_pre -a -o test out.bin scaffolds_final.agp contig.fa.fai, where scaffolds_final.agp is the output of YaHS (it is fine to run without --no-contig-ec). This results in four output files,

test.txt - BED format file for hic links
test.liftover - coordinate file between new and old contigs
test.assembly - assembly annotation file for Juicebox
test.assembly.agp - AGP file contains same information as test.assembly. Not a real AGP file as there are no gaps.

You can run juicer_tools pre with test.txt file - no need to run run-assembly-visualizer.sh. There is a line like [I::main] JUICER_PRE CMD: java -Xmx36G -jar ${juicer_tools} pre test.txt test.hic <(echo "assembly 183277074") in the log telling you how to run juicer_tools. The <(echo "assembly 183277074") part can be replaced by a chromosome size file containing only one line - assembly 183277074. Generally, .hic file in assembly mode should contain only one scaffold, here we call it assembly. There is another line like PRE_C_SIZE: assembly 183277074 in the log telling you the size of the scaffold assembly. The scripts/run_yahs.sh was updated to include an example to run the assembly mode at the end.

Best, Chenxi

c-zhou commented 2 years ago

I should have also mentioned that,

I have problems with Juicebox version 2.x. It looked fine to load the assembly annotation file, but no editing was allowed. Juicerbox version 1.x seems to work properly. I tried v1.8.8, v1.9.8 and v1.11.08.
For juicer_tools pre, it looks to me the latest version 1.22.01 is way slower than the older versions. I changed to v1.9.9 in the example script.
Since all contigs are merged in one scaffold in the assembly mode, the scaffold size easily exceeds the 2G limit of Juicebox. A scale factor was used in this case. Say if we have a 3G genome, the scale factor will be 2, and if we have a 6G genome, the scale factor will be 4 - always a power of 2, which is slightly different from the run-assembly-visualizer.sh script. This will lead to a genome assembly smaller than 2G. This is reflected in the .assembly file as well. But the .assembly.agp file contains the original size. Please remember to take it into account when you try to recover the FASTA file after editing.

Chenxi

gitcruz commented 2 years ago

Hi Chengxi and Amanda,

I have nicely produced an editable combination of .hic and .assembly files for two different vertebrate genomes (using the updated juicer_pre with -a option, etc.). They work pretty well with Juicebox, i can see the "contigs/scaffolds" within each super-scaffold.

I made some edits in one of them (fixing a few within-scaffold translocations and inversions). The I saved the .review.assembly file. However,I would like to generate a new fasta from it and this is hard without the mergenodups.txt file. Should I do something different to this? ./3d-dna/ run-asm-pipeline-post-review.sh –r draft.final.review.assembly draft.fasta mergednodups.txt

Ideally, i would like to generate .assembly, .agp and .hic of every round of curation documenting each decision taken.

Sorry, I am neophyte curator and I feel like I do have loads of related questions that i should post somewhere else. But here a few:

How to produce the new fasta after reviewing the assembly.
Remappings for curation against _yahs_scaffoldsfinal.fa should include supplemental and secondary alignments. Once they are done can i just get the binary file, .assembly and .hic? how?
Could we use yahs on a different assembly (e.g. salsa2) to simply generate a .assembly and .hic for juicebox?

Thanks for the help, Fernando

c-zhou commented 2 years ago

Hi Fernando,

I never used Juicebox curation mode, so I do not know the answers to those questions related. Sorry about that. But for the other questions, I will try to answer.

_2. Remappings for curation against yahs_scaffoldsfinal.fa should include supplemental and secondary alignments. Once they are done can i just get the binary file, .assembly and .hic? how?

From my understanding, I do not think you should map against the scaffolds if you need an editable .hic file. The Juicebox assembly mode always requires the hic reads mapped to contigs. I guess the best way to do this is to generate an AGP file for curated scaffolds and use it as input to juicer_pre. I mean still use ./juicer_pre -a -o test out.bin scaffolds_final.agp contig.fa.fai. The only difference is to change the AGP file. If you want to generate a non-editable .hic file, you can run juicer_pre without -a option. But in this case, the output text file needs to be manually sorted by the second and then sixth column (see the example in scripts/run_yahs.sh). Besides, YaHS does accept BAM files as input. The supplementary and secondary alignments will be all ignored. As well as marked duplicates.

3. Could we use yahs on a different assembly (e.g. salsa2) to simply generate a .assembly and .hic for juicebox?

Yes and no. As long as you have an AGP file, you should be fine to run juicer_pre. However, the AGP file from SALSA2 needs some preprocessing when there are contig breaks. For example, if you have a break ptg000087l_1 3548799, where the first column is the contig id and the second column is the break position, then there will be two lines in the final AGP file. Something like this,

scaffold_5      34451147        37999945        7       W       ptg000087l_1_1  1       3548799 +
scaffold_16     19587586        21092835        5       W       ptg000087l_1_2  1       1505250 -

You see the split contig ended up with two new contigs ptg000087l_1_1 and ptg000087l_1_2. There are two ways to fix this - either remap reads to a contig FASTA file with old contigs replaced by these new contigs or edit the AGP file to a standard format. In this case, it should be something like,

scaffold_5      34451147        37999945        7       W       ptg000087l_1  1       3548799 +
scaffold_16     19587586        21092835        5       W       ptg000087l_1  3548800 5054049 -

Best, Chenxi

Astahlke commented 2 years ago

These directions to use the underlying contigs work great, Chenxi. Thanks for the enhancement and additional tips - I really appreciate it!

gitcruz commented 2 years ago

Hi,

I am consistently getting this error for a mammalian genome (2.5-3Gb) when I try to get the .hic file using juicer_tools. java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment. This is a juicer error, that at some point seemed to be solvable with an upgrade (https://github.com/aidenlab/juicer/issues/231#issuecomment-879424793)

I realized i was using the latest version (Juicer Tools Version 1.22.01). Just in case this morning i did a created a new Juicer install where I basically copied previous Juicer installation and replaced juicer_tools.jar in the juicer scripts folder with link to v1.9.9.

wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar ; ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicer_tools.jar

Same error. Although contacts should be there, because the resulting genome assembly is extremely good, by aligning it to a close species chrom-level assembly i see very few rearrangements and it has very similar chromosome sizes (super-scaffolds). I would like to inspect it and verify a couple of inversions. Do you have any idea on how to fix this problem for juicebox compatibility? I know it goes a bit beyond yahs support...but it would be necessary to curate and publish these genomes.

Perhaps is due to the way I am running yahs, not exactly using run_yahs.sh build my own .sh placing the commands step by step (see below). However for other two genomes 1-1.6Gb I got the .hic and .assembly editable files.

Thanks, Fernando

PS. Did not find time yet to get a .fasta from the .review. Want to check mapping all reads to the input assembly with juicer and see if the mnd file works. I'll update on this here as soon as i can

run_yash_fernando.sh _CWD=$PWD;

01. Load YAHS & run assembly with BAM file

export PATH=/scratch/project/devel/aateam/src/yahs_2021_12_21:$PATH;

now link here /scratch/project/devel/aateam/bin/yahs

mkdir -p out;

yahs assembly.fa mapped.PT.name_sorted.bam -o out/yahs

02. Generate .assembly HiC contact maps for juicer

cd out;

#Loading juicer to tolerate -a assembly option
                          #Example: juicer_pre -a -o test out.bin scaffolds_final.agp contig.fa.fai

#Load Juicer
#LOAD MODULES: COREUTILS essential for paralell sort. JAVA to run juicer_tools
   module purge; module load gcc/6.3.0 PYTHON/2.7.11 java/1.8.0u31 COREUTILS/8.21;
   source /home/devel/fcruz/bin/sourcefiles/initialize_conda.sh; conda activate /home/devel/talioto/miniconda3/envs/juicer
   #also add juicer_tools to the PATH
   export PATH=/scratch/project/devel/aateam/src/3d-dna/:/scratch/project/devel/aateam/src/juicer1.6/misc/:/scratch/project/devel/aateam/src/juicer1.6/SLURM/scripts/:$PATH

Generate .assembly for juicebox using - Run Yahs' juicer_pre 2021-12-20 release accepts -a option

 juicer_pre -a -o yahs_juicebox yahs.bin yahs_scaffolds_final.agp  ../assembly.fa.fai

Generate a new .hic file once the length of the total assembly is known

 export PATH=/scratch/project/devel/aateam/bin/:$PATH;
 asmlen=$(fastalength yahs_scaffolds_final.fa | gawk '{t+=$1}END{print t}')
 echo $asmlen ;
 java -jar -Xmx155G /scratch/project/devel/aateam/src/juicer1.6/SLURM/scripts/juicer_tools.jar pre --threads 24  yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly $asmlen")_

c-zhou commented 2 years ago

Hello Fernando,

Sorry for the late reply. I was not very active during the break.

I guess your problem here is $asmlen. You cannot use the original assembly size. The maximum that Juicebox can handle is 2Gb, and apparently, your assembly exceeds this limit. What you should use is the scaled size which can be found in the juicer_pre log file. As aforementioned should be something like PRE_C_SIZE: assembly 183277074. Alternatively, you can calculate from the assembly file where the contig sizes are already scaled.

Best, Chenxi

gitcruz commented 2 years ago

Thanks Chenxi,

I wasn't aware of the 2Gb limit for juicebox (I guess is just to allow visualization). Sorry, because you left all this fairly written in scripts/run_yash.sh

It is a bit annoying to follow up the original coordinates in the scaled view, but now it works and i can make edits on genomes larger than 2Gb!

In my case the scaling is about half the assembly length (2424933882 bp), thus PRE_C_SIZE: assembly 1212463441 - Do this makes sense to you?

I am a bit confused with a small detail. So far I was passing to juicer_pre the contigs.fai and then running juicer_tools as stated above (https://github.com/c-zhou/yahs/issues/4#issuecomment-997497823) without using chrom.sizes at all...

java -jar -Xmx155G juicer_tools.jar pre --threads 24 yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly 1212463441")

But I noticed you also pass a file with the scaled chrom sizes instead the total PRE_C_SIZE assembly.

Why is this? Do this two approaches have different implications?

approach 1 will be dierctly feeding the total scaled length (what i did now): asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3) (${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.chrom.sizes) && (mv ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.hic)

approach 2 what your wrote at the end of the run_yash.sh:

this is to generate input file for juicer_tools - assembly (JBAT) mode (-a)

../juicer_pre -a -o ${outdir}/${out}_JBAT ${outdir}/${out}.bin ${outdir}/${out}_scaffolds_final.agp ${contigs}.fai 2>${outdir}/tmp_juicer_pre_JBAT.log cat ${outdir}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- >${outdir}/${out}_JBAT.chrom.sizes (${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.chrom.sizes) && (mv ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.hic)

Thanks again, Fernando

c-zhou commented 2 years ago

In my case the scaling is about half the assembly length (2424933882 bp), thus PRE_C_SIZE: assembly 1212463441 - Do this makes sense to you?

Yes, it makes sense. juicer_pre does something slightly different from the JuiceBox run-assembly-visualizer.sh script where it will always be scaled to 2.1Gb. So for your 2424933882bp assembly, the scale factor used to recover the original coordinates will be 2424933882/2100000000=1.155. juicer_pre will use a number of a power of 2. In your case, it is 2. I do this mainly because it is easier to scale the coordinates back.

For the other question, cat ${outdir}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- >${outdir}/${out}_JBAT.chrom.sizes; ${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.chrom.sizes and ${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part <(cat ${outdir}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2-) are actually equvilent. It is a bash trick. In the first case, you write a chromosome size file and use as the last parameter for juicer_tools. In the second case, you stream in the chromosome sizes to the command from STDIN (what < does).

Best, Chenxi

c-zhou commented 2 years ago

I justed noticed the way you did it,

asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3) (${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.chrom.sizes) && (mv ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.hic)

Did you also generate the ${outdir}/${out}_JBAT.chrom.sizes file as what run_yash.sh does? Otherwise, it does not make sense to me. This will work though (with STDIN stream as aforementioned),

asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3) (${juicer_tools} ${outdir}/${out}_JBAT.txt ${outdir}/${out}_JBAT.hic.part <(echo "assembly ${asmlen}")) && (mv ${outdir}/${out}_JBAT.hic.part ${outdir}/${out}_JBAT.hic)

Chenxi

gitcruz commented 2 years ago

Hi Chenxi,

Thanks for your quick reply.

You're right for approach 1 does not makes sense the way i wrote it. I copied the lines wrong. What I exactly did was: java -jar -Xmx155G juicer_tools.jar pre --threads 24 yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly 1212463441")

So I am just passing the total scaled length to juicer_tools.jar. For another genome I previously stored the scaled size in $asmlen (see below) and pass it the same way with the echo. I was not using the .chrom.sizes at all.

To store asmlen: asmlen=$(cat tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE: assembly" | cut -d' ' -f 3)

My question is if i should pass all the scaled scaffold lengths (.chrom.sizes) or it works well using <(echo "assembly $asmlen") instead. That is what i understood here https://github.com/c-zhou/yahs/issues/4#issuecomment-997497823 when you wrote:

java -Xmx36G -jar ${juicer_tools} pre test.txt test.hic <(echo "assembly 183277074")

Does it work equally by passing the total scaled length?

Thanks, Fernando

c-zhou commented 2 years ago

Yes, the total scaled size should be used. java -jar -Xmx155G juicer_tools.jar pre --threads 24 yahs_juicebox.txt yahs_juicebox.hic <(echo "assembly 1212463441") is the right way to do it. In terms of run_yahs.sh, cat ${outdir}/tmp_juicer_pre_JBAT.log | grep "PRE_C_SIZE" | cut -d' ' -f2- >${outdir}/${out}_JBAT.chrom.sizes will write a single line assembly 1212463441 to the .chrom.sizes file, so it is the same as <(echo "assembly $asmlen").

Chenxi

gitcruz commented 2 years ago

Great,

I wasn't understanding well the run_yahs.sh command.

Thanks for clarifying, Fernando

schellt commented 2 years ago

Hi @c-zhou, thank you for this very nice tool and the implementation for editable hic files compatible with juicebox. Right now, I am still struggling to create the fasta file after manual curation in juicebox. Do you, @Astahlke or @gitcruz, have an idea how to achieve this? The standard workflow I used before is to export the curated assembly file from juicebox and run run-asm-pipeline-post-review.sh from 3d-dna. To do so, one needs (besides the fasta and the reviewed assembly file) the information of the Hi/Omni-C read mapping as merged_nodups.txt file which is produced by juicer. After checking how this merged_nodups.txt file is created in juicer, I tried to convert the bam file but run-asm-pipeline-post-review.sh crashes and I can't figure out why. I would be very thankful for any help.

c-zhou commented 2 years ago

Hi @schellt @Astahlke @gitcruz,

In the latest commit https://github.com/c-zhou/yahs/commit/63baff3f2c82bc4d44b38cf6730a7de352e66239 and https://github.com/c-zhou/yahs/commit/c854cb764cb3f50056d576baf4c208e4cb7d99fa, I tried to make YaHS more friendly to manual editing with Juicebox JBAT. Changes including,

the previous juicer_pre tool was renamed to juicer. Now juicer pre command is equivalent to juicer_pre.
added juicer post command to generate AGP and FASTA files with the review assembly file (generated by JBAT)

Please check Manual curation with Juicebox (JBAT) section in README for detailed information.

I should also point out that, this version is not fully compatible with the previous versions. However, if your total genome assembly size is smaller than 2Gb (where the scale factor is 1), the review assembly file generated by JBAT is all good. You only need to rerun juicer pre command to create those files and run juicer post with the review assembly file. If your genome size is larger than 2Gb, you might need to redo the curation. Alternatively (in case it is way too much work to redo the curation), you can manually edit the review assembly file to change the size of contigs (the third column of those lines starting with >) to match those in the liftover.agp file generated by juicer pre. The contig size in the liftover file should be scale factor times the size in your review assembly (this is not always true due to integer division rounding, and in this case, do make sure the size matches). The contig breaks created in manual editing might be something you need to pay special attention to, where you will find debris lines.

I only did some simple tests, if you find any problems or have any further questions, please let me know.

Best, Chenxi

schellt commented 2 years ago

Hi @c-zhou , thank you very much for this implementation. I just curated a ~700Mb assembly with Juicebox from scratch and everything worked like a charm.

LiaOb21 commented 2 years ago

Hi, Thank you so much for this tool, it is really fast and gives very good results compared to the other tools available!

I am trying to generate the Hi-C contact map to be visualised in JBAT. The first step of converting the HiC alignment file (BIN) to a file required by juicer_tools worked well with the following command:

(juicer pre ../yahs.out.bin ../yahs.out_scaffolds_final.agp ../assembly.fasta.fai | sort -k2,2d -k6,6d -T ./ --parallel=32 -S100G | awk 'NF' > alignments_sorted.txt.part) && (mv alignments_sorted.txt.part alignments_sorted.txt)

However, when trying to go ahead with the following command: (java -jar -Xmx100G ~/bin/juicer_tools.1.9.9_jcuda.0.8.jar pre alignments_sorted.txt out.hic.part ../yahs.out_scaffolds_final.fa.fai) && (mv out.hic.part out.hic)

I obtain this error:

Not including fragment map
Start preprocess
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
    at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
    at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
    at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
    at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
    at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
    at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
    at juicebox.tools.HiCTools.main(HiCTools.java:96)

I tried to run the command both using juicer_tools.1.9.9_jcuda.0.8.jar and the last version of juicer_tools.jar, and the error is always the same. I am posting here because I see that @gitcruz got the same error, but I am not running the full pipeline (run_yahs.sh), I am instead proceeding step by step. I know that this error comes from juicer, but I do not really understand what I am doing wrong. I do not think it is a matter of genome size, since my genome should be around 135Mb.

I really appreciate any help! Thank you in advance.

Best regards,

Lia

c-zhou commented 2 years ago

Hello Lia,

Thanks for using YaHS.

In this command java -jar -Xmx100G ~/bin/juicer_tools.1.9.9_jcuda.0.8.jar pre alignments_sorted.txt out.hic.part ../yahs.out_scaffolds_final.fa.fai, you can try to replace the last input ../yahs.out_scaffolds_final.fa.fai with a chromosome size file, which can be generated by taking the first two columns of the index file, i.e, cut -f 1-2 ../yahs.out_scaffolds_final.fa.fai >../yahs.out_scaffolds_final.fa.chrom.sizes. A safer way is to generate it from the juicer pre log file as what the run_yahs.sh script does (by grep "PRE_C_SIZE"), which will deal with >2G chromosomes. Either way is find for your genome.

Best, Chenxi

c-zhou commented 2 years ago

Hi @c-zhou , thank you very much for this implementation. I just curated a ~700Mb assembly with Juicebox from scratch and everything worked like a charm.

Thanks @schellt for your feedback. Great to know it works. Chenxi

c-zhou commented 2 years ago

I will close this issue for now. Feel free to reopen it whenever needed.

gitcruz commented 2 years ago

Hi @LiaOb21,

I also remember having the error below

I obtain this error:


Not including fragment map
Start preprocess
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
  at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
  at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
  at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
  at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
  at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
  at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
  at juicebox.tools.HiCTools.main(HiCTools.java:96)

This is more or less what worked for me to get rid of it:

Make sure you install Juicer version 1.6 (or any other recommended buy chenxi).
Then inside this Juicer-for-Yahs/SLURM/scripts directory you should remove juicer_tools.jar
Install the recommended version inside Juicer-for-Yahs/SLURM/scripts by doing: _wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar_
Then link it as juicer_tools.jar inside that directory doing _ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicertools.jar
Yahs will look for _juicertools.jar So before running Yahs export it to the PATH. export PATH=/path-to-Juicer-for-Yahs/SLURM/scripts/:$PATH

Run Yahs again.

Hope it helps, Fernando

LiaOb21 commented 2 years ago

Hi @LiaOb21,

I also remember having the error below

I obtain this error:

Not including fragment map
Start preprocess
Writing header
Writing body
java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment.
    at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650)
    at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419)
    at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832)
    at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582)
    at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346)
    at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116)
    at juicebox.tools.HiCTools.main(HiCTools.java:96)

This is more or less what worked for me to get rid of it:

1. Make sure you install Juicer version 1.6 (or any other recommended buy chenxi).

2. Then inside this Juicer-for-Yahs/SLURM/scripts directory you should remove juicer_tools.jar

3. Install the recommended version inside Juicer-for-Yahs/SLURM/scripts by doing: _wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar_

4. Then link it as juicer_tools.jar inside that directory doing
   _ln -s juicer_tools.1.9.9_jcuda.0.8.jar  juicer_tools.jar_

5. Yahs will look for _juicer_tools.jar_  So before running Yahs export it to the PATH. export PATH=/path-to-Juicer-for-Yahs/SLURM/scripts/:$PATH

Run Yahs again.

Hope it helps, Fernando

Hi Fernando,

Thank you so much for your help. I resolved following the suggestion of @c-zhou about the chromosome size file. Thank you anyway.

Cheers,

Lia

gitcruz commented 2 years ago

Cool! Fernando

El mar., 21 jun. 2022 11:28, LiaOb21 @.***> escribió:

Hi @LiaOb21 https://github.com/LiaOb21,

I also remember having the error below

I obtain this error:

Not including fragment map Start preprocess Writing header Writing body java.lang.RuntimeException: No reads in Hi-C contact matrices. This could be because the MAPQ filter is set too high (-q) or because all reads map to the same fragment. at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.mergeAndWriteBlocks(Preprocessor.java:1650) at juicebox.tools.utils.original.Preprocessor$MatrixZoomDataPP.access$000(Preprocessor.java:1419) at juicebox.tools.utils.original.Preprocessor.writeMatrix(Preprocessor.java:832) at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:582) at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:346) at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:116) at juicebox.tools.HiCTools.main(HiCTools.java:96)

This is more or less what worked for me to get rid of it:

Make sure you install Juicer version 1.6 (or any other recommended buy chenxi).

Then inside this Juicer-for-Yahs/SLURM/scripts directory you should remove juicer_tools.jar

Install the recommended version inside Juicer-for-Yahs/SLURM/scripts by doing: _wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar_

Then link it as juicer_tools.jar inside that directory doing _ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicertools.jar

Yahs will look for _juicertools.jar So before running Yahs export it to the PATH. export PATH=/path-to-Juicer-for-Yahs/SLURM/scripts/:$PATH

Run Yahs again.

Hope it helps, Fernando

Hi Fernando,

Thank you so much for your help. I resolved following the suggestion of @c-zhou https://github.com/c-zhou about the chromosome size file. Thank you anyway.

Cheers,

Lia

— Reply to this email directly, view it on GitHub https://github.com/c-zhou/yahs/issues/4#issuecomment-1161496288, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB34KVLGQSI5SZNHKFJVQFDVQGDLHANCNFSM5KKZ6YWA . You are receiving this because you were mentioned.Message ID: @.***>

malvaradol commented 4 weeks ago

Hi @c-zhou!!

This thread helped me a lot with creating the files for manual curation. However, after manual curation, I'm still having issues producing a final assembly for a reason that its not directly related to YaHS or the juicer version distributed with it, but I want to know if there's a way around with the post-curation steps.

Overall, the assembly looks good, but I have a lot of sequences that are outside of the contact map (see image). I ran the assembly with Hifiasm with the Hi-C integrated mode, and then ran purge_dups manually setting the cutoffs. After that I mapped the Hi-C reads to the assemblies and ran YaHS. I'm no expert in the matter, so I would like to know if there is a way to remove those sequences outside the contact map (sequences after the 1.5Gb position approx) after manual curation and produce a new Hi-C contact map without those sequences, without having to rerun the whole assembly pipeline again.

Thanks for your insights with this situation, and thanks for YaHS as well, awesome program!!!

Mateo

Isoris commented 3 weeks ago

Is there any way to get the post review done without hic mapping like I use HapHic which is another scaffolder and so I don't have the merged_nodups.txt but I have reviewed my assembly and I have a Hi-C mapping file? or should I rerun juicer pre ?

c-zhou commented 3 weeks ago

Hi Mateo @malvaradol,

Sorry for the delayed reply. You can directly edit the AGP file if you want to remove some sequences from your assembly. For example, if you only want to keep the first 1000 scaffolds, you can find the last line of the scaffold_1000 in your AGP file and do a head to only keep the rows up to that specific line, i.e., something like head -2000 OLD.agp >NEW.agp. Once you get the new AGP file, you can generate a FASTA file with agp_to_fasta and make new HiC maps as above.

A more general approach is to make a file for the list of scaffolds you want to keep, e.g., scaffod_1, scaffold_2,...., one scaffold per line, in the SCF.list. Then you could run something like awk 'NR==FNR{s[$1]=1; next}{if(s[$1]) print}' SCF.list OLD.agp >NEW.agp

Regarding you HiC map, I am not sure what happened to those small fragments. They could be contaminations or could be very repetitive sequences that did not get aligned very well.

Best, Chenxi

Isoris commented 3 weeks ago

Also for the people who have the error about that one particular fragment has a 0 value, you need to change the 0 to 1. So the contig/debris/fragment/entry in the AGP will be 1 base long and not 0 and this will make the script to get fasta from agp work well.

On Tue, Oct 22, 2024, 6:43 PM Chenxi Zhou @.***> wrote:

Hi Mateo @malvaradol https://github.com/malvaradol,

Sorry for the delayed reply. You can directly edit the AGP file if you want to remove some sequences from your assembly. For example, if you only want to keep the first 1000 scaffolds, you can find the last line of the scaffold_1000 in your AGP file and do a head to only keep the rows up to that specific line, i.e., something like head -2000 OLD.agp >NEW.agp. Once you get the new AGP file, you can generate a FASTA file with agp_to_fasta and make new HiC maps as above.

A more general approach is to make a file for the list of scaffolds you want to keep, e.g., scaffod_1, scaffold_2,...., one scaffold per line, in the SCF.list. Then you could run something like awk 'NR==FNR{s[$1]=1; next}{if(s[$1]) print}' SCF.list OLD.agp >NEW.agp

Regarding you HiC map, I am not sure what happened to those small fragments. They could be contaminations or could be very repetitive sequences that did not get aligned very well.

Best, Chenxi

— Reply to this email directly, view it on GitHub https://github.com/c-zhou/yahs/issues/4#issuecomment-2429053590, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TBF2O6TRAWOYQEVDS3Z4Y26PAVCNFSM6AAAAABQEKZBJOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMRZGA2TGNJZGA . You are receiving this because you commented.Message ID: @.***>

c-zhou commented 3 weeks ago

Thanks for the comment @Isoris.

Regarding your HiC map question above, you need to rerun juicer pre (with HiC alignment file and AGP file) and HiC map building (juicer_tools pre) whenever your AGP/FASTA file changed. I am not very familiar with HapHiC. What is the format of the HiC mapping file? If it is a BAM or BED, you should be able to seamlessly run juicer pre with it.

Chenxi

Isoris commented 3 weeks ago

Hello C-Zhou/Yahs, Yes finally I have got it to work. Thank you.

Le mar. 22 oct. 2024 à 19:59, Chenxi Zhou @.***> a écrit :

Thanks for the comment @Isoris https://github.com/Isoris.

Regarding your HiC map question above, you need to rerun juicer pre (with HiC alignment file and AGP file) and HiC map building (juicer_tools pre) whenever your AGP/FASTA file changed. I am not very familiar with HapHiC. What is the format of the HiC mapping file? If it is a BAM or BED, you should be able to seamlessly run juicer pre with it.

Chenxi

— Reply to this email directly, view it on GitHub https://github.com/c-zhou/yahs/issues/4#issuecomment-2429219043, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASYS5TCIGG4U6R62FV65E73Z4ZD2DAVCNFSM6AAAAABQEKZBJOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMRZGIYTSMBUGM . You are receiving this because you were mentioned.Message ID: @.***>