ShunOuchi / GreenHill

De novo chromosome-level scaffolding and phasing tool using Hi-C
GNU General Public License v3.0

How to edit the assembly in the juicebox app? #32

Closed Isoris closed 4 months ago

Isoris commented 8 months ago

Hello,

I have finished running GreenHill and separated the haplotypes into hap0 and hap1. After running juicer pre -s "DpnII", followed by the assembly visualizer from the Aiden Lab, I get inter.hic in the aligned folder, plus base.fa, base.assembly and base.ctg_info.assembly.

I would like to know how to edit the assembly manually. Do we need to rename inter.hic to base.hic, open it as a map in Juicebox, load base.assembly in Juicebox as well, then export the modified assembly file and run the post-review step? Is that correct?

Could you please briefly explain how you load the data and manually curate the scaffolds within chromosome boundaries? I don't understand which steps to perform or in which order.

#! /bin/bash
#SBATCH -p memory
#SBATCH -N 1 -n 80
#SBATCH --mem=120GB
#SBATCH -t 5-00:00:00
#SBATCH -A proj5034
#SBATCH -J review_greenhill_with_juicer_CMA

source ~/.bashrc
mamba activate /tarafs/data/project/proj5057-AGBKUB/13-programs/env

path_juicer="/tarafs/data/project/proj5057-AGBKUB/13-programs/juicer/CPU"
path_3d="/tarafs/data/project/proj5057-AGBKUB/15-pipelines/3d-dna-201008"
path_greenhill="/tarafs/data/project/proj5057-AGBKUB/13-programs/GreenHill-1.1.0"

echo "$path_juicer"
echo "$path_3d"
echo "$path_greenhill"

# Sort sequences longest-first, index them for bwa, and write a name/length table
seqkit sort -lr query/GREENHILL_CMA_HIC_UL_hap0.filtered100000.fa >base.fa
bwa index base.fa >bwa_index.log 2>&1
seqkit fx2tab -nl base.fa >base.sizes

# Run Juicer, then build the .assembly files and the Hi-C map for Juicebox
juicer.sh -D "$path_juicer" -d "$PWD" -g base -s "DpnII" -z base.fa -p base.sizes >juicer.log.o 2>juicer.log.e
awk -f "${path_3d}/utils/generate-assembly-file-from-fasta.awk" base.fa >base.assembly 2>generate.log.e
"${path_3d}/visualize/run-assembly-visualizer.sh" base.assembly aligned/merged_nodups.txt >visualizer.log.o 2>visualizer.log.e
python "${path_greenhill}/utils/fasta_to_juicebox_assembly.py" base.fa >base.ctg_info.assembly

Thank you very much in advance for your help. Quentin.

ShunOuchi commented 7 months ago

Please load base.hic and base.ctg_info.assembly in Juicebox and make the modifications manually. Then export the review assembly and create a fasta file with run-asm-pipeline-post-review.sh.
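
The final step above could be sketched as follows. This is a dry run that only prints the command. The `-r <review.assembly> <fasta> <merged_nodups>` argument order follows the 3D-DNA usage message as I understand it, so please check it against your 3D-DNA version. `base.review.assembly` is a hypothetical name for the file exported from Juicebox, and `post_review_cmd` is a helper invented for this illustration.

```shell
#!/usr/bin/env bash
# Dry-run sketch: builds the 3D-DNA post-review command line without executing it.
# Assumptions: path_3d points at a 3D-DNA checkout; base.review.assembly is a
# hypothetical name for the assembly file exported from Juicebox after curation.

post_review_cmd() {
  local path_3d="$1" review="$2" fasta="$3" mnd="$4"
  printf '%s -r %s %s %s\n' \
    "${path_3d}/run-asm-pipeline-post-review.sh" "$review" "$fasta" "$mnd"
}

# Print the command for inspection; run it by hand once the paths look right.
post_review_cmd "${path_3d:-/path/to/3d-dna}" base.review.assembly base.fa aligned/merged_nodups.txt
```

Which FASTA to pass depends on which assembly file was loaded in Juicebox, so the file names here are placeholders, not a recommendation.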

Thank you

Isoris commented 7 months ago

So I need to run juicer pre to get base.hic, right?

OK, thank you. I will try it and let you know whether it works for me. Thank you very much. 😁🙏🏻

Isoris commented 7 months ago

Hi again,

In another post you suggested using out_Afterphase.fa for juicer pre.

Here's what I have got:

image

As you can see, for some pairs of pseudochromosomes (for instance ChrA from haplotype 0 and ChrA' from haplotype 1) there are interactions between the two haplotypes, which I take to mean that the assembly is correct.

But for some pairs of pseudochromosomes there are no Hi-C interactions (is it a gap?), and some pseudochromosome boundaries (in blue) are not aligned with the interactions on the Hi-C map.

  1. Am I supposed to remove the gaps (the green boundaries within which there are no red Hi-C interactions)? Or is this caused by the inability of Hi-C reads to map to the centromeres? I doubt that, because I have never seen anything like it in other Hi-C experiments.

     image

  2. For the small scaffolds in blue on the lower right (small chunks, not assembly debris, that have interactions with both haplotypes): am I supposed to move them into the chromosomes, and therefore duplicate them? Or do I need to work on each haplotype separately, using the out_Afterphase base.hic as a blueprint?

image

Thanks in advance for your help.

ShunOuchi commented 7 months ago

> But for some pairs of pseudochromosomes there are no Hi-C interactions (is it a gap?), and some pseudochromosome boundaries (in blue) are not aligned with the interactions on the Hi-C map. Am I supposed to remove the gaps (the green boundaries within which there are no red Hi-C interactions)? Or is this caused by the inability of Hi-C reads to map to the centromeres? I doubt that, because I have never seen anything like it in other Hi-C experiments.

The reason there are no red Hi-C interactions is that Juicer's mapping results do not include MAPQ=0 alignments (i.e. non-unique mappings). In haplotype-resolved genomes, some regions appear to have no Hi-C interactions because Hi-C reads cannot map uniquely to regions with high homology between haplotypes (not constructed separately in GreenHill). If you remap the Hi-C reads with MAPQ ≥ 0, you will see that these regions are not gaps.
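
One way to gauge how much signal a MAPQ filter hides is to count the read pairs in which either end has MAPQ 0. The following is a minimal sketch, assuming Juicer's long-format merged_nodups.txt with the MAPQ of the two ends in columns 9 and 12. The function name `count_mapq0_pairs` is hypothetical, and whether MAPQ 0 pairs are present in your file at all depends on how Juicer was run.

```shell
#!/usr/bin/env bash
# Count read pairs where either end has MAPQ 0.
# Assumption: long-format merged_nodups.txt (whitespace-separated), with the
# MAPQ of read end 1 in column 9 and of read end 2 in column 12.

count_mapq0_pairs() {
  # Prints "<pairs with a MAPQ-0 end><TAB><total pairs>"
  awk '{ total++; if ($9 == 0 || $12 == 0) zero++ }
       END { printf "%d\t%d\n", zero, total }' "$1"
}

# Usage (not run here): count_mapq0_pairs aligned/merged_nodups.txt
```

If a large share of pairs falls in this class, rebuilding the map with a MAPQ threshold of 0 should fill in the apparently empty regions; see the visualizer's usage message for how to set the threshold.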

> For the small scaffolds in blue on the lower right (small chunks, not assembly debris, that have interactions with both haplotypes): am I supposed to move them into the chromosomes, and therefore duplicate them? Or do I need to work on each haplotype separately, using the out_Afterphase base.hic as a blueprint?

Whether such a contig should be duplicated should be decided from its coverage. If the coverage is at the homozygous level, the contig should be duplicated and placed in both haplotypes. If the coverage is at the heterozygous level, the contig should be placed in only one haplotype.
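
The coverage rule above could be applied with a sketch like the one below. It assumes a tab-separated table of contig name and mean read depth, plus a known heterozygous (single-haplotype) depth. The 1.5x cutoff, the output labels, and the function name `classify_by_coverage` are illustrative assumptions, not part of GreenHill.

```shell
#!/usr/bin/env bash
# Label contigs as homozygous- or heterozygous-coverage.
# Assumptions: $1 is a table of "contig<TAB>mean_depth"; $2 is the expected
# heterozygous depth; 1.5x that depth is an arbitrary illustrative cutoff.

classify_by_coverage() {
  awk -v het="$2" 'BEGIN { FS = OFS = "\t" }
    {
      label = ($2 >= 1.5 * het) ? "homo:duplicate_to_both_haplotypes" \
                                : "hetero:one_haplotype_only"
      print $1, $2, label
    }' "$1"
}

# Usage (not run here): classify_by_coverage contig_depth.tsv 30
```

Contigs whose depth sits near the cutoff are ambiguous under this rule and are probably best checked by eye on the Hi-C map.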

Thank you

Isoris commented 6 months ago

Hello, thank you for the answer. Do you recommend editing the Hi-C map on the separate haplotypes or directly on the out_final file? I tried to do it and it took quite a long time. Can I also do it separately for hap0 and hap1, and is the result the same? I will let you know once I'm done, but for now it looks very good. Thank you so much for your help.
