Open cahuparo opened 3 months ago
Hi @cahuparo
It does seems a bit too good to be true.. In theory, if you have enough heterogeneity, you can phase entire chromosomes but it's hard to believe there were not phasing errors. Hapdup actually does produce the phase block coordinates as output - so you can see how fragmented these are.
You mentioned you are extracting k-mers from Illumina data - is it trio (e.g. paternal and maternal sequencing)? Does the process of building Meryl database has access to hapdup assmeblies?
Best, Misha
Hi @fenderglass,
Thanks! I am glad I asked you about this. Here I am focusing on both the evaluation of phased block coordinates (1) and the creation and validation of k-mer databases (2).
Upon inspecting the phased block coordinates produced by HapDup, we discovered entries with negative block sizes, indicating start positions greater than the end positions for some blocks. This was an unexpected finding that suggested potential data or processing issues but I am not an expert on this so I wanted to get your opinion:
Found negative block sizes in phased_blocks_hp1.bed:
chrom start end description haplotype_concordance block_size
1342 tig00000215 1121071 819781 ContigEnd NaN -301290
1359 tig00000228 51168 17795 ContigEnd NaN -33373
1387 tig00000323 73888 55384 ContigEnd NaN -18504
1395 tig00000368 402353 376823 ContigEnd NaN -25530
1403 tig00000430 409159 322414 ContigEnd NaN -86745
1498 tig00000861 49855 19556 ContigEnd NaN -30299
1562 tig00001057 49659 27517 ContigEnd NaN -22142
Found negative block sizes in phased_blocks_hp2.bed:
chrom start end description haplotype_concordance block_size
1339 tig00000215 1121071 819781 ContigEnd NaN -301290
1356 tig00000228 51173 17800 ContigEnd NaN -33373
1385 tig00000323 73667 55163 ContigEnd NaN -18504
1393 tig00000368 402352 376822 ContigEnd NaN -25530
1401 tig00000430 409158 322414 ContigEnd NaN -86744
1494 tig00000861 49852 19553 ContigEnd NaN -30299
1558 tig00001057 49657 27515 ContigEnd NaN -22142
Haplotype 1 - Min Size: -301290, Max Size: 788068, Average Size: 42778.970725995314
Haplotype 2 - Min Size: -301290, Max Size: 760058, Average Size: 42777.01705882353
Are these the phasing errors? Is it weird that there only this many? Are these the a result of inversion misassemblies or artefacts of the phasing process?
Yes, we used illumina data for that specific strain. No, we don't have trio data. Here is a more detail explanation of my approach:
canu -p isolate_name -d output_directory genomeSize=100m -nanopore-raw input_reads.fastq minReadLength=1000 minOverlapLength=1000 correctedErrorRate=0.010 maxThreads=32 maxMemory=500 useGrid=0 corOutCoverage=40 corMhapSensitivity=high batOptions='-dg 3 -db 3 -dr 1 -ca 500 -cp 50'
minimap2 -x map-ont -t 32 assembly.fasta correctedReads.fasta.gz | gzip -c - > minimap2.paf.gz
pbcstat minimap2.paf.gz
calcuts PB.stat > cutoffs 2>calcults.log
split_fa assembly.fasta > assembly.split.fasta
minimap2 -x asm5 assembly.split.fasta assembly.split.fasta | gzip -c - > assembly.split.self.paf.gz
purge_dups -2 -T cutoffs -c PB.base.cov assembly.split.self.paf.gz > dups.bed 2> purge_dups.log
get_seqs dups.bed assembly.fasta
NextPolish2 is used for genome assembly polishing to improve assembly quality using both long reads (ONT) and short reads (Illumina).
Generate a Database of Repetitive Elements: This step helps in optimizing the mapping of reads, especially in repetitive regions of the genome:
# Count k-mers and create a database
meryl count k=15 output merylDB "${ASSEMBLY_PATH}"
# Identify and export repetitive elements based on k-mer frequency
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
Polishing with NextPolish2: Use the repetitive elements information during read mapping for polishing.
# Example command to run NextPolish2, incorporating the handling of repetitive elements
nextPolish2 -r -t 32 "${ASSEMBLY_PREFIX}_map.sort.bam" "${ASSEMBLY_PATH}" k21.yak k31.yak > "${POLISH_DIR}/${ASSEMBLY_PREFIX}_polished.fa"
Identifying and managing repetitive elements before polishing can significantly reduce the chances of misassembly or errors in regions with high sequence similarity. By informing the polishing process about these repetitive elements, NextPolish2 can more accurately use the read data for correcting the assembly, leading to a higher quality final genome sequence. "In therory"
ragtag.py scaffold reference.fasta assembly.fasta -r -m 50000 -g 1 -o output_directory -t 32
# Align reads to the assembly using optimized parameters for repetitive regions
winnowmap -t threads -W repetitive_k15.txt -ax map-ont assembly.fasta canucorrectedReads.fastq.gz > output.sam
# Convert SAM to sorted and indexed BAM
samtools sort -@ threads -o mapping.bam output.sam
samtools index mapping.bam
# Run HapDup for haplotype resolution
hapdup --assembly assembly.fasta --bam mapping.bam --out-dir hapdup_output -t threads --rtype ont
The evaluation process involved the following steps:
K-mer databases are generated for both the sequencing reads (ONT and PE) and the haplotyped assemblies using Meryl. For Sequencing Reads (ONT and PE):
# Generate k-mer databases for Illumina (PE) reads for isolates i8733 and i8735
meryl k=18 count *8735*.fastq.gz output i8735_pe.meryl threads=32
# Repeat the process for ONT corrected reads
meryl k=18 count 8735.correctedReads.fasta.gz output i8735_ONT.meryl threads=32
For Haplotyped Assemblies: Use similar commands to generate k-mer databases for each of the haplotyped assemblies.
meryl k=18 count output hapdup_phased_1._1merylDBs hapdup_phased_1.fasta
meryl k=18 count output hapdup_phased_2.merylDBs hapdup_phased_2.fasta
meryl k=18 count output hapdup_dual_1merylDBs hapdup_dual_1.fasta
meryl k=18 count output hapdup_dual_2merylDBs hapdup_dual_2.fasta
Merqury performs quality evaluation by comparing the k-mer composition of the assemblies against the k-mer composition derived from sequencing reads. This comparison helps identify discrepancies and assess the quality and completeness of the assemblies.
hap1_unique.meryl
and hap2_unique.meryl
) for each isolate and assembly type using the meryl difference
command.hap1_ont_intersect.meryl
, hap1_pe_intersect.meryl
, hap2_ont_intersect.meryl
, hap2_pe_intersect.meryl
) using the meryl intersect
command.hap1_combined.meryl
and hap2_combined.meryl
) using the meryl union-sum
command.finally run merqury:
merqury.sh ${isolate}_pe.meryl ${base_name}_hap1.hapmers.meryl ${base_name}_hap2.hapmers.meryl $hap1_assembly $hap2_assembly ${base_name}_hap1_vs_hap2
Thank you for your time and consideration in this complex yet fascinating endeavor towards understanding haplotype-resolved genome assemblies.
Best,
Camilo
Hi Camilo,
Negative phased blocks are definitely unexpected - do you see those in a file in the Margin output dir that ends with .phaseset.bed
?
Hi Misha,
See attached the bed file in txt format because the bed extension can't be attached.
For some of those negative phased blocks:
tig00000368 16316 16316 MissingConcordancy H1-0_H2-0
tig00000368 268646 268646 MissingConcordancy H1-0_H2-0
tig00000368 338003 338003 MissingConcordancy H1-1_H2-1
tig00000368 376817 376821 MissingConcordancy H1-1_H2-0
tig00000368 402351 376821 ContigEnd
tig00000368 16316 16316 MissingConcordancy H1-0_H2-0
tig00000368 268646 268646 MissingConcordancy H1-0_H2-0
tig00000368 338003 338003 MissingConcordancy H1-1_H2-1
tig00000368 376817 376821 MissingConcordancy H1-1_H2-0
tig00000368 402351 376821 ContigEnd
What does that mean? Could you help me understand what is wrong and if the results can be trusted?
Thanks for your time!
Camilo
Hi Camilo,
It seems like a potential issue with Margin output - may be some kind of borderline case. But it should not affect the assembly results, as long as the phasing stats look reasonable (e.g. phasing N50 few 100s kb to Mb, phased block length comparable to genome size).
Hi @fenderglass,
I've adopted a strategy using Illumina reads alongside ONT R10 data to construct and evaluate phased genome assemblies. After assembling with
canu
, I followed withpurged_dups
,NextPolish2
, andRagTag,
lastly I usedhapdup
to produce a dual assembly. I'm now in the process of quality assessment for these "dual" assemblies. This is a diploid genome of highly heterozygous plant pathogen.Workflow and Issue Description: My workflow integrates
meryl
to derive unique hap-mer databases, followed by the generation of blob plots throughMerqury
. However, the resulting plots are exceptionally clean, leading to doubts about the phasing precision and hap-mer authenticity.Approach for Hapmer Creation and Blob Plot Generation:
Hap-mer Creation:
meryl
to count k-mers from Illumina data to create a k-mer database.meryl difference
, producinghap1_unique.meryl
andhap2_unique.meryl
.meryl intersect
to find overlaps with PE reads, creating ahap1_pe_intersect.meryl
(similarly for hap2).meryl union-sum
to get a combined set of hap-mer databases.Blob Plot Creation:
Merqury
was run to assess the quality of the dual assemblies.The concern arises when the blob plots show an overly distinct separation of hap-mers, potentially indicating over-filtering or other issues in the hap-mer generation pipeline.
Concern: The unusually distinct separation of hap-mers in the blob plots makes me think that it is "too good to be meaningful", indicating potential over-filtering or errors in the hap-mer derivation.
Request for Input: I'm reaching out for guidance on deciphering these blob plots and for suggestions on extra validation procedures. I'm particularly interested in any methodological alterations that could mitigate over-filtering or adapt to the unavailability of parental genomes.
Specific Questions:
Thank you for your time and consideration,
Camilo
PS. Here are the commands I used to run
hapdup
:Here are some stats of the assemblies from hapdup:
And these are some quick syntenic relationships between dual hap1 an hap2: