KolmogorovLab / hapdup

Pipeline to convert a haploid assembly into diploid
Other
85 stars 8 forks source link

Too good to be true? understanding hapdup results #42

Open cahuparo opened 3 months ago

cahuparo commented 3 months ago

Hi @fenderglass,

I've adopted a strategy using Illumina reads alongside ONT R10 data to construct and evaluate phased genome assemblies. After assembling with canu, I followed with purged_dups, NextPolish2, and RagTag, lastly I used hapdup to produce a dual assembly. I'm now in the process of quality assessment for these "dual" assemblies. This is a diploid genome of highly heterozygous plant pathogen.

Workflow and Issue Description: My workflow integrates meryl to derive unique hap-mer databases, followed by the generation of blob plots through Merqury. However, the resulting plots are exceptionally clean, leading to doubts about the phasing precision and hap-mer authenticity.

Approach for Hapmer Creation and Blob Plot Generation:

  1. Hap-mer Creation:

    • Used meryl to count k-mers from Illumina data to create a k-mer database.
    • Differentiating k-mers into haplotype-specific sets using meryl difference, producing hap1_unique.meryl and hap2_unique.meryl.
    • Intersecting unique haplotype k-mers with read data using meryl intersect to find overlaps with PE reads, creating a hap1_pe_intersect.meryl (similarly for hap2).
    • Merging the intersected k-mer sets with meryl union-sum to get a combined set of hap-mer databases.
  2. Blob Plot Creation:

    • With the combined hapmer databases, Merqury was run to assess the quality of the dual assemblies.
    • Merqury used the hapmer databases to calculate QV (Quality Value) scores and generate blob plots.
head i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_hap1_vs_hap2.qv
i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_1      207900  134365932       40.6539 8.60222e-05
i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_2      203620  134563709       40.7507 8.41261e-05
Both    411520  268929641       40.7021 8.50734e-05

The concern arises when the blob plots show an overly distinct separation of hap-mers, potentially indicating over-filtering or other issues in the hap-mer generation pipeline.

Concern: The unusually distinct separation of hap-mers in the blob plots makes me think that it is "too good to be meaningful", indicating potential over-filtering or errors in the hap-mer derivation.

i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_hap1_vs_hap2 hapmers blob

i8735_canu_hetopt_purged_polished_scaffolded_hapdup_dual_hap1_vs_hap2 spectra-asm fl

Request for Input: I'm reaching out for guidance on deciphering these blob plots and for suggestions on extra validation procedures. I'm particularly interested in any methodological alterations that could mitigate over-filtering or adapt to the unavailability of parental genomes.

Specific Questions:

Thank you for your time and consideration,

Camilo

PS. Here are the commands I used to run hapdup:

# Step 1: Align reads to the assembly using Winnowmap and generate SAM file
$WINNOWMAP_PATH -t $THREADS -W "$REPEAT_FILE" -ax map-ont "$ASSEMBLY_PATH" "$CANU_CORRECTED_READS" > "$OUTPUT_DIR/output.sam"

# Step 2: Convert SAM to BAM and sort
samtools view -@ $THREADS -bS "$OUTPUT_DIR/output.sam" | samtools sort -@ $THREADS -o "$OUTPUT_DIR/mapping.bam"

# Step 3: Index the BAM file
samtools index -@ $THREADS "$OUTPUT_DIR/mapping.bam"

# Step 4: Run HapDup
/bin/singularity run -B /usr/lib/locale/,/data,/tmp,$PWD docker://mkolmogo/hapdup:0.12 hapdup --assembly "$(basename "$ASSEMBLY_PATH")" --bam mapping.bam --out-dir "$OUTPUT_DIR/hapdup" -t $THREADS --rtype ont

Here are some stats of the assemblies from hapdup:

Screen Shot 2024-03-13 at 17 15 53 PM

And these are some quick syntenic relationships between dual hap1 an hap2:

image
mikolmogorov commented 3 months ago

Hi @cahuparo

It does seems a bit too good to be true.. In theory, if you have enough heterogeneity, you can phase entire chromosomes but it's hard to believe there were not phasing errors. Hapdup actually does produce the phase block coordinates as output - so you can see how fragmented these are.

You mentioned you are extracting k-mers from Illumina data - is it trio (e.g. paternal and maternal sequencing)? Does the process of building Meryl database has access to hapdup assmeblies?

Best, Misha

cahuparo commented 3 months ago

Hi @fenderglass,

Thanks! I am glad I asked you about this. Here I am focusing on both the evaluation of phased block coordinates (1) and the creation and validation of k-mer databases (2).

(1) Evaluating Phased Block Coordinates

Upon inspecting the phased block coordinates produced by HapDup, we discovered entries with negative block sizes, indicating start positions greater than the end positions for some blocks. This was an unexpected finding that suggested potential data or processing issues but I am not an expert on this so I wanted to get your opinion:

Found negative block sizes in phased_blocks_hp1.bed:
            chrom    start     end description haplotype_concordance  block_size
1342  tig00000215  1121071  819781   ContigEnd                   NaN     -301290
1359  tig00000228    51168   17795   ContigEnd                   NaN      -33373
1387  tig00000323    73888   55384   ContigEnd                   NaN      -18504
1395  tig00000368   402353  376823   ContigEnd                   NaN      -25530
1403  tig00000430   409159  322414   ContigEnd                   NaN      -86745
1498  tig00000861    49855   19556   ContigEnd                   NaN      -30299
1562  tig00001057    49659   27517   ContigEnd                   NaN      -22142
Found negative block sizes in phased_blocks_hp2.bed:
            chrom    start     end description haplotype_concordance  block_size
1339  tig00000215  1121071  819781   ContigEnd                   NaN     -301290
1356  tig00000228    51173   17800   ContigEnd                   NaN      -33373
1385  tig00000323    73667   55163   ContigEnd                   NaN      -18504
1393  tig00000368   402352  376822   ContigEnd                   NaN      -25530
1401  tig00000430   409158  322414   ContigEnd                   NaN      -86744
1494  tig00000861    49852   19553   ContigEnd                   NaN      -30299
1558  tig00001057    49657   27515   ContigEnd                   NaN      -22142
Haplotype 1 - Min Size: -301290, Max Size: 788068, Average Size: 42778.970725995314
Haplotype 2 - Min Size: -301290, Max Size: 760058, Average Size: 42777.01705882353

Are these the phasing errors? Is it weird that there only this many? Are these the a result of inversion misassemblies or artefacts of the phasing process?

(2) Revisiting the entire approach including the kmer database creation:

Yes, we used illumina data for that specific strain. No, we don't have trio data. Here is a more detail explanation of my approach:

Step 1: Genome Assembly with Canu

canu -p isolate_name -d output_directory genomeSize=100m -nanopore-raw input_reads.fastq minReadLength=1000 minOverlapLength=1000 correctedErrorRate=0.010 maxThreads=32 maxMemory=500 useGrid=0 corOutCoverage=40 corMhapSensitivity=high batOptions='-dg 3 -db 3 -dr 1 -ca 500 -cp 50'

Step 2: Purging Duplicates

minimap2 -x map-ont -t 32 assembly.fasta correctedReads.fasta.gz | gzip -c - > minimap2.paf.gz
pbcstat minimap2.paf.gz
calcuts PB.stat > cutoffs 2>calcults.log
split_fa assembly.fasta > assembly.split.fasta
minimap2 -x asm5 assembly.split.fasta assembly.split.fasta | gzip -c - > assembly.split.self.paf.gz
purge_dups -2 -T cutoffs -c PB.base.cov assembly.split.self.paf.gz > dups.bed 2> purge_dups.log
get_seqs dups.bed assembly.fasta

Step 3: Generate a Database of Repetitive Elements AND Polishing with NextPolish2

NextPolish2 is used for genome assembly polishing to improve assembly quality using both long reads (ONT) and short reads (Illumina).

Generate a Database of Repetitive Elements: This step helps in optimizing the mapping of reads, especially in repetitive regions of the genome:

# Count k-mers and create a database
meryl count k=15 output merylDB "${ASSEMBLY_PATH}"
# Identify and export repetitive elements based on k-mer frequency
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

Polishing with NextPolish2: Use the repetitive elements information during read mapping for polishing.

# Example command to run NextPolish2, incorporating the handling of repetitive elements
nextPolish2 -r -t 32 "${ASSEMBLY_PREFIX}_map.sort.bam" "${ASSEMBLY_PATH}" k21.yak k31.yak > "${POLISH_DIR}/${ASSEMBLY_PREFIX}_polished.fa"

Identifying and managing repetitive elements before polishing can significantly reduce the chances of misassembly or errors in regions with high sequence similarity. By informing the polishing process about these repetitive elements, NextPolish2 can more accurately use the read data for correcting the assembly, leading to a higher quality final genome sequence. "In therory"

Step 4: Scaffolding with RagTag

ragtag.py scaffold reference.fasta assembly.fasta -r -m 50000 -g 1 -o output_directory -t 32

Step 5: Haplotype Resolution with HapDup

# Align reads to the assembly using optimized parameters for repetitive regions
winnowmap -t threads -W repetitive_k15.txt -ax map-ont assembly.fasta canucorrectedReads.fastq.gz > output.sam

# Convert SAM to sorted and indexed BAM
samtools sort -@ threads -o mapping.bam output.sam
samtools index mapping.bam

# Run HapDup for haplotype resolution
hapdup --assembly assembly.fasta --bam mapping.bam --out-dir hapdup_output -t threads --rtype ont

Step 6: Haplotype Assembly Evaluation with Merqury

The evaluation process involved the following steps:

Generating K-mer Databases:

K-mer databases are generated for both the sequencing reads (ONT and PE) and the haplotyped assemblies using Meryl. For Sequencing Reads (ONT and PE):

# Generate k-mer databases for Illumina (PE) reads for isolates i8733 and i8735
meryl k=18 count *8735*.fastq.gz output i8735_pe.meryl threads=32 

# Repeat the process for ONT corrected reads
meryl k=18 count 8735.correctedReads.fasta.gz output i8735_ONT.meryl threads=32

For Haplotyped Assemblies: Use similar commands to generate k-mer databases for each of the haplotyped assemblies.

meryl k=18 count output hapdup_phased_1._1merylDBs hapdup_phased_1.fasta
meryl k=18 count output hapdup_phased_2.merylDBs hapdup_phased_2.fasta

meryl k=18 count output hapdup_dual_1merylDBs hapdup_dual_1.fasta
meryl k=18 count output hapdup_dual_2merylDBs hapdup_dual_2.fasta

Merqury Analysis:

Merqury performs quality evaluation by comparing the k-mer composition of the assemblies against the k-mer composition derived from sequencing reads. This comparison helps identify discrepancies and assess the quality and completeness of the assemblies.

  1. Generate unique haplotype k-mer databases (hap1_unique.meryl and hap2_unique.meryl) for each isolate and assembly type using the meryl difference command.
  2. Create intersect databases with ONT and PE reads (hap1_ont_intersect.meryl, hap1_pe_intersect.meryl, hap2_ont_intersect.meryl, hap2_pe_intersect.meryl) using the meryl intersect command.
  3. Combine the ONT and PE intersects for each haplotype (hap1_combined.meryl and hap2_combined.meryl) using the meryl union-sum command.

finally run merqury:

merqury.sh ${isolate}_pe.meryl ${base_name}_hap1.hapmers.meryl ${base_name}_hap2.hapmers.meryl $hap1_assembly $hap2_assembly ${base_name}_hap1_vs_hap2

Thank you for your time and consideration in this complex yet fascinating endeavor towards understanding haplotype-resolved genome assemblies.

Best,

Camilo

mikolmogorov commented 3 months ago

Hi Camilo,

Negative phased blocks are definitely unexpected - do you see those in a file in the Margin output dir that ends with .phaseset.bed?

cahuparo commented 3 months ago

Hi Misha,

See attached the bed file in txt format because the bed extension can't be attached.

MARGIN_PHASED.phaseset.bed

For some of those negative phased blocks:

tig00000368     16316   16316   MissingConcordancy      H1-0_H2-0
tig00000368     268646  268646  MissingConcordancy      H1-0_H2-0
tig00000368     338003  338003  MissingConcordancy      H1-1_H2-1
tig00000368     376817  376821  MissingConcordancy      H1-1_H2-0
tig00000368     402351  376821  ContigEnd       

tig00000368     16316   16316   MissingConcordancy      H1-0_H2-0
tig00000368     268646  268646  MissingConcordancy      H1-0_H2-0
tig00000368     338003  338003  MissingConcordancy      H1-1_H2-1
tig00000368     376817  376821  MissingConcordancy      H1-1_H2-0
tig00000368     402351  376821  ContigEnd       

What does that mean? Could you help me understand what is wrong and if the results can be trusted?

Thanks for your time!

Camilo

mikolmogorov commented 3 months ago

Hi Camilo,

It seems like a potential issue with Margin output - may be some kind of borderline case. But it should not affect the assembly results, as long as the phasing stats look reasonable (e.g. phasing N50 few 100s kb to Mb, phased block length comparable to genome size).