cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0

Reimplement the stitcher #1030

Closed Donaim closed 8 months ago

Donaim commented 1 year ago

The existing stitching implementation has been shown to produce nonsensical results in certain cases. The result of stitching should be a logical sum of its parts, but currently it sometimes is not. The root cause appears to be the reliance on regions of the reference genome rather than on the contigs produced by the assembler. When some regions have low concordance with the reference genome, they are aligned differently, producing conflicting versions of the overlaps between them.

Objectives:

  1. In scenarios where a single contig has been assembled, stitching should return it as the stitched consensus.
  2. When multiple contigs are present, the result of putting them together should not be too surprising.
  3. Other parts of the pipeline should not be significantly affected by this change.

Tasks:

Notes:

This reimplementation may provide opportunities for simplification in the regions alignment code, which is currently very complex.

Donaim commented 1 year ago

The current version of the algorithm (https://github.com/cfe-lab/MiCall/pull/1032/commits/ea58060b47d625ea47fd571e0487c0d5bee6389b) works for simple cases.

Donaim commented 1 year ago

The current version of the algorithm (https://github.com/cfe-lab/MiCall/pull/1032/commits/7a153c05a4d1448798039f291b8065c9145a90e7) works on real-world examples and produces the expected results. The remaining work is finding and fixing individual bugs.

Donaim commented 12 months ago

Currently working on the diagnostics. It turned out to be useful for finding bugs... Reordering goals.

Donaim commented 10 months ago

Part of the diagnostics is the visualizer (diagram maker) that is based on logs. It turned out to be almost as difficult to implement as the stitcher itself.

The first version is implemented in https://github.com/cfe-lab/MiCall/pull/1032/commits/7e84f61c185a66bcbd37bc091a159f922a3d6fd6

Donaim commented 10 months ago

The task list in the issue description has been updated to better reflect the conceptual progress and milestones we've achieved as documented in our commits. These updates stem directly from our commit history and practical work on the stitcher and its diagnostics.

Donaim commented 2 months ago

The introduction of the new stitcher changes the contents and handling of some input/output files.

Below is a breakdown:

Same Content

The following table lists files that have identical contents in both the old and new versions. The dash symbol (-) indicates that the contents of the old file may differ from the new file, although they might coincide occasionally.

| Old File | New File |
| --- | --- |
| g2p_csv | g2p_csv |
| g2p_summary_csv | g2p_summary_csv |
| remap_counts_csv | - |
| remap_conseq_csv | - |
| unmapped1_fastq | unmapped1_fastq |
| unmapped2_fastq | unmapped2_fastq |
| conseq_ins_csv | conseq_ins_csv |
| failed_csv | failed_csv |
| cascade_csv | - |
| nuc_csv | - |
| amino_csv | - |
| insertions_csv | - |
| conseq_csv | unstitched_conseq_csv |
| conseq_all_csv | - |
| concordance_csv | - |
| concordance_seed_csv | - |
| failed_align_csv | - |
| coverage_scores_csv | - |
| coverage_maps_tar | - |
| aligned_csv | - |
| g2p_aligned_csv | g2p_aligned_csv |
| genome_coverage_csv | - |
| genome_coverage_svg | - |
| genome_concordance_svg | - |
| contigs_csv | unstitched_contigs_csv |
| read_entropy_csv | read_entropy_csv |
| conseq_region_csv | - |
| conseq_stitched_csv | - |
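
One way to check the "same content" claim on a real sample is to hash each pair of files that the table above lists with a concrete new name. The following is only a sketch: `SAME_CONTENT` is copied from the table rows that are not marked with a dash, while `file_digest`, `find_mismatches`, and the idea of passing in dictionaries from output name to file path are hypothetical helpers for illustration, not part of MiCall.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

# Old output name -> new output name, taken from the rows of the table
# above that name a new file (rows marked "-" are left out).
SAME_CONTENT: Dict[str, str] = {
    "g2p_csv": "g2p_csv",
    "g2p_summary_csv": "g2p_summary_csv",
    "unmapped1_fastq": "unmapped1_fastq",
    "unmapped2_fastq": "unmapped2_fastq",
    "conseq_ins_csv": "conseq_ins_csv",
    "failed_csv": "failed_csv",
    "conseq_csv": "unstitched_conseq_csv",
    "g2p_aligned_csv": "g2p_aligned_csv",
    "contigs_csv": "unstitched_contigs_csv",
    "read_entropy_csv": "read_entropy_csv",
}


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_mismatches(old_files: Dict[str, Path],
                    new_files: Dict[str, Path]) -> List[str]:
    """List old output names whose counterpart is missing or differs.

    Both arguments are hypothetical: they map output names to the paths
    where an old run and a new run wrote those outputs.
    """
    mismatches = []
    for old_name, new_name in SAME_CONTENT.items():
        old_path = old_files.get(old_name)
        new_path = new_files.get(new_name)
        if old_path is None or new_path is None:
            mismatches.append(old_name)  # one side of the pair is absent
        elif file_digest(old_path) != file_digest(new_path):
            mismatches.append(old_name)  # contents differ
    return mismatches
```

An empty list from `find_mismatches` means every listed pair matched byte for byte on that sample.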

Same Role

The following table lists files that serve the same purpose in the pipeline across most use cases and within the proviral pipeline specifically:

| Old File | Most Use Cases | Proviral Pipeline |
| --- | --- | --- |
| g2p_csv | g2p_csv | |
| g2p_summary_csv | g2p_summary_csv | |
| remap_counts_csv | remap_counts_csv | |
| remap_conseq_csv | remap_conseq_csv | |
| unmapped1_fastq | unmapped1_fastq | |
| unmapped2_fastq | unmapped2_fastq | |
| conseq_ins_csv | conseq_ins_csv | |
| failed_csv | failed_csv | |
| cascade_csv | cascade_csv | cascade_csv |
| nuc_csv | nuc_csv | |
| amino_csv | amino_csv | |
| insertions_csv | insertions_csv | |
| conseq_csv | conseq_csv | unstitched_conseq_csv |
| conseq_all_csv | conseq_all_csv | |
| concordance_csv | concordance_csv | |
| concordance_seed_csv | concordance_seed_csv | |
| failed_align_csv | failed_align_csv | |
| coverage_scores_csv | coverage_scores_csv | |
| coverage_maps_tar | coverage_maps_tar | |
| aligned_csv | aligned_csv | |
| g2p_aligned_csv | g2p_aligned_csv | |
| genome_coverage_csv | genome_coverage_csv | |
| genome_coverage_svg | genome_coverage_svg | |
| genome_concordance_svg | genome_concordance_svg | |
| contigs_csv | contigs_csv | unstitched_contigs_csv |
| read_entropy_csv | read_entropy_csv | |
| conseq_region_csv | conseq_region_csv | |
| conseq_stitched_csv | conseq_csv | |
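
For downstream scripts that still refer to the old names, the role mapping above could be captured in a small lookup. This is a sketch under two assumptions: rows that list only one new name are read as filling the "most use cases" column, and only the three rows with an explicit proviral entry are treated as proviral inputs. The names `MOST_USE_CASES_RENAMES`, `PROVIRAL_PIPELINE`, and `new_name_for` are hypothetical and do not come from MiCall.

```python
from typing import Dict, Optional

# In most use cases every old name keeps its role under the same name,
# except the old stitched consensus, whose role is now played by conseq_csv.
MOST_USE_CASES_RENAMES: Dict[str, str] = {
    "conseq_stitched_csv": "conseq_csv",
}

# Rows of the table that give an explicit proviral pipeline entry.
PROVIRAL_PIPELINE: Dict[str, str] = {
    "cascade_csv": "cascade_csv",
    "conseq_csv": "unstitched_conseq_csv",
    "contigs_csv": "unstitched_contigs_csv",
}


def new_name_for(old_name: str, proviral: bool = False) -> Optional[str]:
    """Return the new file that plays the role of an old file.

    For the proviral pipeline, None means the table gives no entry.
    For other use cases, names not listed as renamed are returned as-is.
    """
    if proviral:
        return PROVIRAL_PIPELINE.get(old_name)
    return MOST_USE_CASES_RENAMES.get(old_name, old_name)
```

For example, `new_name_for("conseq_csv", proviral=True)` gives `"unstitched_conseq_csv"`, while `new_name_for("conseq_csv")` gives `"conseq_csv"`.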