Closed Donaim closed 8 months ago
The current version of the algorithm (https://github.com/cfe-lab/MiCall/pull/1032/commits/ea58060b47d625ea47fd571e0487c0d5bee6389b) works for simple cases.
The current version of the algorithm (https://github.com/cfe-lab/MiCall/pull/1032/commits/7a153c05a4d1448798039f291b8065c9145a90e7) works on real-world examples and produces the expected results. The remaining work is finding and fixing individual bugs.
Currently working on the diagnostics. They turned out to be useful for finding bugs, so I am reordering the goals.
Part of the diagnostics is the visualizer (diagram maker) that is based on logs. It turned out to be almost as difficult to implement as the stitcher itself.
The first version is implemented in https://github.com/cfe-lab/MiCall/pull/1032/commits/7e84f61c185a66bcbd37bc091a159f922a3d6fd6
The task list in the issue description has been updated to better reflect the conceptual progress and milestones documented in our commits and in the practical work on the stitcher and its diagnostics.
The introduction of the new stitcher changes the contents and handling of some input/output files.
Below is a breakdown:
The following table lists files that have identical contents in both the old and new versions. The dash symbol (`-`) indicates that the contents of the old file may differ from the new file, although they might coincide occasionally.
Old File | New File |
---|---|
g2p_csv | g2p_csv |
g2p_summary_csv | g2p_summary_csv |
remap_counts_csv | - |
remap_conseq_csv | - |
unmapped1_fastq | unmapped1_fastq |
unmapped2_fastq | unmapped2_fastq |
conseq_ins_csv | conseq_ins_csv |
failed_csv | failed_csv |
cascade_csv | - |
nuc_csv | - |
amino_csv | - |
insertions_csv | - |
conseq_csv | unstitched_conseq_csv |
conseq_all_csv | - |
concordance_csv | - |
concordance_seed_csv | - |
failed_align_csv | - |
coverage_scores_csv | - |
coverage_maps_tar | - |
aligned_csv | - |
g2p_aligned_csv | g2p_aligned_csv |
genome_coverage_csv | - |
genome_coverage_svg | - |
genome_concordance_svg | - |
contigs_csv | unstitched_contigs_csv |
read_entropy_csv | read_entropy_csv |
conseq_region_csv | - |
conseq_stitched_csv | - |
The following table lists files that serve the same purpose in the pipeline across most use cases and within the proviral pipeline specifically:
Old File | Most Use Cases | Proviral Pipeline |
---|---|---|
g2p_csv | g2p_csv | |
g2p_summary_csv | g2p_summary_csv | |
remap_counts_csv | remap_counts_csv | |
remap_conseq_csv | remap_conseq_csv | |
unmapped1_fastq | unmapped1_fastq | |
unmapped2_fastq | unmapped2_fastq | |
conseq_ins_csv | conseq_ins_csv | |
failed_csv | failed_csv | |
cascade_csv | cascade_csv | cascade_csv |
nuc_csv | nuc_csv | |
amino_csv | amino_csv | |
insertions_csv | insertions_csv | |
conseq_csv | conseq_csv | unstitched_conseq_csv |
conseq_all_csv | conseq_all_csv | |
concordance_csv | concordance_csv | |
concordance_seed_csv | concordance_seed_csv | |
failed_align_csv | failed_align_csv | |
coverage_scores_csv | coverage_scores_csv | |
coverage_maps_tar | coverage_maps_tar | |
aligned_csv | aligned_csv | |
g2p_aligned_csv | g2p_aligned_csv | |
genome_coverage_csv | genome_coverage_csv | |
genome_coverage_svg | genome_coverage_svg | |
genome_concordance_svg | genome_concordance_svg | |
contigs_csv | contigs_csv | unstitched_contigs_csv |
read_entropy_csv | read_entropy_csv | |
conseq_region_csv | conseq_region_csv | |
conseq_stitched_csv | conseq_csv | |
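As a rough illustration of how the second table could be consulted from code, here is a minimal sketch. The function `replacement_for` and both dictionaries are hypothetical names, not part of MiCall. It assumes that a blank cell in the Proviral Pipeline column means the table states no replacement for that file, and that the last row places `conseq_csv` in the Most Use Cases column.

```python
from typing import Optional

# Rows of the table where the new name differs from the old one (most use cases).
# Hypothetical structure; the names themselves are copied from the table.
MOST_USE_CASES = {"conseq_stitched_csv": "conseq_csv"}

# Rows of the table that have an entry in the Proviral Pipeline column.
PROVIRAL = {
    "cascade_csv": "cascade_csv",
    "conseq_csv": "unstitched_conseq_csv",
    "contigs_csv": "unstitched_contigs_csv",
}


def replacement_for(old_name: str, proviral: bool = False) -> Optional[str]:
    """Return the new file that serves the purpose of `old_name`.

    Assumption: for the proviral pipeline, a blank table cell is reported
    as None; for other use cases, unlisted names are returned unchanged.
    """
    if proviral:
        return PROVIRAL.get(old_name)
    return MOST_USE_CASES.get(old_name, old_name)
```

For example, `replacement_for("conseq_csv", proviral=True)` gives `"unstitched_conseq_csv"`, while `replacement_for("conseq_csv")` leaves the name unchanged.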
The existing implementation of stitching has been shown to produce nonsensical results in certain cases. The result of stitching should be a logical summation of its parts, but currently it sometimes is not. The root cause appears to be the reliance on regions of the reference genome rather than on contigs produced by the assembler. When some regions have low concordance with the reference genome, they are aligned differently, which produces conflicting versions of the overlaps between them.
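The "logical summation of its parts" property can be made concrete with a toy example. The sketch below is not MiCall's stitcher: the `Contig` class, the `stitch` function, and the shared 0-based coordinate system are all assumptions made for illustration. It joins two contigs only if they agree wherever they overlap, and reports a conflict otherwise, which is the kind of disagreement described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    """Hypothetical contig placed in a shared 0-based coordinate system."""
    name: str
    start: int  # position of the first base
    seq: str

    @property
    def end(self) -> int:
        return self.start + len(self.seq)


def stitch(left: Contig, right: Contig) -> str:
    """Join two contigs, requiring identical bases in the overlap."""
    if right.start < left.start:
        left, right = right, left
    if right.start > left.end:
        raise ValueError("contigs do not touch; nothing to stitch")

    # Number of positions covered by both contigs.
    overlap = min(left.end, right.end) - right.start
    offset = right.start - left.start
    left_part = left.seq[offset:offset + overlap]
    right_part = right.seq[:overlap]
    if left_part != right_part:
        raise ValueError(f"conflicting overlap: {left_part!r} vs {right_part!r}")

    # The result is the left contig plus whatever the right one adds.
    return left.seq + right.seq[overlap:]
```

With `Contig("a", 0, "ACGTAC")` and `Contig("b", 4, "ACGG")` the overlap `AC` agrees and the contigs are joined; if the second contig were `TTGG` at the same position, `stitch` would raise instead of silently picking one version.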
Objectives:
Tasks:
- Treat cross-alignments as anomalies.
- `test_correct_processing_of_two_overlapping_and_one_separate_contig_2.svg`
- `.strand` parameter is checked every time an arrow is drawn in the visualizer.
- Make the old `contigs.csv` file still produce the same output as before the stitcher by introducing a new output file `contigs_stitched.csv` to be used in downstream analyses.
- `contigs_unstitched.csv` and `remap_unstitched_conseq.csv`
Notes:
This reimplementation may provide opportunities to simplify the region alignment code, which is currently very complex.