frameshift_deletion_checks, simBench & Haplotype callers

Dear V-pipe authors,

Thank you very much for your pipeline. I attempt to use V-pipe to assess the intra-host viral diversity in plant samples. However, I encountered the following issues:

When running the Quality assessment module, (setting it to true in the config.yaml file), I get an error related to the frameshift_deletion_checks script. I have attached the relevant error log below:

Error in rule frameshift_deletions_checks: 

    jobid: 9 

    output: results_rev/capsicum/41239130/references/frameshift_deletions_check.tsv 

    log: results_rev/capsicum/41239130/references/frameshift_deletions_check.out.log, results_rev/capsicum/41239130/references/frameshift_deletions_check.err.log (check log file(s) for error message) 

    conda-env: {...}/vpipe_merged_reads/41239130_Capsicum_annuum_TSWV/.snakemake/conda/c2a6b8c9e98375cd16af9408a0e9b8b2 

    shell: 

        frameshift_deletions_checks -i results_rev/capsicum/41239130/alignments/REF_aln.bam -c results_rev/capsicum/41239130/references/consensus.bcftools.fasta -f references_rev/41239130_tswv_conc_rev.fasta -g  --english=true -o results_rev/capsicum/41239130/references/frameshift_deletions_check.tsv 2> >(tee results_rev/capsicum/41239130/references/frameshift_deletions_check.err.log >&2) 

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!) 

Shutting down, this might take some time. 

Exiting because a job execution failed. Look above for error message 

Complete log: {...}/vpipe_merged_reads/41239130_Capsicum_annuum_TSWV/.snakemake/log/2022-11-01T152833.441917.snakemake.log

The error is not fixed even if I provide the .gff file in the config.yaml with the following two alternative syntaxes:

root: 

   frameshift_deletion_checks :  

        genes_gff : gff_dir/[…].gff3

frameshift_deletion_checks: 

    genes_gff : gff_dir/[…].gff3

I cannot find the benchmarking functionality in the latest documentation. In the config HTML file, there are no explanations and tags on how to generate reads and overall use the benchmarking modules. Downloading the benchmark branch and adjusting to the deprecated documentation was not successful. Are you planning to update the documentation for the newest version of your master branch, so that the benchmarking capabilities of the pipeline can be utilized? Can you please point me to the documentation of the benchmarking?
Considering the Haplotype calling modules, I consistently get segmentation faults when using PredictHaplo (changing between local to global analysis) and (core dumped) errors when using Haploclique. I was trying these options as I want to perform reference-based reconstruction of haplotypes. Can you provide any ideas on the reasons that such errors pop up?

Haploclique error log:

terminate called after throwing an instance of 'std::bad_alloc'
what():  std::bad_alloc

PredictHaplo error log remains empty, but after completing the local haplotype reconstruction, raises a segmentation fault. It may be relevant that my datasets show very high coverage (50.000x -350.000x).

Please let me know if you need any more information to answer these issues. Thank you very much for your consideration.

Dimitris Karapliafis

Dear Dimitris,

Regarding your first problem, frameshift_deletion_checks:

We have only extensively tested frameshifting_deletions_checks with SARS-CoV-2 viruses. It is possible that your work on capsicum's virus has unearthed a new unknown corner case.

Which version of V-pipe are you running? (what is the output of git describe or git rev-parse HEAD?) What is the content of the file workflow/envs/smallgenomeutilities.yaml Today's V-pipe release v2.99.3 has moved to use smallgenomeutilities version v0.3.9 which fixes some crashes in frameshift_deletions_checks when searching for stop codons.

Could you also please provide the two log files mentionned by snakemake?:

results_rev/capsicum/41239130/references/frameshift_deletions_check.out.log,
results_rev/capsicum/41239130/references/frameshift_deletions_check.err.log

This might help pin-point the origin of the crash.

For providing the GFF, the second is the correct syntax (root is the name of the .yaml file it-self, so the section is frameshift_deletion_checks):
```
frameshift_deletion_checks: 
genes_gff : gff_dir/[…].gff3 
```

This GFF file is used to determin the position of the genes on the virus' genome, in order to determine the open reading frames from which to scan all codon for premature "stop" introduced by mutations.

In case that switching to the latest V-pipe (with the latest smallgenomeutilities) doesn't solve your problem, would you accept sending the .bam, the consensus.bcftools.fasta and the reference .fasta?

Regarding you second question, simBench:

Indeed, the benchmarking infrastructure that was introduced with Vpipe 2.0 and documented in the wiki has been deprecated in the recent reengineering of V-pipe.

A new benchmarking infrastructure is currently being developped. (An early preview of this feature has been made available in v2.99.3, more of it is going to be introduced with the next master merge).

the subdirectory resources/auxiliary_workflows/benchmark contain more information.

@LaraFuhrmann could probably help you get more informations about this.

Keep in mind that it is still considered experimental.

Regarding your third question, global haplotype:

Well, the global haplotype reconstruction is still an ongoing research field and far from solved. The software is still buggy.

Regarding HaploClique, from the last time I've experimented with it, on some sample the number of cliques keep increasing in the early iterations and the memory consumption explodes. Tweaking the parameters of haploclique could perhaps help keeping the collection of cliques within a manageable range?

Regarding PredictHaplo, there are still bugs in the code (in the past most came from corner cases (e.g.: empty list) causing memory corruption and crashes). Debugging this requires skills that are in short supply in our small team. I cannot guarantee that I'll find time or that I'll be able to find someone else to debug it, but if you send the input data, it would be possible to try at some point.

Dear @DrYak,

thank you very much for your swift and elaborate response.

Regarding Q1 I updated my V-pipe installation as you suggested. However, by reviewing the error log file, the error was revealed, and it was on my side. TL;DR: frameshift_deletions_checks works fine!

For any user that may face the same issue, I will describe in detail the error:

Following the error log shows that the .gff file is not given as input for the frameshift_deletions_checks: results_rev/capsicum/41239130/references/frameshift_deletions_check.err.log:


usage: frameshift_deletions_checks [-h] -i BAM -c FASTA -f FASTA -g GFF [-0] [-o TSV] [-E] [-e [ENGLISH]]lsman:/lustre
frameshift_deletions_checks: error: argument -g/--genes: expected one argument

My initial config.yaml file was the following:

general:
    aligner: bwa
    haplotype_reconstruction: predicthaplo
    snv_caller: lofreq

input:
    samples_file: samples.tsv
    reference: references_rev/41239130_tswv_conc_rev.fasta
    read_length: 151
    **gff_directory: gffs_rev**

output:
    datadir: results_rev
    snv: true
    local: false
    global: true
    visualization: true
    diversity: false
    QA: true

Specifying the gff directory under the input tab as gff_directory: gffs_rev does not update the gff file to be given as input to the frameshift_deletions_checks but is only used for the visualization module of V-pipe.

My second config.yaml file:

`general:
    aligner: bwa
    haplotype_reconstruction: predicthaplo
    snv_caller: lofreq

input:
    samples_file: samples.tsv
    reference: references_rev/41239130_tswv_conc_rev.fasta
    read_length: 151
    gff_directory: gffs_rev

output:
    datadir: results_rev
    snv: true
    local: false
    global: true
    visualization: false
    diversity: false
    QA: false

frameshift_deletion_checks:                                           --->TYPO: Should be:--->frameshift_deletions_checks
    genes_gff : gffs_rev/41239130_tswv_conc_rev.gff3

Logically, the error log was the same as before, as the typo prevents f_d_c to read the .gff file.

It is also worth noting that I did not find in the documentation that a .gff file is a prerequisite for the QA process. Would be nice if you print an informative error message in such cases. A second aspect I would like to note is that the .gff file has to be specified two different times in the config.yaml, which is a bit confusing. If it does not limit the overall functionality of the pipeline, would that make sense to read the .gff needed for frameshift_deletions_check directly from input -> gff directory?

Regarding Q2:

Thank you very much for the information. I will probably resolve with writing a script for generating the reads myself. It is a great idea to include such benchmarking functionality in your tool!

Regarding Q3:

Yes, my experience is also frustrating with haplotype callers. Because I am not the original owner of the datasets, I will have to ask for permission in order to share the .bam file. I think that Haploclique severely struggles when the coverage is high. One alternative that worked for me in terms of at least consistently delivering the output is CliqueSNV (The outputted haplotypes though do not always make sense biologically).

Again, thank you very much for your time and detailed responses. V-pipe looks very promising!

With kind regards, Dimitris Karapliafis

Thank you for point out these pain points.

Yes, indeed, we should add more sanity checks and refuse to run framesht_deletions_checks if there is no proper GFF provided (it's going to be hard to track indels causing _frameshifting if we don't have a GFF to point where the open reading frames_ are on the genome :sweat_smile:). Though the current situation is very confusing:

there a general parameter to pass a large collection of GFFs, used also for visualisation (allows separate GFF for open reading frames/Genes, for mature product/protease output, and for functional domain)
there's a local option for frameshift_deletions_checks which is separate.

We should find a way to make both more straight forward.

The validation is still lacking. We do rely on schemas to validate the configuration, but the official documentations of snakemake uses a validator that is tolerant of missing/unknown keyword (it will simply not enforce types and defaults on unknown keywords).

(This is something which wasn't a problem with the legacy INI-like format of vpipe.config used back in V-pipe 1.x / 2.x and currently still compatible. As python's configparse doesn't have a type system, type must be manually imported and thus exact structure is enforced).

I think we should consider:

adding some extra validation steps (spot keyword which are unknown, but a small number of edits away from known one.
- e.g.: "_Unkown keyword frameshift_deletion_checks. Did you mean frameshift_deletions_checks ?_"
Consider creating a GUI editor of configuration file, based on the schema.
- This is something that we had discussed at some point with the people behind Sapporo Web

Small update:

Commit #58a9cccca3bf25021a016031f19f0816db748f44 helps a bit the user experience with multiple GFF files.

GFFs are now all centralised in the same section: [general]
If a Genes-specific GFF isn't explicitely provided, V-pipe will try to autoselect one among all the visualisations GFFs (based on names)

cbg-ethz / V-pipe

frameshift_deletion_checks, simBench & Haplotype callers #132