Closed dimitris-karapliafis closed 1 year ago
Dear Dimitris,
Regarding your first problem, frameshift_deletion_checks:
We have only extensively tested frameshifting_deletions_checks with SARS-CoV-2 viruses. It is possible that your work on capsicum's virus has unearthed a new unknown corner case.
Which version of V-pipe are you running? (what is the output of git describe
or git rev-parse HEAD
?)
What is the content of the file workflow/envs/smallgenomeutilities.yaml
Today's V-pipe release v2.99.3 has moved to use smallgenomeutilities version v0.3.9 which fixes some crashes in frameshift_deletions_checks when searching for stop codons.
Could you also please provide the two log files mentionned by snakemake?:
results_rev/capsicum/41239130/references/frameshift_deletions_check.err.log
This might help pin-point the origin of the crash.
For providing the GFF, the second is the correct syntax (root
is the name of the .yaml file it-self, so the section is frameshift_deletion_checks
):
frameshift_deletion_checks:
genes_gff : gff_dir/[…].gff3
This GFF file is used to determin the position of the genes on the virus' genome, in order to determine the open reading frames from which to scan all codon for premature "stop" introduced by mutations.
In case that switching to the latest V-pipe (with the latest smallgenomeutilities) doesn't solve your problem, would you accept sending the .bam, the consensus.bcftools.fasta and the reference .fasta?
Regarding you second question, simBench:
Indeed, the benchmarking infrastructure that was introduced with Vpipe 2.0 and documented in the wiki has been deprecated in the recent reengineering of V-pipe.
A new benchmarking infrastructure is currently being developped. (An early preview of this feature has been made available in v2.99.3, more of it is going to be introduced with the next master merge).
the subdirectory resources/auxiliary_workflows/benchmark contain more information.
@LaraFuhrmann could probably help you get more informations about this.
Keep in mind that it is still considered experimental.
Regarding your third question, global haplotype:
Well, the global haplotype reconstruction is still an ongoing research field and far from solved. The software is still buggy.
Regarding HaploClique, from the last time I've experimented with it, on some sample the number of cliques keep increasing in the early iterations and the memory consumption explodes. Tweaking the parameters of haploclique could perhaps help keeping the collection of cliques within a manageable range?
Regarding PredictHaplo, there are still bugs in the code (in the past most came from corner cases (e.g.: empty list) causing memory corruption and crashes). Debugging this requires skills that are in short supply in our small team. I cannot guarantee that I'll find time or that I'll be able to find someone else to debug it, but if you send the input data, it would be possible to try at some point.
Dear @DrYak,
thank you very much for your swift and elaborate response.
Regarding Q1 I updated my V-pipe installation as you suggested. However, by reviewing the error log file, the error was revealed, and it was on my side. TL;DR: frameshift_deletions_checks works fine!
For any user that may face the same issue, I will describe in detail the error:
Following the error log shows that the .gff file is not given as input for the frameshift_deletions_checks: results_rev/capsicum/41239130/references/frameshift_deletions_check.err.log:
usage: frameshift_deletions_checks [-h] -i BAM -c FASTA -f FASTA -g GFF [-0] [-o TSV] [-E] [-e [ENGLISH]]lsman:/lustre
frameshift_deletions_checks: error: argument -g/--genes: expected one argument
My initial config.yaml file was the following:
general:
aligner: bwa
haplotype_reconstruction: predicthaplo
snv_caller: lofreq
input:
samples_file: samples.tsv
reference: references_rev/41239130_tswv_conc_rev.fasta
read_length: 151
**gff_directory: gffs_rev**
output:
datadir: results_rev
snv: true
local: false
global: true
visualization: true
diversity: false
QA: true
Specifying the gff directory under the input tab as gff_directory: gffs_rev does not update the gff file to be given as input to the frameshift_deletions_checks but is only used for the visualization module of V-pipe.
My second config.yaml file:
`general:
aligner: bwa
haplotype_reconstruction: predicthaplo
snv_caller: lofreq
input:
samples_file: samples.tsv
reference: references_rev/41239130_tswv_conc_rev.fasta
read_length: 151
gff_directory: gffs_rev
output:
datadir: results_rev
snv: true
local: false
global: true
visualization: false
diversity: false
QA: false
frameshift_deletion_checks: --->TYPO: Should be:--->frameshift_deletions_checks
genes_gff : gffs_rev/41239130_tswv_conc_rev.gff3
Logically, the error log was the same as before, as the typo prevents f_d_c to read the .gff file.
It is also worth noting that I did not find in the documentation that a .gff file is a prerequisite for the QA process. Would be nice if you print an informative error message in such cases. A second aspect I would like to note is that the .gff file has to be specified two different times in the config.yaml, which is a bit confusing. If it does not limit the overall functionality of the pipeline, would that make sense to read the .gff needed for frameshift_deletions_check directly from input -> gff directory?
Regarding Q2:
Thank you very much for the information. I will probably resolve with writing a script for generating the reads myself. It is a great idea to include such benchmarking functionality in your tool!
Regarding Q3:
Yes, my experience is also frustrating with haplotype callers. Because I am not the original owner of the datasets, I will have to ask for permission in order to share the .bam file. I think that Haploclique severely struggles when the coverage is high. One alternative that worked for me in terms of at least consistently delivering the output is CliqueSNV (The outputted haplotypes though do not always make sense biologically).
Again, thank you very much for your time and detailed responses. V-pipe looks very promising!
With kind regards, Dimitris Karapliafis
Thank you for point out these pain points.
Yes, indeed, we should add more sanity checks and refuse to run framesht_deletions_checks
if there is no proper GFF provided (it's going to be hard to track indels causing _frameshifting if we don't have a GFF to point where the open reading frames_ are on the genome :sweat_smile:).
Though the current situation is very confusing:
We should find a way to make both more straight forward.
The validation is still lacking. We do rely on schemas to validate the configuration, but the official documentations of snakemake uses a validator that is tolerant of missing/unknown keyword (it will simply not enforce types and defaults on unknown keywords).
(This is something which wasn't a problem with the legacy INI-like format of vpipe.config
used back in V-pipe 1.x / 2.x and currently still compatible. As python's configparse doesn't have a type system, type must be manually imported and thus exact structure is enforced).
I think we should consider:
frameshift_deletion_checks
. Did you mean frameshift_deletion
s
_checks
?_"Small update:
Commit #58a9cccca3bf25021a016031f19f0816db748f44 helps a bit the user experience with multiple GFF files.
[general]
Dear V-pipe authors,
Thank you very much for your pipeline. I attempt to use V-pipe to assess the intra-host viral diversity in plant samples. However, I encountered the following issues:
The error is not fixed even if I provide the .gff file in the config.yaml with the following two alternative syntaxes:
or
I cannot find the benchmarking functionality in the latest documentation. In the config HTML file, there are no explanations and tags on how to generate reads and overall use the benchmarking modules. Downloading the benchmark branch and adjusting to the deprecated documentation was not successful. Are you planning to update the documentation for the newest version of your master branch, so that the benchmarking capabilities of the pipeline can be utilized? Can you please point me to the documentation of the benchmarking?
Considering the Haplotype calling modules, I consistently get segmentation faults when using PredictHaplo (changing between local to global analysis) and (core dumped) errors when using Haploclique. I was trying these options as I want to perform reference-based reconstruction of haplotypes. Can you provide any ideas on the reasons that such errors pop up?
Haploclique error log:
PredictHaplo error log remains empty, but after completing the local haplotype reconstruction, raises a segmentation fault. It may be relevant that my datasets show very high coverage (50.000x -350.000x).
Please let me know if you need any more information to answer these issues. Thank you very much for your consideration.
Dimitris Karapliafis