BrentLab/callvariants

Introduction

BrentLab/callvariants is a bioinformatics pipeline for general variant calling. It runs Freebayes, TIDDIT and CNVpytor for SNP/INDEL and structural variant calling.

This workflow has been developed with the following specific functionality in mind:

Checking the genotype of KN99alpha samples
- This is performed by providing additional sequences to be appended to the genome prior to alignment on a per-sample basis
Processing C. neoformans samples for bulk segregant analysis
- The Freebayes step can optionally be used to jointly call variants on groups of samples that are identified in the input samplesheet

However, the pipeline is not limited to these applications.

The pipeline, overall, runs the following processes:

Prepare the Genome
- Concatenate additional sequences provided in the input samplesheet, if there are any
- Create indices
  - samtools faidx
  - bwamem2 index
  - bwa index -- this is for TIDDIT
- Create sequence maps
  - Build and create intervals (both of these steps are from sarek)
  - GATK CreateSequenceDictionary
Read QC
- FastQC
Align reads
- bwamem2
- picard MarkDuplicates
- picard AddOrReplaceReadGroups
- samtools index, sort, stats, flagstat, idxstats
Call Variants
- Freebayes
- TIDDIT
- CNVpytor
- snpEff
- vcftools for filtering
- bcftools stats
Collect and present QC
- MultiQC
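
As a rough illustration of the first step in the list above (concatenating additional sequences onto the genome), the idea could be sketched in shell as follows. This is not the pipeline's own code: the function name `concat_fasta` and the file names in the usage comment are hypothetical, and the only assumption about the inputs is that they are plain FASTA files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a genome FASTA plus any additional FASTA files
# into one combined file. A sketch of the idea only, NOT the pipeline's code.
concat_fasta() {
    local out="$1"; shift
    : > "$out"                      # create or truncate the output file
    local f
    for f in "$@"; do
        cat "$f" >> "$out"
        # If the file does not end with a newline, add one so that the next
        # file's header line is not glued onto the last sequence line.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n' >> "$out"
        fi
    done
}

# Usage (file names are placeholders):
#   concat_fasta combined.fasta genome.fasta additional_sequence.fasta
#   samtools faidx combined.fasta
```

The newline check matters because a missing trailing newline in one file would otherwise merge the next record's `>` header into the preceding sequence.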

Usage

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. If you are running this test on WUSTL HTCF or RIS, use one of the built-in profiles, either htcf or ris. If you are running the test on a different host, consider including one of the dependency manager profiles, e.g. singularity or docker.

A test run for ris, for example, would look like this:

nextflow run BrentLab/callvariants -r main -profile ris,test

You will need to submit this appropriately for your system, but no other input is necessary to run the tests; all input is taken care of by the test profile.
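
To show how the profile choice changes the test command, here is a small sketch that prints (but does not run) the command for a given profile. The function name `test_run_command` is made up for illustration, and the list of accepted profiles is limited to the ones named in this README; other profiles may exist and can be added to the `case` pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper that prints (does not run) the test command for a
# given profile, so it can be pasted into a job submission script.
# Only the profiles mentioned in this README are accepted here.
test_run_command() {
    local profile="$1"
    case "$profile" in
        htcf|ris|singularity|docker)
            echo "nextflow run BrentLab/callvariants -r main -profile ${profile},test"
            ;;
        *)
            echo "unknown profile: ${profile}" >&2
            return 1
            ;;
    esac
}

test_run_command ris
# prints: nextflow run BrentLab/callvariants -r main -profile ris,test
```

How the printed command is submitted (for example, through your cluster's scheduler) depends on the host and is not covered by this sketch.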

For detailed instructions on running your own data, please see the usage documentation

Output

For a description of the output directory, please see the output documentation

Common problems

A pernicious source of errors is the character that marks a new line in a file. We never see these characters, but they of course exist; otherwise the computer would not know where each new line begins.

Unfortunately, operating systems do not all use the same line-ending characters: Linux and modern macOS end lines with a line feed (LF), Windows uses a carriage return followed by a line feed (CRLF), and some older Mac software uses a carriage return (CR) alone. If you are using a cluster to process your data, you need to make sure that your files are Linux compliant. All of the files you download from NCBI or FungiDB, for instance, will be, as will the fastq files from sequencing centers. But if you create an additional fasta file, you need to make sure that it has not been altered by Mac or Windows software. I would expect SnapGene to output a Linux-compliant fasta file by default, but if you were to open the file in something like Word, it would probably convert the characters.

One tool you can use to ensure that your input files are linux compliant is dos2unix. Using dos2unix looks like this:

dos2unix </path/to/file.<ext>>

and the file will be changed in-place.
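
Before converting, it can help to confirm whether a file actually contains Windows-style line endings. The sketch below is one way to check with standard shell tools. The function name `count_crlf_lines` and the temporary example file are illustrative assumptions, not part of the pipeline.

```shell
#!/usr/bin/env bash
# Count the lines in a file that end with a carriage return (CR, "\r"),
# the signature of Windows-style (CRLF) line endings.
# `count_crlf_lines` is a hypothetical helper name used for illustration.
count_crlf_lines() {
    local file="$1"
    # $'\r$' is bash ANSI-C quoting: a literal CR followed by the
    # end-of-line anchor. grep -c prints the number of matching lines and
    # exits non-zero when there are none, hence the `|| true`.
    grep -c $'\r$' "$file" || true
}

# Example: build a small FASTA file with CRLF endings, then check it.
tmp="$(mktemp)"
printf '>extra_seq\r\nACGT\r\n' > "$tmp"
count_crlf_lines "$tmp"        # prints 2

# Alternative to dos2unix when it is not installed: strip every CR with tr.
tr -d '\r' < "$tmp" > "$tmp.unix"
count_crlf_lines "$tmp.unix"   # prints 0
rm -f "$tmp" "$tmp.unix"
```

A count of 0 means the file has no CRLF endings and needs no conversion. Note that `tr -d '\r'` removes every carriage return, so it is only appropriate for CRLF files, not for files that use CR alone as the line ending.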

Credits

BrentLab/callvariants was originally written by Chase Mateusiak. It is based on the BSA processing steps of Daniel Agustinhno.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use BrentLab/callvariants for your analysis, please cite it using the following DOI: 10.5281/zenodo.XXXXXX

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.