CooVar version 0.07 (Nov 20th, 2012)

(0) INSTALL (1) MOTIVATION (2) WHAT IT DOES (3) HOW IT DOES IT (4) HOW TO RUN IT (5) INPUT (6) OUTPUT (7) TIPS FOR RUNNING COOVAR FASTER (8) HOW TO CITE COOVAR

(0) INSTALL

To install CooVar, open a Linux terminal and run the following commands:

tar xvf CooVar-0.07.tar.gz cd coovar-0.07 chmod +x scripts/* coovar.pl

The program itself is a Perl script and requires the following Perl modules to be installed on your system:

use Cwd; use Getopt::Long; use POSIX; use File::Basename; use List::Util; use Bio::DB::Fasta; use Bio::Seq; use Bio::SeqUtils; use Bio::SeqIO; use Set::IntervalTree; use Set::IntSpan;

(1) MOTIVATION

CooVar was conceived as an alternative to previous tools that are tightly linked to databases (e.g. Ensembl, UCSC) and for which there is no overall assessment of the impact of co-occurring GVs on protein-coding transcripts. The former makes it hard for people working with newly sequenced genomes, genomes that are not available in such databases, or who just have an alternative annotation of transcript models for which the impact of different GVs needs to be asssessed. The latter is misleading at the moment of evaluating the impact of multiple co-occurring GVs on protein-coding transcripts. For example, a 1 bp insertion and a 1 bp deletion may occur on the same transcript, and, if assessed independently, they cause disruption of the ORF, but together cancel each other, restoring the ORF. In the same way, 2 substitutions on a same codon may result in a different amino acid change than each substitution independently.

(2) WHAT IT DOES

CooVar takes as input:

(i) transcript models (GFF3 or GTF format), (ii) the genomic DNA where the models reside and (FASTA format) (iii) Single Nucleotide Variants (SNV), Insertions (INS) and Deletions (DEL) (collectively called Genomic Variations, or GVs)

and returns, for each transcript,

(i) a summary of all GVs impacting a protein-coding transcript, with a categorization of the Open Reading Frame (ORF) in: INTACT, PRESERVED, DISRUPTED or FULLY_DELETED. (ii) the cDNA before (called the reference) and after applying the GVs (called the variant), (iii) the reference and variant protein sequence, (iv) an alignment of the reference and variant cDNA at the exon level, (v) a number of basic statistics for each type of GV.

(3) HOW IT DOES IT

CooVar pursues the following steps:

(4.0) Revises the GV file in order to make sure that they follow the specified format (VCF or TAB). SNV and INS involving non-ACTG are discarded. Lines not following the specified format are discarded. (revise-gv-files.pl) (4.1) Extracts the cDNA for each transcript from the genomic DNA, using the transcript model coordinates, (extract-cdna.pl) (4.2) Applies SNPs to the cDNA (apply-snps.pl) (4.3) Categorizes SNPs according to its impact (or not) on protein-coding transcripts (categorize-snps.pl). (4.4) Computes stats on SNPs (get-stats-snps.pl) (4.5) Applies Insertions and Deletions to the cDNA sequences, and computes stats of length distribution (apply-insertions-deletions.pl) (4.6) If the --circos flag is used, it generates the input files for a heatmap visualization of each type of GV and for the exons with the Circos tool (Krzywinski et al., 2009). The circos data for GVs represents the number of base pairs inserted or deleted per interval length. The interval length
is automatically inferred from the genome size (interval length = genome size / 1000).

(4) HOW TO RUN IT

./coovar.pl -e EXONS_GFF -r REFERENCE_FASTA (-t GVS_TAB_FORMAT | -v GVS_VCF_FORMAT) [-o OUTPUT_DIRECTORY] [--circos] [--feature_source] [--feature_type]

Parameters in brackets [] are optional.

(5) INPUT

CooVar takes the following input:

(5.1) A gff3/gtf file specifying coding exons in form of CDS (coding-sequence) entries

Parameter: -e Each coding exon represents a part of a protein-coding transcript and excludes UTRs.

Example 1: transcript 2L52.1 in C. elegans in gff3 format:

II Coding_transcript CDS 1867 1911 . + 0 ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 II Coding_transcript CDS 2506 2694 . + 0 ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 II Coding_transcript CDS 2738 2888 . + 0 ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 II Coding_transcript CDS 2931 3036 . + 2 ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 II Coding_transcript CDS 3406 3552 . + 1 ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 II Coding_transcript CDS 3802 3984 . + 1 ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1 II Coding_transcript CDS 4201 4663 . + 1 ID=CDS:2L52.1;Note=Zinc finger%2C C2H2;gene=2L52.1

Example 2: transcript ENST00000371026 in Human, in gtf format:

chr1 hg18_ensGene CDS 67052404 67052451 0.000000 - 0 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67060632 67060788 0.000000 - 1 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67065091 67065317 0.000000 - 0 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67066083 67066181 0.000000 - 0 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67071856 67071977 0.000000 - 2 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67072262 67072419 0.000000 - 1 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67073897 67074048 0.000000 - 0 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67075981 67076067 0.000000 - 0 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67078740 67078942 0.000000 - 2 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67085755 67085949 0.000000 - 2 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67100418 67100573 0.000000 - 2 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67109641 67109780 0.000000 - 1 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67113052 67113208 0.000000 - 2 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67129425 67129537 0.000000 - 1 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67131500 67131684 0.000000 - 0 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67143472 67143646 0.000000 - 1 gene_id "ENST00000371026"; transcript_id "ENST00000371026"; chr1 hg18_ensGene CDS 67162933 67163102 0.000000 - 0 gene_id "ENST00000371026"; transcript_id "ENST00000371026";

IMPORTANT: Note that by default CooVar ignores all lines in the exon file that are not of type 'CDS'. This behavior can be changed using the --feature_source and --feature_type program parameter. Also, make sure that different coding exons belonging to the same transcript have (a) the same 'transcript_id' (in case of a GTF input file) or (b) either the same 'ID' (GFF2 input) or the same 'Parent' ID (GFF3 input).

If the gene attribute is specified ('gene=' in gff3 or 'gene_id' in gtf) then this gene ID is carried over to the transcripts.gff3 output file. This is useful if one wants to know the identity of genes effected by GVs in addition to affected transcripts.

As you can see, the relevant columns are

column1: the chromosome/contig column4: start coordinate column5: end coordinate, column7: strand, column9: transcript ID

Make sure columns are separated by TAB characters (not spaces). The transcript ID in column9 can be preceded by either 'Parent=' , 'ID=' or 'transcript_id '.

(5.2) A reference genomic DNA file in FASTA format

Parameter: -r Self explanatory. Avoid incompatibilities between the transcript coordinates and the chromosomes (e.g. transcripts with coordinates out of boundaries, or including transcripts from chromosomes that are not in the fasta file). Entries in the FASTA file may have any line length up to 65,536 characters, and different line lengths are allowed in the same file. However, within a sequence entry, all lines must be the same length except for the last (see Bio::DB::Fasta module).

(5.3) A list of SNV, INS and DEL in either VCF4.0 format or TAB format

Parameter: -v (if using VCF4.0 format) or -t (if using TAB format) the Variant Call Format (VCF) was proposed by the 1,000 genomes project consortium as a standard to report variants. Specifications can be found in:

http://www.1000genomes.org/wiki/Analysis/vcf4.0

Note that if multiple alternative alleles are specified for a variant in the VCF file, CooVar will consider only the first of these alternative alleles. Please make sure that this is the allele that you want to annotate. Multiple alternative alleles per variant will be supported in a future release of CooVar.

The VCF is used by a number of programs that detect SNPs and small indels, including the BCF tools that comes with SAMTools for calling SNPs and small indels

http://samtools.sourceforge.net/mpileup.shtml

and Dindel

ftp://ftp.sanger.ac.uk/pub4/resources/software/dindel/manual-1.01.pdf

The TAB format is an alternative for the cases where the user doesn't have a VCF file. It is a tab-separated file meant to be easy to generate.

An example of a SNV line in the TAB input:

SNP chr1 10469 C G

column1: SNP indicates it is a single basepair difference between the reference and the variant. column2: chromosome/contig column3: coordinate within chromosome/contig column4: reference nucleotide at that coordinate column5: variant nucleotide at that coordinate

An example of an INS line in the TAB input:

INS chr1 719490 719491 ATAAT

column1: INS indicates it is an insertion in the variant compared to the reference. column2: chromosome/contig column3: start coordinate within chromosome/contig column4: end coordinate within chromosome/contig column5: insertion sequence

Note1: even though insertions are usually annotated to occur between 2 adjacent basepairs X and X+1, this is not required by CooVar. If the start and end coordinate are separated by 1 or more basepairs, the region in between is regarded as a deletion associated with the insertion. For example, if the insertion is annotated to occur betwen basepairs 10 and 20, then basepairs 11 to 19 will be deleted.

An example of a DEL line in the TAB input:

DEL chr1 768120 768122

column1: DEL indicates it is a deletion in the variant compared to the reference. column2: chromosome/contig column3: start coordinate within chromosome/contig column4: end coordinate within chromosome/contig

All base pairs from start to end (including start AND end) are deleted in the variant.

Note2: for INS and DEL the start coordinate HAS to be lesser or equal than the end coordinate, otherwise it will be discarded.

(5.4) Specifying the output directory

Parameter: -o Optional parameter that specifies the output directory for CooVar output files. If the specified output directory does not exist, CooVar creates it. If omitted, output files are written to a subdirectory in the current directory named "output_YYYYMMDD_hhmmss", whereas YYYY, MM, DD, hh, mm, ss denote year, month, day, hour, minute, and second of the current date/time.

(5.5) Specifying feature source and feature type

Parameter: --feature_source (optional; no default value) --feature_type (optional; default: 'CDS')

If --feature_source and --feature_type are not specified, CooVar by default retains only features of type 'CDS' from the input GFF/GTF file, assuming that these features represent coding exons of protein-coding transcripts.

If --feature_source is specified, CooVar considers only features of that source (= 2nd column of input GFF/GTF).

If --feature_type is specified, Coovar considers only features of that type (= 3rd column of input GFF/GTF). The default is 'CDS'.

These two parameters can also be used in combination.

(6) OUTPUT

Assuming that your files are called:

REFERENCE: the genomic DNA GVs: the list of Genomic Variations in either format TRANSCRIPTS: gff3 file with the coding exon coordinates

The output of CooVar ("CV") is structured as follows: . |-- transcripts.gff3: transcripts, GVs impacting them, status of the ORF. See description in section 6.1 below |-- categorized-gvs.gvf: each GV and its impact (or not, if silent) on all transcripts. See description in section 6.1 below |-- deletions | |-- distribution_deletions.out: length distribution of all deletions and of those deletions impacting exons. | |-- excluded_gvs.del: deletions excluded from the input for not accomplishing the specified format. | |-- kept_gvs.del: deletions kept from the input. The ones to be applied to the transcripts. | -- kept_gvs.del.circos: Circos frequency file for deletions. Designed for heatmap visualization. |-- insertions | |-- distribution_insertions.out: length distribution of all insertions and of those insertions impacting exons. | |-- excluded_gvs.ins: insertions excluded from the input for not accomplishing the specified format. | |-- kept_gvs.ins: deletions kept from the input. The ones to be applied to the transcripts. |-- kept_gvs.ins.circos: Circos frequency file for insertions. Designed for heatmap visualization. |-- snps | |-- ambiguous_snps.list: SNPs that fall into 2 or more categories by impacting transcripts differently | |-- categorized_snps.stat: overall and chromosomal frequencies of SNPs per category. | |-- codon_bias.stat: frequency of the location of the SNP within the codon (1,2, or 3) and its impact as syn/non-syn | |-- excluded_gvs.snp: SNPs excluded from the input for not accomplishing the specified format. | |-- kept_gvs.list.snp: SNPs kept from the input. The ones to be applied to the transcripts. | -- kept_gvs.list.snp.circos: Circos frequency file for deletions. Designed for heatmap visualization. |-- transcripts | |-- reference_cdna.exons: reference cDNA shown per exon, per transcript. | |-- variant_cdna.exons: variant cDNA shown per exon, per transcript. | |-- reference_cdna.fasta: reference cDNA for each transcript. | |-- variant_cdna.fasta: variant cDNA for each transcript. | |-- reference_peptides.fasta: reference peptide for each transcript. | |-- variant_peptides.fasta: variant peptide for each transcript. | |-- reference_variant.alignment: alignment of the reference and variant (i.e. all GVs applied) cDNA | |-- exons.circos: Circos frequency file for exons on a bp basis. Designed for heatmap visualization. |-- variant.stat: distribution of early stop codons along proteins, frequency of different ORF status and summary of impact of GVs on transcripts

(6.1) transcripts.gff3 and categorized-gvs.gvf

The Generic Feature Format Version 3 (GFF3) is a widely used and accepted format for representing genomic features such as trnascript models and others. The Genome Variation Format (GVF) is an extension of GFF3 and it is designed to describe the impact of Genomic Variants such as SNV, INS and DEL on different genomic features described in a GFF3 file. Hence, both files are meant to contain complementary information. More details and specifications for both formats can be found at

http://www.sequenceontology.org/

Given this, CooVar provides two outputs:

(6.1.1) transcripts.gff3

An example of a transcript in the transcripts.gff3 file is shown here:

chr20 CooVar mRNA 11790883 11830259 . - . ID=ENST00000427835;Note=snp_746931(conservative_missense_codon),ins_66552,del_83066,ORF_DISRUPTED(15.53) chr20 CooVar coding_exon 11790883 11791087 . - . Parent=ENST00000427835 chr20 CooVar coding_exon 11830155 11830259 . - . Parent=ENST00000427835

For each transcript, CooVar provides the coordinates of the overall coding region (mRNA) together with the coordinates for each coding exon. In the first line, The Note tag is used to describe , in a comma separated format,

all snps impacting the protein-coding transcript, and its categorization,
all insertions impacting the protein-coding transcript,
all deletions impacting the protein-coding transcript,
the overall impact of ALL GVs on the ORF (ORF_INTACT/ORF_DISRUPTED/ORF_PRESERVED/FULLY_DELETED)
if the ORF is found disrupted by the GVs provided as input, then a percentage in parenthesis indicates the location within the protein where the first stop codon occurs.

ORF_INTACT: the coding region of the transcript is not impacted by GVs. ORF_PRESERVED: the overall impact of GVs on the variant transcript is such that it doesn't introduce an early stop codon, compared to the reference. ORF_DISRUPTED: the overall impact of GVs on the variant transcript is such that it introduces an early stop codon, compared to
the reference. FULLY_DELETED: the variant transcript is such that only two, one or zero base pairs of the coding region remain after a deletion event.

Note: the information associated to each GV listed in the GFF3 file (e.g. snp_746931) can be found in the categorized-gvs.gvf file.

(6.1.2) categorized-gvs.gvf

An example of a SNV, INS and DEL in the categorized-gvs.gvf file is shown:

chr15 CooVar SNV 80988136 80988136 . + . ID=snp_14204; Variant_seq=G;Reference_seq=C;Variant_effect=non_conservative_missense_codon 0 mRNA ENST00000258884,ENST00000424700; Note=tgc>tgG_C>W_aa122_codon_loc3_RADICAL(215),tgc>tgG_C>W_aa122_codon_loc3_RADICAL(215) chr14 CooVar insertion 90740444 90740445 . + . ID=ins_1;Variant_seq=G;Reference_seq=~;Variant_effect=frameshift_variant 0 mRNA ENST00000380590 chr14 CooVar deletion 94935764 94935766 . + . ID=del_9;Variant_seq=-;Reference_seq=~;Variant_effect=inframe_variant 0 mRNA ENST00000380365,ENST00000337425,ENST00000448305

For each case, columns 1, 4 and 5 represent the chromosome, start and end coordinate of the GV.

In column 8:

Variant_seq represents the variant sequence for this GV.
Reference_seq represents the reference sequence for each GV.
Variant_effect represents the predicted effect of this GV on the ORF (e.g. stop gained, synonymous, conservative_missense, non_conservative_missense, etc.) format. All GVs impacting a transcript are found in the corresponding GFF3 file.
The "Note=" tag for SNPs can be lengthy and it displays the following information: -- codon_reference>codon_variant. Note that this codon can come from the + or - strand of the genomic DNA. -- aa_reference>aa_variant. -- location of the SNP within the codon (1,2 or 3). -- functional impact of the amino acid change for mis-sense mutations together with the associated Grantham Score in parenthesis.

(7) TIPS FOR RUNNING COOVAR FASTER

Some steps of CooVar can take time when run on very large (human sized) data files. However, even in this case CooVar should return results in a reasonable amount of time (~1h). One way to speed up execution is to run CooVar for different chromosomes separately. Also, consider commenting out lines in the coovar.pl script if re-running a particular step of the analysis is not required. For example, by commenting out the line executing the 'extract-cdna.pl' script, CooVar would not extract cDNA sequences when it is run again, using existing output files from previous runs.

(8) HOW TO CITE COOVAR

If you find CooVar useful in your work, please cite:

Vergara IA, Frech C and Chen N. CooVar: Co-occurring variant analyzer BMC Research Notes 2012, 5:615

Gig77 / CooVar

readme