probe-generator: make short sequences to probe for mutation events

probe-generator is a tool to make short sequences of base pairs representing fusion and SNP mutations. These probes can be used to screen high-throughput sequencing libraries for evidence of the events represented by the probe.

Installation

probe-generator requires Python v3.2 or later and the docopt python package v0.6.1 or later.

If you have root permissions, installation is as easy as using pip or easy_install to install docopt and the running the setup.py script:

$ easy_install docopt
Searching for docopt
Best match: docopt 0.6.1
Processing docopt-0.6.1-py2.6.egg
[...]
$ python3 setup.py install
running install
running build
running build_py
running install_lib
[...]

Installing in your home folder without root permissions is only slightly more complicated. This requires making a local directory for python packages and installing everything there:

$ mkdir -p $HOME/usr
$ echo 'export PYTHONPATH=$PYTHONPATH:$HOME/usr' >> ~/.bashrc
$ echo 'export PYTHONPATH=$PYTHONPATH:$HOME/usr/lib/python3.2/site-packages' >> ~/.bashrc
$ source ~/.bashrc
$ echo "[easy_install]" >> ~/.pydistutils.cfg
$ echo "install_dir = $HOME/usr" >> ~/.pydistutils.cfg
$ easy_install docopt
Searching for docopt
Best match: docopt 0.6.1
Processing docopt-0.6.1-py2.6.egg
[...]
$ python3 setup.py install --prefix $HOME/usr
running install
running build
running build_py
running install_lib
[...]

In either case, copy or link the script under bin/probe-generator to somewhere in your $PATH and test that everything worked:

$ probe-generator --version
ProbeGenerator version 0.5

Probe statements

The input to probe-generator consists short strings called probe statements. Probe statements are brief descriptions of the genomic locations of mutations.

probe-generator currently supports three different types of probe statements. Exon statements and coordinate statements specify fusion events, while SNP statements specify single nucleotide polymorphisms.

Some elements of probes can be replaced with the glob character ('*'), indicating that any value is acceptable.

Exon statements

Exon statements are in the form:

"{gene}#exon[{number}] {side}{bases} {sep} {gene}#exon[{number}] {side}{bases}"

{gene}:    the name of the gene of interest. Acceptable characters are
           letters, numbers, and '_-/.'. Case is not significant.

{number}:  the cardinality of the feature of interest (1 for the first exon,
           etc.). Must be a digit or '*'.

{side}:    Whether to return a sequence at the start or end of the feature.
           Acceptable characters are '+-*'.

{bases}:   The length of probe sequence to return for this feature. Must be
           a digit or '*'.

{sep}:     The separator. One of '/' or '->'. Determines the probing strategy.

Any of "{feature}", "{number}", "{side}", or "{bases}" can be replaced with the glob character ("*"). In the "{bases}" field, the interpretation is that the sequence of the entire exon will be included in the probe.

White-space between the elements of the statement is ignored.

Ambiguity

Exon statements do not necessarily specify a unique location in the genome. When a probe statement could refer to any of a number of sequences of nucletoides, probe-generator returns all of them.

A probe statement can be ambiguous for three reasons:

Globbing
Alternative transcripts
Non-unique gene names

Consider the following probe statement:

FOO#exon[3] -25 / BAR#exon[*] +25

Imagine that FOO is alternatively spliced, so that there are two different exons that could possibly be called the third. Furthermore, we will assume that the symbol BAR identifies two different genes, each with two exons. In this case, eight different probes will be generated:

> FOO exon 3a / BARa exon 1
> FOO exon 3a / BARa exon 2
> FOO exon 3a / BARb exon 1
> FOO exon 3a / BARb exon 2
> FOO exon 3b / BARa exon 1
> FOO exon 3b / BARa exon 2
> FOO exon 3b / BARb exon 1
> FOO exon 3b / BARb exon 2

where 3a and 3b are the two possible third exons of FOO and BARa and BARb are the two genes called 'BAR'.

In many cases, most or all of the exon junctions in two alternative transcripts are redundant. probe-generator will not print more than one probe with identical genomic coordinates.

Probing strategy

In determining the sequence generated by a fusion between two exons, there is the problem of determining which strand the exons lie on, and whether either exon was reverse-complemented relative to its canonical representation when the fusion occurred. probe-generator automatically reverse-complements exons to generate the correct probe sequence to the greatest degree possible using a user-specified probing strategy.

There are two probing strategies for exon probes: positional and read-through.

Positional probes, indicated by the '/' separator, are created by appending the bases indicated on the left of the separator to the bases indicated on the right, regardless of the orientation of the genes. This is the most flexible way to specify a probe, but it may require the user to know the orientations of the events when specifying a probe.

Read-through probes, indicated by the '->' separator, are used to specify a fusion such that transcription may continue from the end of the first gene to the start of the second gene, resulting in a fusion transcript. This is very useful for specifying probes for oncogenic fusion events, but it is less flexible than a positional representation, as read-through probes must be specified as joining the end of the first exon to the start of the second.

Consider these two (simplified) probe statements:

ABC-3 /  DEF+3
ABC-3 -> DEF+3

If the both features are on the plus strand, the two statements are equivalent:

    ABC             DEF
    |----->         +=====>
    .......................
    .......................

    probe: -->+==>

If both are on the minus strand, however, the statements result in different probes:

    .......................
    .......................
    <-----|         <=====+
    ABC             DEF

    positional:    <--==+
    read-through:  ==+<--

Note that the read-through statement rearranges the probe so that the end of ABC is joined to the beginning of DEF.

Read-through statements are normally specified so that the end of the first feature is fused to the beginning of the second. If some other arrangement is used, a warning message is printed.

Examples

To specify a probe covering the last 20 bases of the first exon of the gene ABC and the first 30 bases of the third exon of DEF, you would pass the following probe statement:

"ABC#exon[1] -20 / DEF#exon[3] +30"

The same fusion, but with any exon of DEF:

"ABC#exon[1] -20 / DEF#exon[*] +30"

Any fusion between exons of FOO and BAR with exactly 40 bases covered:

"FOO#exon[*] *20 / BAR#exon[*] *20"

A 50 base-pair 'read-through' fusion between the first exon of BAM and any exon of POW:

"BAM#exon[1] -25 -> POW#exon[*] +25"

Any fusion between any two exons of SPAM and EGGS, with the entirety of both features covered:

"SPAM#exon[*] ** / EGGS#exon[*] **"

Coordinate statements

The exact genomic location of the breakpoints of the fusion event can also be specified directly using the coordinate-statement format:

"{chromosome}:{breakpoint}{+|-}{bases}/{chromosome}:{breakpoint}{+|-}{bases}"

No globbing is allowed in coordinate statements. Any white-space between elements of the statement is ignored.

Note that, unlike exon statements, a coordinate statement always corresponds to exactly one probe sequence.

Examples

A probe for a fusion event between the 100th base pair of chromosome 1 and the 200th base pair of chromosome Y, with 25 bases on either side:

"1:100-25/Y:200+25"

SNP statements

An SNP statement, as the name suggests, specifies a probe for a single-nucleotide polymorphism event. The syntax is as follows:

"chromosome: coordinate reference > mutation / bases"

The "reference > mutation" element specifies the polymorphism (e.g. "A>C"). The reference, the mutation, or both can be replaced by a '*' character. If the reference and the mutation base are identical, no probe sequence is produced.

The length of the probe sequence is determined by 'bases'. The mutant base is in the centre of the probe (or as near as possible when the number of bases is even).

If the base in the reference genome is not the specified reference base (e.g.: an 'A>T' mutation is specified and the reference base is 'C'), no probe sequence is generated and a warning message is printed.

Globbing

If the reference base is globbed, probe sequences will be generated for both the base at that location in the genome and its reverse-complement.

If a glob character is supplied for the mutation base, all three possible SNPs at that genomic location are generated. Globbing the reference as well as the mutation base gives all six possible probes.

Examples

An A>C mutation at the 100th base of the X chromosome with 25 bases either side of it:

 "X:100 A>C /51"

Probes for both the A>C event specified above and a T>C event on the opposite strand:

 "X:100 *>C /51"

A ten base pair probe for any mutation at the 1000th base pair of chromosome 3:

 "3:1000 *>* /10"

SNP statements in transcripts

It is also possible to specify SNPs relative to a transcript. To specify a mutation for a particular nucleotide, use the following syntax:

"{gene}: c.{nucleotide} {reference base} > {mutation base} / {bases}"

It is also possible to specify probes for an amino acid change using this syntax:

"{gene}: {reference AA} {codon} {mutation AA} / {bases}"

For instance, to specify a 50 base pair probe for a C to G mutation at the 27th nucleotide of the FOO gene, use the statement:

"FOO: c.27 C>G /50"

This statement specifies a proline to histidine mutation at the 50th codon of FOO:

"FOO: P50H /50"

An anti-sense mutation at the same location is given by:

"FOO: P50* /50"

As usual, neither case nor white space is significant (except in the name of the gene).

Globbing is not supported when specifying SNP's relative to a transcript.

In both cases, no probe sequence is produced and a warning is printed if the sequence in the reference genome does not match the sequence specified in the probe statement.

Sequences are reverse-complemented automatically when the specified transcript is on the minus strand.

If there is more than one transcript matching the gene name given, probes are produced for all of them (although frequently the reference sequence will not match for alternative transcripts, meaning that no probe sequence is produced).

In the case that the amino acid is specified, one probe is produced for every codon which codes for an amino acid. For instance, specifying leucine as the mutation amino acid results in at least six probe sequences, some of which will have more than one base pair difference from the reference genome. All possible codons are also used for the reference amino acid, although obviously a maximum of one will match the reference genome.

For example, the probe statement:

"ALK: Q115R /50"

will result in (2 proline codons) x (4 arginine codons) = 8 probes, four specifying CAA as the reference sequence and four specifying CAG. If the reference sequence at the specified location is CAA, four probe sequences will be produced (one for each codon which codes for arginine).

Insertion/deletion probes

Insertion and deletion (indel) probes are specified in the same way as SNP probes. Indels can only be specified relative to a transcript at present.

Indel statements are similar to SNP statements, except that instead of the 'N>N' variant specification, the user supplies 'ins' and 'del' fields specifying the base(s) inserted and deleted. Both are optional (but you need at least one, obviously). As usual, case is insignificant. If both fields are used, 'del' must be specified before 'ins'.

There is no globbing mechanism for indel probes at present. Both the insertion and deletion sequences must be fully specified.

The insertion and deletion sequences can be any length (although the inserted sequence must be shorter than the total length of the probe).

If the deleted sequence requested crosses an exon/exon junction, an error is printed and no probe is produced. In this case, there is not a single, unique sequence that can be said to correspond to the variant.

Examples

FOO: c.100 del ACGT /50 --four base-pair deletion

BAR: c.50 ins gt /20 --two base-pair insertion

TOB2: c.703 del CT ins AAA /50 --both insertion and deletion specified

X: c.100 del C ins A /50 --same as X: c.100 C>A/50

Using transcriptome sequence only

You can generate probes using only transcribed sequences by using the '[trans]' flag in indel or SNP probes where the gene is specified. If this option is used and the probe includes an exon/exon junction, the probe sequence will be taken from the next exon rather than the intronic sequence.

If the variant is not near an exon/exon junction then there is no difference between standard probes and those which use transcript sequence only.

This feature should be considered experimental.

Examples

FOO: P50H [trans] /50
ALK: Q115R [trans] /50
TOB2: c.703 del CT ins AAA [trans] /50

Comments

Any probe statement can be followed by a comment. Comments have no effect, except that they are printed in the head of the probe. This is used to attach additional information to the probe.

Comments start with two hyphens ('--') and continue to the end of the line.

E.g:

 "X:100 A>C /51                    -- SNP found in patient XYZ"
 "FOO#exon[1]-25 -> BAR#exon[3]+25 -- confers resistance to madeupifam"
 "2:119726736+25/2:121555044+25    -- GLI2/MARCO fusion"

Note that comments are not modified by probe-generator in any way. In particular, unlike the probe statements themselves, white-space is not stripped.

Usage

    probe-generator --statements FILE --genome FILE [--annotation FILE...] [-f]

Options:
    -s FILE --statements=FILE       a file containing probe statements
    -g FILE --genome=FILE           the reference genome (FASTA format)
    -a FILE --annotation=FILE       a genome annotation file in UCSC format
    -f --force                      run even if the total system memory is
                                    insufficient or cannot be determined

The 'statements' file can contain any of the flavours of probe statements described above, or a mixture.

Genome annotations

When using exon probe statements, probe-generator requires a UCSC genome annotation in order to determine the boundaries of the exons. Currently, the RefSeq Genes and UCSC Genes annotation files are supported. More than one annotation file can be specified for a single run:

$ probe-generator -s statements.txt -g genome.fa \
                  -a refseq_genes.txt            \
                  -a ucsc_genes.txt

Annotations can be downloaded from the UCSC table browser. Make sure to use the output format 'all fields from selected table'.

To prevent memory errors (see below), probe-generator will raise a warning if it detects that the total system memory in the current environment is less than 10Gb, or if the total system memory cannot be determined (the memory can only be determined on a Linux system at present).

The --force flag can be used to override this warning in testing situations with a small genome reference file, or if the user is pretty sure that enough memory is available. Running with --force set is STRONGLY discouraged for ordinary use, however.

Output

Probes are printed to standard out in FASTA format. The contents of the headers of the probes depend on the type of statement used to specify the probes. In general, the header contains the probe statement (stripped of white-space if necessary) followed by the comment if any.

Note that white-space is stripped only from the probe statement, not the comments.

If exon statements are used, the header consists of the probe statement (expanded if necessary), the coordinates of the probe and the unique identifiers of the transcripts used in determining the location of the probe:

# FOO#exon[1] -10 -> BAR#exon[*] +10 -->

>FOO#exon[1]-10->BAR#exon[1]_+10_1:100/2:200_N000001_N0000002
ACGTTACGTTGCGCGCGCGC
>FOO#exon[1]-10->BAR#exon[2]+10_1:100/2:250_N000001_N0000002
ACGTTACGTTATATATATAT
... etc

In SNP probes which specify a transcript, the coordinate of the the SNP and the transcript ID are added:

# FOO: c.123 T>A /5 -->

>FOO:c.123_T>A/5_N00001_1:100
TTATT

When an amino acid change is specified, the coordinate of the first base pair of the codon (not necessarily the coordinate of the mutation) as well as the reference and mutation codon sequences are also given:

# FOO:L50*/5

>FOO:L50*(TTA>TAA)/5_N00001_1:100
GTAAG

Performance

Using the hg19 human genome reference, probe-generator uses about 15.5 Gb of memory at peak. This will run comfortably on all.q or xhost08, but I don't recommend trying it on your workstation unless you have a much nicer computer than mine.

As a rule of thumb, the peak memory usage will be about 5 times the size of the sum of the text input (annotations and genome) on disk.

Troubleshooting

probe-generator often produces many warning messages due to reference mismatches in alternatively-spliced transcripts or ambiguous reference sequences (caused by globbing or specifying an amino acid mutation). In most cases, these can be ignored.

If no probe sequences can be produced for a particular probe, a warning message starting with "WARNING" (in capitals) will be printed to standard error, along with the statement itself. The most common causes are:

 - The probe statement could not be parsed
    * Check the syntax of the statement

 - No annotation file was provided for a statement specifying a transcript
    * Check that at least one annotation file was specified using the '-a'
      flag when running the `probe-generator` command

 - The gene could not be found in any of the annotation files provided
    * Check that the name of the gene is spelled correctly
    * Check that the case of the gene name is correct
    * Check to see whether the gene has any synonyms, e.g.:
        - "GOPC "and "FIG "refer to the same gene
        - "NEURL" is called "NEURL1" in the UCSC annotations

- The reference genome does not match any of the possible values for the
  specified reference sequence
    * Check that the coordinate and reference sequence are specified
      correctly in the probe

bcgsc / ProbeGenerator

readme

Installation

Probe statements

Exon statements

Ambiguity

Probing strategy

Examples

Coordinate statements

Examples

SNP statements

Globbing

Examples

SNP statements in transcripts

Insertion/deletion probes

Examples

Using transcriptome sequence only

Examples

Comments

Usage

Genome annotations

Output

Performance

Troubleshooting