probe-generator: make short sequences to probe for mutation events
probe-generator
is a tool to make short sequences of base pairs representing
fusion and SNP mutations. These probes can be used to screen high-throughput
sequencing libraries for evidence of the events represented by the probe.
probe-generator
requires Python v3.2 or later and the docopt
python package
v0.6.1 or later.
If you have root permissions, installation is as easy as using pip
or
easy_install
to install docopt and the running the setup.py
script:
$ easy_install docopt
Searching for docopt
Best match: docopt 0.6.1
Processing docopt-0.6.1-py2.6.egg
[...]
$ python3 setup.py install
running install
running build
running build_py
running install_lib
[...]
Installing in your home folder without root permissions is only slightly more complicated. This requires making a local directory for python packages and installing everything there:
$ mkdir -p $HOME/usr
$ echo 'export PYTHONPATH=$PYTHONPATH:$HOME/usr' >> ~/.bashrc
$ echo 'export PYTHONPATH=$PYTHONPATH:$HOME/usr/lib/python3.2/site-packages' >> ~/.bashrc
$ source ~/.bashrc
$ echo "[easy_install]" >> ~/.pydistutils.cfg
$ echo "install_dir = $HOME/usr" >> ~/.pydistutils.cfg
$ easy_install docopt
Searching for docopt
Best match: docopt 0.6.1
Processing docopt-0.6.1-py2.6.egg
[...]
$ python3 setup.py install --prefix $HOME/usr
running install
running build
running build_py
running install_lib
[...]
In either case, copy or link the script under bin/probe-generator
to
somewhere in your $PATH and test that everything worked:
$ probe-generator --version
ProbeGenerator version 0.5
The input to probe-generator
consists short strings called probe
statements. Probe statements are brief descriptions of the genomic locations
of mutations.
probe-generator
currently supports three different types of probe
statements. Exon statements and coordinate statements specify fusion
events, while SNP statements specify single nucleotide polymorphisms.
Some elements of probes can be replaced with the glob character ('*'), indicating that any value is acceptable.
Exon statements are in the form:
"{gene}#exon[{number}] {side}{bases} {sep} {gene}#exon[{number}] {side}{bases}"
{gene}: the name of the gene of interest. Acceptable characters are
letters, numbers, and '_-/.'. Case is not significant.
{number}: the cardinality of the feature of interest (1 for the first exon,
etc.). Must be a digit or '*'.
{side}: Whether to return a sequence at the start or end of the feature.
Acceptable characters are '+-*'.
{bases}: The length of probe sequence to return for this feature. Must be
a digit or '*'.
{sep}: The separator. One of '/' or '->'. Determines the probing strategy.
Any of "{feature}", "{number}", "{side}", or "{bases}" can be replaced with the glob character ("*"). In the "{bases}" field, the interpretation is that the sequence of the entire exon will be included in the probe.
White-space between the elements of the statement is ignored.
Exon statements do not necessarily specify a unique location in the
genome. When a probe statement could refer to any of a number of sequences of
nucletoides, probe-generator
returns all of them.
A probe statement can be ambiguous for three reasons:
Consider the following probe statement:
FOO#exon[3] -25 / BAR#exon[*] +25
Imagine that FOO is alternatively spliced, so that there are two different exons that could possibly be called the third. Furthermore, we will assume that the symbol BAR identifies two different genes, each with two exons. In this case, eight different probes will be generated:
> FOO exon 3a / BARa exon 1
> FOO exon 3a / BARa exon 2
> FOO exon 3a / BARb exon 1
> FOO exon 3a / BARb exon 2
> FOO exon 3b / BARa exon 1
> FOO exon 3b / BARa exon 2
> FOO exon 3b / BARb exon 1
> FOO exon 3b / BARb exon 2
where 3a and 3b are the two possible third exons of FOO and BARa and BARb are the two genes called 'BAR'.
In many cases, most or all of the exon junctions in two alternative transcripts
are redundant. probe-generator
will not print more than one probe with
identical genomic coordinates.
In determining the sequence generated by a fusion between two exons, there is
the problem of determining which strand the exons lie on, and whether either
exon was reverse-complemented relative to its canonical representation when the
fusion occurred. probe-generator
automatically reverse-complements exons to
generate the correct probe sequence to the greatest degree possible using a
user-specified probing strategy.
There are two probing strategies for exon probes: positional and read-through.
Positional probes, indicated by the '/' separator, are created by appending the bases indicated on the left of the separator to the bases indicated on the right, regardless of the orientation of the genes. This is the most flexible way to specify a probe, but it may require the user to know the orientations of the events when specifying a probe.
Read-through probes, indicated by the '->' separator, are used to specify a fusion such that transcription may continue from the end of the first gene to the start of the second gene, resulting in a fusion transcript. This is very useful for specifying probes for oncogenic fusion events, but it is less flexible than a positional representation, as read-through probes must be specified as joining the end of the first exon to the start of the second.
Consider these two (simplified) probe statements:
ABC-3 / DEF+3
ABC-3 -> DEF+3
If the both features are on the plus strand, the two statements are equivalent:
ABC DEF
|-----> +=====>
.......................
.......................
probe: -->+==>
If both are on the minus strand, however, the statements result in different probes:
.......................
.......................
<-----| <=====+
ABC DEF
positional: <--==+
read-through: ==+<--
Note that the read-through statement rearranges the probe so that the end of ABC is joined to the beginning of DEF.
Read-through statements are normally specified so that the end of the first feature is fused to the beginning of the second. If some other arrangement is used, a warning message is printed.
To specify a probe covering the last 20 bases of the first exon of the gene ABC and the first 30 bases of the third exon of DEF, you would pass the following probe statement:
"ABC#exon[1] -20 / DEF#exon[3] +30"
The same fusion, but with any exon of DEF:
"ABC#exon[1] -20 / DEF#exon[*] +30"
Any fusion between exons of FOO and BAR with exactly 40 bases covered:
"FOO#exon[*] *20 / BAR#exon[*] *20"
A 50 base-pair 'read-through' fusion between the first exon of BAM and any exon of POW:
"BAM#exon[1] -25 -> POW#exon[*] +25"
Any fusion between any two exons of SPAM and EGGS, with the entirety of both features covered:
"SPAM#exon[*] ** / EGGS#exon[*] **"
The exact genomic location of the breakpoints of the fusion event can also be specified directly using the coordinate-statement format:
"{chromosome}:{breakpoint}{+|-}{bases}/{chromosome}:{breakpoint}{+|-}{bases}"
No globbing is allowed in coordinate statements. Any white-space between elements of the statement is ignored.
Note that, unlike exon statements, a coordinate statement always corresponds to exactly one probe sequence.
A probe for a fusion event between the 100th base pair of chromosome 1 and the 200th base pair of chromosome Y, with 25 bases on either side:
"1:100-25/Y:200+25"
An SNP statement, as the name suggests, specifies a probe for a single-nucleotide polymorphism event. The syntax is as follows:
"chromosome: coordinate reference > mutation / bases"
The "reference > mutation" element specifies the polymorphism (e.g. "A>C"). The reference, the mutation, or both can be replaced by a '*' character. If the reference and the mutation base are identical, no probe sequence is produced.
The length of the probe sequence is determined by 'bases'. The mutant base is in the centre of the probe (or as near as possible when the number of bases is even).
If the base in the reference genome is not the specified reference base (e.g.: an 'A>T' mutation is specified and the reference base is 'C'), no probe sequence is generated and a warning message is printed.
If the reference base is globbed, probe sequences will be generated for both the base at that location in the genome and its reverse-complement.
If a glob character is supplied for the mutation base, all three possible SNPs at that genomic location are generated. Globbing the reference as well as the mutation base gives all six possible probes.
An A>C mutation at the 100th base of the X chromosome with 25 bases either side of it:
"X:100 A>C /51"
Probes for both the A>C event specified above and a T>C event on the opposite strand:
"X:100 *>C /51"
A ten base pair probe for any mutation at the 1000th base pair of chromosome 3:
"3:1000 *>* /10"
It is also possible to specify SNPs relative to a transcript. To specify a mutation for a particular nucleotide, use the following syntax:
"{gene}: c.{nucleotide} {reference base} > {mutation base} / {bases}"
It is also possible to specify probes for an amino acid change using this syntax:
"{gene}: {reference AA} {codon} {mutation AA} / {bases}"
For instance, to specify a 50 base pair probe for a C to G mutation at the 27th nucleotide of the FOO gene, use the statement:
"FOO: c.27 C>G /50"
This statement specifies a proline to histidine mutation at the 50th codon of FOO:
"FOO: P50H /50"
An anti-sense mutation at the same location is given by:
"FOO: P50* /50"
As usual, neither case nor white space is significant (except in the name of the gene).
Globbing is not supported when specifying SNP's relative to a transcript.
In both cases, no probe sequence is produced and a warning is printed if the sequence in the reference genome does not match the sequence specified in the probe statement.
Sequences are reverse-complemented automatically when the specified transcript is on the minus strand.
If there is more than one transcript matching the gene name given, probes are produced for all of them (although frequently the reference sequence will not match for alternative transcripts, meaning that no probe sequence is produced).
In the case that the amino acid is specified, one probe is produced for every codon which codes for an amino acid. For instance, specifying leucine as the mutation amino acid results in at least six probe sequences, some of which will have more than one base pair difference from the reference genome. All possible codons are also used for the reference amino acid, although obviously a maximum of one will match the reference genome.
For example, the probe statement:
"ALK: Q115R /50"
will result in (2 proline codons) x (4 arginine codons) = 8 probes, four specifying CAA as the reference sequence and four specifying CAG. If the reference sequence at the specified location is CAA, four probe sequences will be produced (one for each codon which codes for arginine).
Insertion and deletion (indel) probes are specified in the same way as SNP probes. Indels can only be specified relative to a transcript at present.
Indel statements are similar to SNP statements, except that instead of the 'N>N' variant specification, the user supplies 'ins' and 'del' fields specifying the base(s) inserted and deleted. Both are optional (but you need at least one, obviously). As usual, case is insignificant. If both fields are used, 'del' must be specified before 'ins'.
There is no globbing mechanism for indel probes at present. Both the insertion and deletion sequences must be fully specified.
The insertion and deletion sequences can be any length (although the inserted sequence must be shorter than the total length of the probe).
If the deleted sequence requested crosses an exon/exon junction, an error is printed and no probe is produced. In this case, there is not a single, unique sequence that can be said to correspond to the variant.
FOO: c.100 del ACGT /50 --four base-pair deletion
BAR: c.50 ins gt /20 --two base-pair insertion
TOB2: c.703 del CT ins AAA /50 --both insertion and deletion specified
X: c.100 del C ins A /50 --same as X: c.100 C>A/50
You can generate probes using only transcribed sequences by using the '[trans]' flag in indel or SNP probes where the gene is specified. If this option is used and the probe includes an exon/exon junction, the probe sequence will be taken from the next exon rather than the intronic sequence.
If the variant is not near an exon/exon junction then there is no difference between standard probes and those which use transcript sequence only.
This feature should be considered experimental.
FOO: P50H [trans] /50
ALK: Q115R [trans] /50
TOB2: c.703 del CT ins AAA [trans] /50
Any probe statement can be followed by a comment. Comments have no effect, except that they are printed in the head of the probe. This is used to attach additional information to the probe.
Comments start with two hyphens ('--') and continue to the end of the line.
E.g:
"X:100 A>C /51 -- SNP found in patient XYZ"
"FOO#exon[1]-25 -> BAR#exon[3]+25 -- confers resistance to madeupifam"
"2:119726736+25/2:121555044+25 -- GLI2/MARCO fusion"
Note that comments are not modified by probe-generator
in any way. In
particular, unlike the probe statements themselves, white-space is not
stripped.
probe-generator --statements FILE --genome FILE [--annotation FILE...] [-f]
Options:
-s FILE --statements=FILE a file containing probe statements
-g FILE --genome=FILE the reference genome (FASTA format)
-a FILE --annotation=FILE a genome annotation file in UCSC format
-f --force run even if the total system memory is
insufficient or cannot be determined
The 'statements' file can contain any of the flavours of probe statements described above, or a mixture.
When using exon probe statements, probe-generator
requires a UCSC genome
annotation in order to determine the boundaries of the exons. Currently, the
RefSeq Genes and UCSC Genes annotation files are supported. More than one
annotation file can be specified for a single run:
$ probe-generator -s statements.txt -g genome.fa \
-a refseq_genes.txt \
-a ucsc_genes.txt
Annotations can be downloaded from the UCSC table browser. Make sure to use the output format 'all fields from selected table'.
To prevent memory errors (see below), probe-generator
will raise a warning if
it detects that the total system memory in the current environment is less than
10Gb, or if the total system memory cannot be determined (the memory can only
be determined on a Linux system at present).
The --force
flag can be used to override this warning in testing situations
with a small genome reference file, or if the user is pretty sure that enough
memory is available. Running with --force
set is STRONGLY discouraged for
ordinary use, however.
Probes are printed to standard out in FASTA format. The contents of the headers of the probes depend on the type of statement used to specify the probes. In general, the header contains the probe statement (stripped of white-space if necessary) followed by the comment if any.
Note that white-space is stripped only from the probe statement, not the comments.
If exon statements are used, the header consists of the probe statement (expanded if necessary), the coordinates of the probe and the unique identifiers of the transcripts used in determining the location of the probe:
# FOO#exon[1] -10 -> BAR#exon[*] +10 -->
>FOO#exon[1]-10->BAR#exon[1]_+10_1:100/2:200_N000001_N0000002
ACGTTACGTTGCGCGCGCGC
>FOO#exon[1]-10->BAR#exon[2]+10_1:100/2:250_N000001_N0000002
ACGTTACGTTATATATATAT
... etc
In SNP probes which specify a transcript, the coordinate of the the SNP and the transcript ID are added:
# FOO: c.123 T>A /5 -->
>FOO:c.123_T>A/5_N00001_1:100
TTATT
When an amino acid change is specified, the coordinate of the first base pair of the codon (not necessarily the coordinate of the mutation) as well as the reference and mutation codon sequences are also given:
# FOO:L50*/5
>FOO:L50*(TTA>TAA)/5_N00001_1:100
GTAAG
Using the hg19 human genome reference, probe-generator
uses about 15.5 Gb of
memory at peak. This will run comfortably on all.q
or xhost08
, but I don't
recommend trying it on your workstation unless you have a much nicer computer
than mine.
As a rule of thumb, the peak memory usage will be about 5 times the size of the sum of the text input (annotations and genome) on disk.
probe-generator
often produces many warning messages due to reference
mismatches in alternatively-spliced transcripts or ambiguous reference
sequences (caused by globbing or specifying an amino acid mutation). In most
cases, these can be ignored.
If no probe sequences can be produced for a particular probe, a warning message starting with "WARNING" (in capitals) will be printed to standard error, along with the statement itself. The most common causes are:
- The probe statement could not be parsed
* Check the syntax of the statement
- No annotation file was provided for a statement specifying a transcript
* Check that at least one annotation file was specified using the '-a'
flag when running the `probe-generator` command
- The gene could not be found in any of the annotation files provided
* Check that the name of the gene is spelled correctly
* Check that the case of the gene name is correct
* Check to see whether the gene has any synonyms, e.g.:
- "GOPC "and "FIG "refer to the same gene
- "NEURL" is called "NEURL1" in the UCSC annotations
- The reference genome does not match any of the possible values for the
specified reference sequence
* Check that the coordinate and reference sequence are specified
correctly in the probe