Histone Catalogue

This project provides a catalogue of canonical core histone genes, encoded proteins, and pseudogenes based on reference genome annotations. It also provides context on the variation of histone properties, isoforms, clusters, and their nomenclature.

Since curation and annotation are dynamic and evolving, the catalogue is generated as a live publication that provides up-to-date information in an accessible format.

Inspired by the ideals of reproducible research, this project contains all the code required to automatically create a new build of the catalogue from current annotations. The build uses SCons, an automated software build system.

Instructions

To build the catalogue:

Install Linux and the software dependencies
Pull the histone-catalogue project from GitHub
Run scons with the build target from the histone-catalogue directory
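The steps above can be sketched as a small shell helper. This is only an illustration: the repository URL is an assumption and should be replaced with the project's actual GitHub address, and the names REPO_URL, DRY_RUN, run and build_catalogue are hypothetical, not part of the project.

```shell
# Assumed repository location; override REPO_URL if it differs.
REPO_URL="${REPO_URL:-https://github.com/af-lab/histone-catalogue.git}"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# build_catalogue TARGET...: clone the project, enter it, and run scons
# with the given targets. Stops at the first step that fails.
build_catalogue() {
    run git clone "$REPO_URL" histone-catalogue &&
        run cd histone-catalogue &&
        run scons "$@"
}

# Example (not executed here):
#   DRY_RUN=1 build_catalogue manuscript   # show the commands only
#   build_catalogue manuscript             # clone and build
```

Running with DRY_RUN=1 first is a cheap way to check the commands before starting a long download.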

Running scons

Running scons will check that all required software is installed, search the databases for the histone genes, download all required sequences, analyze the sequences, generate figures, and compile the catalogue in PDF format, as required.

Example command to generate a fully updated manuscript PDF:

scons \
    --api_key='xxxxxxxxxxxxxxxxxxx' \
    --email='example@domain.top' \
    update manuscript

For a complete list of targets and options:

scons -h

Choosing the build target

target as manuscript

The 'manuscript' is a PDF in which all tables and figures are embedded in a contextual discussion of canonical histone genes and proteins. Some additional analyses are also included.

This is the format of the published histone catalogue and probably what you want.

scons manuscript

target as catalogue

The 'catalogue' is a PDF with multiple tables and figures, but without the surrounding manuscript text. Catalogue is the default if no target is specified, so the following are equivalent:

scons catalogue
scons

target including update

The full sequence download process can take quite some time due to Entrez data access rate limits. If you have previously downloaded sequence data, your build defaults to the existing sequences. The build does NOT automatically download new data unless you specify 'update'.

To force a refresh of the Entrez data, include 'update' in the target:

scons update catalogue
scons update manuscript

Note that the sequence release datestamp is shown in the catalogue and manuscript PDFs.

target as data

To download only the sequence data without performing any analysis, use the 'data' target. This is equivalent to 'update' but without building a catalogue or manuscript. It makes all sequences available in CSV format in the results/sequences subdirectory of histone-catalogue.

scons data

Note that the data subdirectory contains certain fixed data required for the builds, not the sequence data.

Additional options

email

The Entrez databases searched via E-utilities require an email address, although this is not enforced. The email is a courtesy that allows NCBI staff to contact you in case you accidentally overload their servers.

In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI. [...] The value of email will be used only to contact developers if NCBI observes requests that violate our policies, and we will attempt such contact prior to blocking access.

For more details, see the section "Usage guidelines and requirements" of A General Introduction to the E-utilities.
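The off-peak window in the NCBI guidelines above (weekends, or 9:00 PM to 5:00 AM Eastern time on weekdays) can be checked before starting a large download. The following is a minimal sketch; the function name is_off_peak is hypothetical, and the commented usage assumes the system has the America/New_York time zone data installed.

```shell
# is_off_peak HOUR DAY
#   HOUR: 0-23 in US Eastern time
#   DAY:  ISO weekday, 1 = Monday ... 7 = Sunday
# Succeeds (exit status 0) at weekends or between 9:00 PM and 5:00 AM.
is_off_peak() {
    hour=${1#0}        # drop one leading zero, e.g. "09" becomes "9"
    hour=${hour:-0}    # a bare "0" becomes empty above, so restore it
    day=$2
    [ "$day" -ge 6 ] || [ "$hour" -ge 21 ] || [ "$hour" -lt 5 ]
}

# Example (not executed here):
#   if is_off_peak "$(TZ=America/New_York date +%H)" \
#                  "$(TZ=America/New_York date +%u)"; then
#       scons update manuscript
#   else
#       echo "Peak hours at NCBI; consider waiting." >&2
#   fi
```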

You should provide the --email option:

scons --email=example@domain.top

api key

The Entrez databases allow faster retrieval, up to 10 requests per second, if an API key is included. A key can be obtained for free via the MyNCBI interface. Note that you should also include your email address.

You can include the API key using the --api_key option:

scons --api_key='xxxxxxxxxxxxxxxxxxx' --email='example@domain.top'
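To avoid retyping the key, or leaving it in shell history, the two options can be read from environment variables. This is a sketch only: the variable names NCBI_EMAIL, NCBI_API_KEY and SCONS, and the wrapper scons_entrez, are hypothetical and not part of the project. The wrapper passes only the --email and --api_key options described above.

```shell
# scons_entrez TARGET...: run scons with --email (required by this wrapper)
# and --api_key (added only when NCBI_API_KEY is set and non-empty).
# SCONS can be set to another command for testing; it defaults to scons.
scons_entrez() {
    if [ -z "${NCBI_EMAIL:-}" ]; then
        echo "NCBI_EMAIL is not set" >&2
        return 1
    fi
    if [ -n "${NCBI_API_KEY:-}" ]; then
        "${SCONS:-scons}" --email="$NCBI_EMAIL" --api_key="$NCBI_API_KEY" "$@"
    else
        "${SCONS:-scons}" --email="$NCBI_EMAIL" "$@"
    fi
}

# Example (not executed here):
#   export NCBI_EMAIL='example@domain.top'
#   export NCBI_API_KEY='xxxxxxxxxxxxxxxxxxx'
#   scons_entrez update manuscript
```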

organism

A catalogue of human histones is generated by default. Other organisms can be specified using the '--organism' option. The result depends heavily on the annotation state of the organism's reference genome, and we have only tested it for human and mouse.

scons --organism='mus musculus'

Error downloading data

When you're downloading fresh data from the NCBI servers, you may come across the following (or similar) error:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Bad Request
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:449
STACK: Bio::DB::GenericWebAgent::get_Response /usr/share/perl5/Bio/DB/GenericWebAgent.pm:215
STACK: Bio::DB::EUtilities::get_Response /usr/share/perl5/Bio/DB/EUtilities.pm:34
STACK: main::analyze_entrez_genes /usr/bin/bp_genbank_ref_extractor:239
STACK: main::analyze_entrez_genes /usr/bin/bp_genbank_ref_extractor:312
STACK: /usr/bin/bp_genbank_ref_extractor:177

If you do, try again later, either at the weekend or during the night in the Eastern Time zone (between 9:00 PM and 5:00 AM). Also, ensure you specify an email address and a valid NCBI E-utilities API key.
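Retrying can also be automated. The helper below is a minimal sketch of a fixed-delay retry loop; the function name retry and the attempt count and delay in the example are illustrative assumptions, not values recommended by NCBI or by this project.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARG...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping DELAY_SECONDS between attempts.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example (not executed here): up to 3 attempts, 10 minutes apart.
#   retry 3 600 scons --email='example@domain.top' update manuscript
```

A long delay is deliberate here: repeating the request immediately would add load to the servers that rejected it.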

Installing Linux and software dependencies

While it is technically possible to build this on Windows or macOS, it is far easier to do it on Linux. If you do not have a Linux system available, the histone-catalogue can be created using Ubuntu on a USB stick or in VirtualBox on a PC or Mac.

Installing Ubuntu on a USB stick
Installing Ubuntu in VirtualBox

Dependencies

A number of software components are required to build the histone catalogue:

SCons, which provides the build system.
pdflatex, bibtex, epstopdf, and several LaTeX packages, which are required to build the catalogue and manuscript PDF files. The simplest method is to install TeX Live, which provides all of them in a single distribution.
Perl, as well as several Perl modules.
bp_genbank_ref_extractor, which is used to search for and download sequences. It is part of BioPerl's Bio-EUtilities distribution.
weblogo, to generate the sequence logos.

A complete list of required Perl modules and LaTeX packages is shown by 'scons -h'.

If you are on Debian, or a Debian derivative such as Ubuntu, these are pre-packaged for you. See the following instructions:

Directory structure

data - data that is not automatically generated, such as the data from the Marzluff 2002 paper, which we use as a reference for comparison.
figs - figures generated during the build.
lib-perl5 - library for handling sequences, needed by our Perl scripts.
results - data after processing. Includes aligned sequences, as well as LaTeX tables and variable definitions.
results/sequences - sequences downloaded as part of the build.
scripts - collection of scripts for data analysis.
sections - LaTeX source for the different manuscript sections.
site_scons - SCons configuration for this project.
t - tests for lib-perl5.
