This project provides a catalogue of canonical core histone genes, encoded proteins, and pseudogenes based on reference genome annotations. It also provides context on the variation of histone properties, isoforms, clusters, and their nomenclature.
Since curation and annotation are dynamic and evolving, the catalogue is generated as a live publication that always provides up-to-date information in an accessible format.
Inspired by the ideals of reproducible research, this project contains all the code required to automate and create a new build of the catalogue from current annotations. This is based on SCons, an automated software build system.
To build the catalogue:
Running scons will check that all required software is installed, search the databases for the histone genes, download all required sequences, analyze the sequences, generate figures and compile the catalogue in PDF format as required.
Example command to generate a fully updated manuscript PDF:
scons \
--api_key='xxxxxxxxxxxxxxxxxxx' \
--email='example@domain.top' \
update manuscript
For a complete list of targets and options, run:
scons -h
The 'manuscript' is a PDF in which all tables and figures are embedded in a contextual discussion of canonical histone genes and proteins. Some additional analyses are also included.
This is the format of the published histone catalogue and probably what you want.
scons manuscript
The 'catalogue' is a PDF with multiple tables and figures but not embedded in a manuscript context. Catalogue is the default if no target is specified, so the following are equivalent.
scons catalogue
scons
The full sequence download process can take quite some time due to Entrez data access rate limitations. If you have previously downloaded sequence data, your build defaults to the existing sequences. The build does NOT automatically download new data unless you specify 'update'.
To force a refresh of the Entrez data include 'update' in the target.
scons update catalogue
scons update manuscript
Note that the sequence release datestamp is shown in the catalogue and manuscript PDFs.
To download only the sequence data without performing any analysis, use the `data' target. This is equivalent to 'update' but does not build a catalogue or manuscript. It makes all sequences available in CSV format in the results/sequences subdirectory of histone-catalogue.
scons data
Note that the data subdirectory contains certain fixed data required for the builds, not the sequence data.
The Entrez databases searched via E-utilities require an email address, although this is not enforced. Providing an email is a courtesy that allows NCBI staff to contact you in case you accidentally overload their servers.
In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI. [...] The value of email will be used only to contact developers if NCBI observes requests that violate our policies, and we will attempt such contact prior to blocking access.
For more details, see the section "Usage guidelines and requirements" in A General Introduction to the E-utilities.
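The off-peak window quoted above can be checked before starting a long update. The following is a minimal sketch, not part of the build: the function name is_off_peak is hypothetical, and it assumes the system has the America/New_York time zone data installed (without it, date silently falls back to UTC).

```shell
#!/bin/sh
# Hypothetical helper, not part of histone-catalogue.
# Succeeds (exit status 0) when the given Eastern-time weekday and hour fall
# inside the window NCBI suggests for large jobs.
#   $1: ISO day of week, 1 (Monday) to 7 (Sunday)
#   $2: hour of day, 0 to 23
is_off_peak() {
    dow=$1
    hour=$2
    # Saturday and Sunday are off-peak all day.
    if [ "$dow" -ge 6 ]; then
        return 0
    fi
    # Weekdays: from 9:00 PM until 5:00 AM.
    if [ "$hour" -ge 21 ] || [ "$hour" -lt 5 ]; then
        return 0
    fi
    return 1
}

# Read the current weekday and hour in US Eastern time.
now_dow=$(TZ=America/New_York date +%u)
now_hour=$(TZ=America/New_York date +%H)
now_hour=${now_hour#0}   # drop a leading zero, e.g. "08" becomes "8"

if is_off_peak "$now_dow" "$now_hour"; then
    echo "Off-peak at NCBI: a full update is reasonable now."
else
    echo "Peak hours at NCBI: consider postponing 'scons update'."
fi
```

The check only reports the window; it does not throttle requests, so the three-requests-per-second guideline still applies.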
You should provide the --email option:
scons --email=example@domain.top
The Entrez databases allow faster retrieval, up to 10 requests per second, if an API key is included. Users can obtain a key for free via the MyNCBI interface. Note that you should also include your email address.
You can include the API key using the --api_key option:
scons --api_key='xxxxxxxxxxxxxxxxxxx' --email='example@domain.top'
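To avoid retyping the key and email for every build, they can be kept in environment variables and expanded into the options shown above. This is a sketch under assumptions: the function scons_credential_args and the variable names NCBI_EMAIL and NCBI_API_KEY are chosen here for illustration, and the build itself is not known to read them.

```shell
#!/bin/sh
# Hypothetical helper, not part of histone-catalogue.
# Prints the scons options for whichever credentials are set in the
# environment, one option per line. NCBI_EMAIL and NCBI_API_KEY are
# variable names assumed for this sketch.
scons_credential_args() {
    if [ -n "${NCBI_EMAIL:-}" ]; then
        printf '%s\n' "--email=$NCBI_EMAIL"
    fi
    if [ -n "${NCBI_API_KEY:-}" ]; then
        printf '%s\n' "--api_key=$NCBI_API_KEY"
    fi
}

# Usage (not run here). The unquoted substitution relies on the email and
# key containing no whitespace:
#   scons $(scons_credential_args) update manuscript
```

Keeping the key out of the command line also keeps it out of shell history and shared scripts.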
A catalogue of human histones is generated by default. Other organisms can be specified using the `--organism' option. This is heavily dependent on the annotation state of the organism reference genome and has only been tested by us for human and mouse.
scons --organism='mus musculus'
When you're downloading fresh data from the NCBI servers, you may come across the following (or similar) error:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Bad Request
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:449
STACK: Bio::DB::GenericWebAgent::get_Response /usr/share/perl5/Bio/DB/GenericWebAgent.pm:215
STACK: Bio::DB::EUtilities::get_Response /usr/share/perl5/Bio/DB/EUtilities.pm:34
STACK: main::analyze_entrez_genes /usr/bin/bp_genbank_ref_extractor:239
STACK: main::analyze_entrez_genes /usr/bin/bp_genbank_ref_extractor:312
STACK: /usr/bin/bp_genbank_ref_extractor:177
If you do, try again at a quieter time, either at the weekend or during the night in the Eastern Time zone. Also, ensure you specify an email address and a valid NCBI E-utilities API key.
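Retrying after a failed download can be automated with a small wrapper that waits between attempts. This is a minimal sketch: the retry function is hypothetical, and it assumes that scons exits with a non-zero status when the download fails.

```shell
#!/bin/sh
# Hypothetical helper, not part of histone-catalogue.
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failure. Uses plain (global) variables for POSIX sh.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))
    done
}

# Usage (not run here): up to 3 attempts, waiting 10 and then 20 minutes.
#   retry 3 600 scons --email='example@domain.top' update catalogue
```

Long delays are deliberate here: repeating a failed request immediately adds load to the servers that are already rejecting it.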
While it is technically possible to build this on Windows or macOS, it is far easier to do it on Linux. If you do not have a Linux system available, the histone-catalogue can be created using Ubuntu on a USB stick or in VirtualBox on a PC or Mac.
A number of software components are required to build the histone catalogue. bp_genbank_ref_extractor, which is used to search for and download sequences, is part of BioPerl's Bio-EUtilities distribution. A complete list of required Perl modules and LaTeX packages is shown by `scons -h'.
If you are on Debian Linux, or a Debian derivative such as Ubuntu, these are pre-packaged for you. See the following instructions: