af-lab / histone-catalogue

Core histone catalogue --- Live manuscript

Histone Catalogue

This project provides a catalogue of canonical core histone genes, encoded proteins, and pseudogenes based on reference genome annotations. It also provides context on the variation of histone properties, isoforms, clusters, and their nomenclature.

Since curation and annotation are dynamic and evolving, the catalogue generates a live publication that can provide always up-to-date information in an accessible format.

Inspired by the ideals of reproducible research, this project contains all the code required to automate and create a new build of the catalogue from current annotations. This is based on SCons, an automated software build system.

Instructions

To build the catalogue:

  1. Install Linux and the software dependencies
  2. Clone the histone-catalogue project from GitHub
  3. Run scons with the build target from the histone-catalogue directory
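
Steps 2 and 3 above need git and scons on the PATH before anything else can run. The following is a minimal shell sketch of a pre-flight check, not part of histone-catalogue itself; the function name missing_tools is made up for illustration. It only looks for commands on the PATH, and scons still performs its own check of the remaining dependencies.

```shell
# Print the name of every listed command that is not found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (not run here): missing_tools git scons
```

An empty result means both commands were found; any name printed needs to be installed first.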

Running scons

Running scons will check that all required software is installed, search the databases for the histone genes, download all required sequences, analyze the sequences, generate figures and compile the catalogue in PDF format as required.

Example command to generate a fully updated manuscript PDF:

scons \
    --api_key='xxxxxxxxxxxxxxxxxxx' \
    --email='example@domain.top' \
    update manuscript

For a complete list of targets and options, run:

scons -h

Choosing the build target

target as manuscript

The 'manuscript' is a PDF in which all tables and figures are embedded in a contextual discussion of canonical histone genes and proteins. Some additional analyses are also included.

This is the format of the published histone catalogue and probably what you want.

scons manuscript

target as catalogue

The 'catalogue' is a PDF with multiple tables and figures but not embedded in a manuscript context. Catalogue is the default if no target is specified, so the following are equivalent.

scons catalogue
scons

target including update

The full sequence download process can take quite some time due to Entrez data access rate limitations. If you have previously downloaded sequence data, your build defaults to the existing sequences. The build does NOT automatically download new data unless you specify 'update'.

To force a refresh of the Entrez data include 'update' in the target.

scons update catalogue
scons update manuscript

Note that the sequence release datestamp is shown in the catalogue and manuscript PDFs.

target as data

To download only the sequence data without performing any analysis, use the `data' target. This is equivalent to 'update' but without building a catalogue or manuscript. It makes all sequences available in csv format in the results/sequences subdirectory of histone-catalogue.

scons data

Note that the data subdirectory contains certain fixed data required for the builds, not the sequence data.
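
After running the data target, one quick way to confirm that sequences were written is to count the csv files in results/sequences. This is a minimal shell sketch under that assumption; the function name count_csv is hypothetical, and the names of the individual csv files are not assumed.

```shell
# Print the number of .csv files directly inside the given directory.
count_csv() {
    dir=$1
    count=0
    for f in "$dir"/*.csv; do
        # When nothing matches, the pattern stays unexpanded, so test for existence.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    printf '%s\n' "$count"
}

# Example (not run here), from the histone-catalogue directory:
#   count_csv results/sequences
```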

Additional options

email

The Entrez databases searched via E-utilities require an email address, although this is not enforced. The address is a courtesy that allows NCBI staff to contact you if you accidentally overload their servers.

In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI. [...] The value of email will be used only to contact developers if NCBI observes requests that violate our policies, and we will attempt such contact prior to blocking access.

For more details, see section "Usage guidelines and requirements", on A General Introduction to the E-utilities.

You should provide the --email option:

scons --email=example@domain.top
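
The off-peak window in the NCBI guidance quoted above (weekends, or 9:00 PM to 5:00 AM Eastern time on weekdays) can be checked before starting a large download. The following is a minimal shell sketch, not part of histone-catalogue; the function names is_off_peak and off_peak_now are made up for illustration, and off_peak_now assumes the system has zone data for America/New_York.

```shell
# Succeed (exit status 0) when the given Eastern-time weekday and hour fall
# inside the off-peak window: any time at the weekend, or from 21:00 up to
# (but not including) 05:00 on weekdays.
# $1 = ISO weekday (1 = Monday ... 7 = Sunday), $2 = hour of day (0-23).
is_off_peak() {
    dow=$1
    hour=$2
    if [ "$dow" -ge 6 ]; then
        return 0
    fi
    [ "$hour" -ge 21 ] || [ "$hour" -lt 5 ]
}

# Apply the same check to the current time in US Eastern time.
off_peak_now() {
    # One date call, so weekday and hour come from the same instant.
    set -- $(TZ=America/New_York date '+%u %H')
    is_off_peak "$1" "$2"
}

# Example (not run here):
#   off_peak_now && scons --email='example@domain.top' update catalogue
```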

api key

The Entrez databases allow faster retrieval, up to 10 requests per second, if an API key is included. Users can obtain a key for free via the MyNCBI interface. Note that you should still include your email address.

You can include the API key using the --api_key option:

scons --api_key='xxxxxxxxxxxxxxxxxxx' --email='example@domain.top'
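
To avoid typing the key on every command line, the two options can be assembled from environment variables. This is a minimal shell sketch, not something the project provides; the variable names NCBI_EMAIL and NCBI_API_KEY and the function name entrez_options are hypothetical, and only the --email and --api_key options themselves come from the instructions above.

```shell
# Print one scons option per line for each variable that is set and non-empty.
# NCBI_EMAIL and NCBI_API_KEY are hypothetical names chosen for this sketch.
entrez_options() {
    if [ -n "${NCBI_EMAIL:-}" ]; then
        printf '%s\n' "--email=$NCBI_EMAIL"
    fi
    if [ -n "${NCBI_API_KEY:-}" ]; then
        printf '%s\n' "--api_key=$NCBI_API_KEY"
    fi
}

# Example (not run here); the unquoted expansion relies on the values
# containing no whitespace:
#   scons $(entrez_options) update manuscript
```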

organism

A catalogue of human histones is generated by default. Other organisms can be specified using the `--organism' option. This is heavily dependent on the annotation state of the organism's reference genome and has only been tested by us for human and mouse.

scons --organism='mus musculus'

Error downloading data

When you're downloading fresh data from the NCBI servers, you may come across the following (or similar) error:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Response Error
Bad Request
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:449
STACK: Bio::DB::GenericWebAgent::get_Response /usr/share/perl5/Bio/DB/GenericWebAgent.pm:215
STACK: Bio::DB::EUtilities::get_Response /usr/share/perl5/Bio/DB/EUtilities.pm:34
STACK: main::analyze_entrez_genes /usr/bin/bp_genbank_ref_extractor:239
STACK: main::analyze_entrez_genes /usr/bin/bp_genbank_ref_extractor:312
STACK: /usr/bin/bp_genbank_ref_extractor:177

If you do, try again, either at the weekend or during the night in the US Eastern time zone. Also make sure you specify an email address and a valid NCBI E-utilities API key.
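
Retrying can also be scripted. The following is a minimal shell sketch of a retry wrapper with a doubling delay, assuming that scons exits with a non-zero status when the download fails; the function name retry and the attempt and delay values in the example are illustrative choices, not project recommendations.

```shell
# Run a command up to $1 times. After each failed attempt, wait $2 seconds
# and double the wait for the next round. Returns 0 on the first success,
# otherwise the exit status of the last failed attempt.
# Usage: retry ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        status=$?
        if [ "$n" -ge "$attempts" ]; then
            return "$status"
        fi
        echo "attempt $n of $attempts failed (exit $status); retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        n=$((n + 1))
    done
}

# Example (not run here): three attempts, waiting 10 and then 20 minutes.
#   retry 3 600 scons --email='example@domain.top' update catalogue
```

Long delays keep the retries well within the request limits described under the email option.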

Installing Linux and software dependencies

While it is technically possible to build this on Windows or macOS, it is far easier on Linux. If you do not have a Linux system available, the histone-catalogue can be built using Ubuntu on a USB stick or in VirtualBox on a PC or Mac.

Dependencies

A number of software components are required to build the histone catalogue:

A complete list of the required Perl modules and LaTeX packages is available via `scons -h'.

If you are on Debian Linux, or a Debian derivative such as Ubuntu, these are pre-packaged for you. See the following instructions:

Directory structure