Proteogenomics / trackhub-creator

Trackhub creation tool
Apache License 2.0

How to set up the application

First, check out the application code using the --recursive flag:

git clone --recursive <repo_url>

This repository uses submodules. Some of them require access rights to, at least, deploy the module code; without those rights, some of the pipelines shipped with this application will not work.

Once the source code has been checked out, the Makefile shipped with this software does most of the DevOps-related heavy lifting.

There are two main installation targets:

And two main cleaning (all) targets:

Using the Pipelines shipped with the application

To run any pipeline shipped with the application (or added to it), issue the following command from the root folder of the application:

time python_install/bin/python main_app.py -a pipeline_cmd_param1=pipeline_cmd_param1_value,...,pipeline_cmd_paramN=pipeline_cmd_paramN_value <pipeline_name>

This command will time the execution of the application, using the application's Python virtual environment to run the given pipeline with the given command line key=value parameters.
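As an illustration of how the -a argument is put together, the following minimal sketch builds the same command line from a dictionary of parameters. The function name build_pipeline_command is hypothetical and not part of the application; the example values are the ones used for the Ensembl data collector pipeline in this README. The leading time is omitted because it only measures the run.

```python
import shlex


def build_pipeline_command(pipeline_name, params,
                           python_bin="python_install/bin/python"):
    """Return the argument list for running a pipeline (hypothetical helper)."""
    # -a takes a single string of key=value pairs separated by commas
    param_string = ",".join(f"{key}={value}" for key, value in params.items())
    return [python_bin, "main_app.py", "-a", param_string, pipeline_name]


command = build_pipeline_command(
    "ensembl_data_collector",
    {"ncbi_taxonomy_ids": "10090,9606"},
)
# Print the command as it would be typed in a shell
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run from the root folder of the application; shlex.join requires Python 3.8 or later.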

The following pipelines are shipped with the application:

Ensembl Data Collector Pipeline

Other pipelines shipped with this application, e.g. create_trackhub_for_project, use Ensembl protein sequence and genome reference files as part of the trackhub creation process. These files are mirrored locally in the application from the latest Ensembl release. As the same application can be running different pipelines in parallel, using this pipeline is recommended in order to avoid race conditions when mirroring those files.

This pipeline mirrors protein sequence and genome reference files from Ensembl for the given list of NCBI Taxonomy IDs, e.g. mouse (10090) and human (9606), as shown below.

time python_install/bin/python main_app.py -a ncbi_taxonomy_ids=10090,9606 ensembl_data_collector 

Those files will be made locally available at

resources/ensembl/release-XX

within the application folder, where XX is the latest Ensembl Release Number.
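To check from a script which release has been mirrored, a minimal sketch along the following lines could look for the highest-numbered release-XX folder. The helper latest_local_ensembl_release is hypothetical, and it assumes that more than one release folder may be present under resources/ensembl.

```python
import re
from pathlib import Path


def latest_local_ensembl_release(app_root):
    """Return the highest-numbered resources/ensembl/release-XX folder, or None."""
    ensembl_dir = Path(app_root) / "resources" / "ensembl"
    releases = []
    if ensembl_dir.is_dir():
        for entry in ensembl_dir.iterdir():
            match = re.fullmatch(r"release-(\d+)", entry.name)
            if match and entry.is_dir():
                releases.append((int(match.group(1)), entry))
    if not releases:
        return None
    # Compare release numbers as integers so release-100 ranks above release-99
    return max(releases, key=lambda item: item[0])[1]
```

Calling latest_local_ensembl_release(".") from the application folder returns a pathlib.Path, or None if nothing has been mirrored yet.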

There is a launch script specific to PRIDE data that collects Ensembl data for all the taxonomies present in PRIDE. It can be found at

scripts/ensembl_data_collector

and it can be launched either straight away or as an HPC job

scripts/ensembl_data_collector/launch_pipeline_for_pride_taxonomies.sh 

PRIDE Cluster Export Pipeline

This pipeline creates and registers / updates a trackhub for PRIDE Cluster data.

It is launched by the following script

scripts/pride-cluster-export/ebi-lsf-launch-pipeline.sh

straight away or as a job on the HPC environment.

It creates a subfolder named 'YYYY-MM' at the PRIDE Cluster Trackhubs FTP, reflecting the year and month of the trackhub creation, and updates a 'latest' link that points to the most recently created trackhub for PRIDE Cluster.
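The folder naming and link update described above can be sketched as follows. This is only an illustration, not the pipeline's own code: the function create_monthly_release is hypothetical, and it assumes that 'latest' is a symbolic link living next to the 'YYYY-MM' folders.

```python
import os
from datetime import date
from pathlib import Path


def create_monthly_release(base_dir, today=None):
    """Create base_dir/YYYY-MM and point base_dir/latest at it (hypothetical)."""
    today = today or date.today()
    release_dir = Path(base_dir) / today.strftime("%Y-%m")
    release_dir.mkdir(parents=True, exist_ok=True)

    latest = Path(base_dir) / "latest"
    # Remove the previous link; is_symlink() is also true for dangling links
    if latest.is_symlink():
        latest.unlink()
    # A relative target keeps the link valid if base_dir is moved as a whole
    os.symlink(release_dir.name, latest, target_is_directory=True)
    return release_dir
```

Running it twice in different months leaves both monthly folders in place and moves 'latest' to the newer one.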

More information on the process of creating a trackhub for PRIDE Cluster can be found here.

PRIDE Project Trackhub Creation Pipeline

This pipeline creates a trackhub for the given PRIDE project. It is launched by the script

scripts/create_trackhub_for_project/launch_pipeline_for_project.sh

and the only parameter it needs is the absolute path to a JSON formatted file that contains all the information related to the project being processed and the trackhub that is going to be created, e.g. title, long and short description, etc.

The following is a sample project description file content passed to this pipeline as a parameter

{
  "trackHubName" : "PXD000625",
  "trackHubShortLabel" : "<a href=\"http://www.ebi.ac.uk/pride/archive/projects/PXD000625\">PXD000625</a> - Hepatoc...",
  "trackHubLongLabel" : "Experimental design For the label-free ...",
  "trackHubType" : "PROTEOMICS",
  "trackHubEmail" : "pride-support@ebi.ac.uk",
  "trackHubInternalAbsolutePath" : "...",
  "trackhubCreationReportFilePath": "...",
  "trackMaps" : [ {
    "trackName" : "PXD000625_10090_Original",
    "trackShortLabel" : "<a href=\"http://www.ebi.ac.uk/pride/archive/projects/PXD000625\">PXD000625</a> - Mus musc...",
    "trackLongLabel" : "Experimental design For the label-free proteome analysis 17 mice were used composed of 5 ...",
    "trackSpecies" : "10090",
    "pogoFile" : "..."
  } ]
}
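Before launching the pipeline, it can be handy to check that a project description file has the same keys as the sample above. The following is a minimal sketch; the helper names are hypothetical, and whether the pipeline treats every one of these keys as mandatory is an assumption.

```python
import json

# Keys copied from the sample description; treating all of them as
# mandatory is an assumption, not something the pipeline is known to enforce.
HUB_KEYS = (
    "trackHubName", "trackHubShortLabel", "trackHubLongLabel", "trackHubType",
    "trackHubEmail", "trackHubInternalAbsolutePath",
    "trackhubCreationReportFilePath", "trackMaps",
)
TRACK_KEYS = (
    "trackName", "trackShortLabel", "trackLongLabel", "trackSpecies", "pogoFile",
)


def load_project_description(path):
    """Read a JSON project description file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_keys(description):
    """Return the sample keys that are absent from a project description dict."""
    missing = [key for key in HUB_KEYS if key not in description]
    for index, track in enumerate(description.get("trackMaps", [])):
        missing += [
            f"trackMaps[{index}].{key}" for key in TRACK_KEYS if key not in track
        ]
    return missing
```

An empty list from missing_keys(load_project_description(path)) means the file has the same shape as the sample.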

trackhubCreationReportFilePath points to a file where the pipeline, once it has finished running, dumps a JSON formatted report on the trackhub creation process, as shown in the sample below.

{
"status": "SUCCESS", 
"success_messages": [], 
"warning_messages": [], 
"error_messages": [],
"pipeline_session_working_dir": "...", 
"log_files": [], 
"hub_descriptor_file_path": "..."
}
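A caller can inspect this report once the pipeline has finished. The following minimal sketch parses the report text and extracts the fields shown in the sample. The function summarize_report is hypothetical, and "SUCCESS" is the only status value documented here, so any other value is simply treated as not successful.

```python
import json


def summarize_report(report_text):
    """Extract the outcome of a trackhub creation report (hypothetical helper)."""
    report = json.loads(report_text)
    return {
        # Only "SUCCESS" appears in the sample; other values count as failure
        "succeeded": report.get("status") == "SUCCESS",
        "errors": report.get("error_messages", []),
        "warnings": report.get("warning_messages", []),
        "hub_descriptor_file_path": report.get("hub_descriptor_file_path"),
    }
```

The report text would typically be read from the file that trackhubCreationReportFilePath points to.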

Trackhub Publishing / Registering / Update Pipeline

This pipeline registers a trackhub at Trackhub Registry and it can be launched by the script at

scripts/publish_trackhub/publish_trackhub.sh

providing the following parameters

{
    "trackhubUrl": "http://host.com/hub.txt",
    "publicVisibility": "1",
    "type": "PROTEOMICS",
    "pipelineReportFilePath": "pipeline.report"
}

where

{
"status": "...", 
"success_messages": [],
"warning_messages": [], 
"error_messages": [], 
"pipeline_session_working_dir": "...", 
"trackhub_url": "...", 
"log_files": [], 
"trackhub_registration_analysis": []
}

Final Notes

The default Trackhub Registry service used by the pipelines is the one at www.trackhubregistry.org.

For more detailed documentation, please refer to the wiki pages of this repository.

Contact

Manuel Bernal Llinares