CGATOxford / CGATPipelines

Collection of CGAT NGS Pipelines
MIT License
Transparent access to data repositories #233

Open IanSudbery opened 8 years ago

IanSudbery commented 8 years ago

Create routines that automate the downloading and processing of remote data from data repositories. This issue will serve as a summary of work done, preliminary documentation, and proposals.

Requirements

Reprocessing data from large consortia often involves downloading, renaming and storing a large number of large data files. This is tedious and error prone, and can take a very large amount of disk space (TCGA raw data is nearly 100TB).

Specifics

Instead of placing input files in the current or input directory, files with the extension .remote are used. These files are named in the same way as normal input files (i.e. TISSUE-CONDITION-REPLICATE or similar), but contain the details needed to access the data file in a remote repository.

The data file itself is downloaded to the execution node, processed and deleted. The raw data is never permanently stored, which reduces the disk space needed for processing large amounts of remote data.
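As a rough illustration of the download, process, delete cycle (not the code in the branch), the pattern can be sketched with a context manager that removes its scratch directory whether or not processing succeeds. The names transient_download, fake_fetch and count_reads are hypothetical, and fake_fetch only writes a placeholder file instead of contacting a repository.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def transient_download(fetch, accession):
    """Fetch `accession` into a scratch directory, yield the local path,
    and delete the directory afterwards, even if processing fails.

    `fetch` is a hypothetical callable: (accession, outdir) -> local path.
    """
    scratch = tempfile.mkdtemp(prefix="remote_")
    try:
        yield fetch(accession, scratch)
    finally:
        shutil.rmtree(scratch, ignore_errors=True)


def fake_fetch(accession, outdir):
    # Stand-in for a real download: writes a one-read placeholder fastq.
    path = os.path.join(outdir, accession + ".fastq")
    with open(path, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")
    return path


def count_reads(accession):
    """Process the file while it exists, then report whether it survived."""
    with transient_download(fake_fetch, accession) as path:
        with open(path) as handle:
            n_reads = sum(1 for _ in handle) // 4
    return n_reads, os.path.exists(path)
```

Only the result of the processing step outlives the block; the raw file is gone by the time count_reads returns.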

.remote files

Each .remote file contains a table with two or three columns.

  1. The first column contains the repository from which to download the data. Currently supported values are SRA, ENA and TCGA.
  2. The second column contains the accession of the file (e.g. SRR1016916 or ERR000916 for SRA/ENA, or 21ae315a-a823-40c4-8145-ff5260af3084 for TCGA).
  3. The third column is used only for TCGA files, and is the name of the downloaded file (for some reason TCGA saw fit not to give the files the same name as the accession).
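To make the layout concrete, here is a minimal sketch of a parser for one line of such a table. It assumes the columns are separated by whitespace (the delimiter is not specified above), and RemoteEntry and parse_remote_line are hypothetical names, not functions from the branch.

```python
from collections import namedtuple

# Hypothetical record type for one row of a .remote file.
RemoteEntry = namedtuple("RemoteEntry", ["repository", "accession", "filename"])

SUPPORTED = ("SRA", "ENA", "TCGA")


def parse_remote_line(line):
    """Parse one line of a .remote file (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) not in (2, 3):
        raise ValueError("expected 2 or 3 columns, got %d" % len(fields))
    repository, accession = fields[0], fields[1]
    if repository not in SUPPORTED:
        raise ValueError("unsupported repository: %s" % repository)
    # The third column (downloaded file name) is only meaningful for TCGA.
    filename = fields[2] if len(fields) == 3 else None
    if repository == "TCGA" and filename is None:
        raise ValueError("TCGA entries need a third column (file name)")
    return RemoteEntry(repository, accession, filename)
```

For example, parse_remote_line("ENA\tERR000916") gives an entry with no file name, while a TCGA line without a third column is rejected.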

Security

Downloading of secure/encrypted data is currently supported for SRA and TCGA.

For SRA, the pipeline must be executed in a directory underneath the directory set up as the user's secure NCBI workspace.

For TCGA the pipeline will look for a file matching the glob gdc-user-token* in the pipeline directory.
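The token lookup could be sketched as below. This is an assumption about the behaviour, not the branch's code: find_gdc_token is a hypothetical helper, and taking the first match in sorted order is my own choice for the case where several files match.

```python
import glob
import os


def find_gdc_token(pipeline_dir="."):
    """Return the first file matching gdc-user-token* in pipeline_dir, or None."""
    # glob.escape guards against special characters in the directory name.
    pattern = os.path.join(glob.escape(pipeline_dir), "gdc-user-token*")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None
```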

Repositories

Both SRA and ENA support download via ascp, a high-speed download protocol. SRA only supports the download of SRA files this way; these are reference compressed and must be extracted. Further, these files are downloaded to the user's SRA-cache directory, meaning that they are not automatically removed when they are finished with, but must be deleted with a call to Sra.clean_cache() or by running cache-mgr --clean. Dumped fastq files are removed automatically, and SRA files are reference compressed, so they are smaller than fastq.gz. Caching does mean that if several tasks use the same file it will only be downloaded once.

ENA supports high-speed download of fastq files. All public files on SRA are also on ENA, so if you are downloading public data, ENA is generally preferred. It is envisaged that SRA will mainly be used for encrypted data.

TCGA does not support ascp, but currently does support fastq download (although this is in danger of being discontinued in favor of BAM only).

My recommendation is to use ENA where possible.

Implementation

Most of these features are implemented through additions to the preprocess method of the base SequenceCollectionProcessor class, and so should function transparently with any pipeline that uses the PipelineMapping or PipelinePreprocess classes: simply pass .remote files as the infiles to the build method.

Additions have also been made to CGAT.Sra, which now includes prefetch, clean_cache, fetch_ENA, fetch_ENA_files(names) and fetch_TCGA_fastq methods. As this module no longer specifically deals with Sra, perhaps these functions should be moved or the module renamed.

A small number of changes need to be made to pipelines for these to work, mostly in recognizing input files. So far this has been done for pipeline_readqc and pipeline_mapping. It should probably be implemented for pipeline_transcriptdiffexpression shortly.

ENA download currently requires the installation of Aspera's ascp, and the setting of the environment variables $ASCP_BIN_PATH and $ASCP_KEY_PATH. SRA download is very much sped up by the installation of ascp.
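A quick pre-flight check for the two variables might look like the sketch below. missing_ascp_settings is a hypothetical helper; it only checks that the variables are set and non-empty, not that the paths exist.

```python
import os


def missing_ascp_settings(env=None):
    """Return the names of required ascp variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("ASCP_BIN_PATH", "ASCP_KEY_PATH")
    return [name for name in required if not env.get(name)]
```

Calling missing_ascp_settings() before starting an ENA download would let a pipeline fail early with a clear message instead of part way through a run.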

TCGA download requires installation of gdc-client from the genomic data commons.

IanSudbery commented 8 years ago

The branches sudlab/CGATPiplines/IS_remote_access and sudlab/cgat/IS_remote_access_upstream_ready have the first implementations of this.

sebastian-luna-valero commented 8 years ago

Many thanks!

It all looks good to me, but I would be grateful if @CGATOxford/contributors had a look as well before merging.

AndreasHeger commented 8 years ago

@IanSudbery, @sebastian-luna-valero, I think this is a great capability to have, many thanks!