Open IanSudbery opened 8 years ago
Branches sudlab/CGATPiplines/IS_remote_access and sudlab/cgat/IS_remote_access_upstream_ready have the first implementations of this.
Many thanks!
It all looks good to me, but I would be grateful if @CGATOxford/contributors had a look as well before merging.
@IanSudbery , @sebastian-luna-valero, I think this is a great capability to have, many thanks!
Create routines that automate the downloading and processing of remote data from data repositories. This issue will serve as a summary of work done, preliminary documentation, and proposals.
Requirements
Reprocessing data from large consortia often involves downloading, renaming and storing a large number of large data files. This is tedious and error-prone, and can take a very large amount of disk space (TCGA raw data is nearly 100TB).
Specifics
Remote Access
Instead of placing input files in the current or input directory, files named .remote are used. These files are named the same way as normal input files (i.e. TISSUE-CONDITION-REPLICATE or similar), but contain details for accessing the file from a remote repository. The data file itself is downloaded to the execution node, processed and deleted - the raw data is never permanently stored, reducing the disk space needed for processing large amounts of remote data.
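The download, process and delete cycle described above could be sketched roughly as follows. This is not the code on the branches: `read_remote`, `fetched` and the `fetch` callable are hypothetical names, and the sketch assumes a .remote file is a tab-separated, two-column table holding a repository name and an identifier.

```python
import csv
import shutil
import tempfile
from contextlib import contextmanager


def read_remote(path):
    """Return (repository, identifier) pairs from a .remote file.

    ASSUMPTION: two tab-separated columns, repository first, identifier second.
    """
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    return [(row[0], row[1]) for row in rows]


@contextmanager
def fetched(repository, identifier, fetch):
    """Download into a scratch directory and remove it when the block exits.

    `fetch` is a hypothetical callable (repository, identifier, dest_dir)
    that downloads the data and returns the local file path.
    """
    scratch = tempfile.mkdtemp(prefix="remote_")
    try:
        yield fetch(repository, identifier, scratch)
    finally:
        # The raw data is never kept, even if processing raised an error.
        shutil.rmtree(scratch, ignore_errors=True)
```

A task would then wrap its processing step in `with fetched(repo, ident, fetch) as local_path:`, so the downloaded file only exists for the duration of that step.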
.remote files
Each .remote file contains a two-column table.
Supported repositories: SRA, ENA and TCGA.
Security
Downloading of secure/encrypted data is currently supported for SRA and TCGA.
For SRA the pipeline must be executed in a directory underneath the directory set up as the user's secure NCBI workspace.
For TCGA the pipeline will look for a file matching the glob gdc-user-token* in the pipeline directory.
Repositories
Both SRA and ENA support download via ascp, a high speed download protocol.
SRA only supports the download of SRA files this way; these are reference compressed and must be extracted. Further, these files are downloaded to the user's SRA cache directory, meaning that they are not automatically removed when they are finished with, but must be deleted with a call to Sra.clean_cache() or by running cache-mgr --clean. Dumped fastq files are automatically removed, and SRA files are reference compressed - so smaller than fastq.gz. It does mean that if several tasks use the same file it will only be downloaded once.
ENA supports high-speed download of fastq files. All public files on SRA are also on ENA, so if you are downloading public data, ENA is generally preferred. It is envisaged that SRA will mainly be used for encrypted data.
TCGA does not support ascp, but currently does support fastq download (although this is in danger of being discontinued in favor of BAM only).
My recommendation is to use ENA where possible.
Implementation
Most of these features are implemented through additions to the preprocess method of the base SequenceCollectionProcessor class, and so should function transparently with any pipeline that uses PipelineMapping or PipelinePreprocess classes, simply passing .remote files as the infiles to the build method.
Additions have also been made to CGAT.Sra, which now includes prefetch, clean_cache, fetch_ENA, fetch_ENA_files (names) and fetch_TCGA_fastq methods. As this module no longer specifically deals with Sra, perhaps these functions should be moved or the module renamed.
A small number of changes need to be made to pipelines for these to work, mostly in recognizing input files. So far this has been done for pipeline_readqc and pipeline_mapping. It should probably be implemented for pipeline_transcriptdiffexpression shortly.
Requirements
ENA download currently requires the installation of Aspera's ascp, and the setting of the environment variables $ASCP_BIN_PATH and $ASCP_KEY_PATH. SRA download is very much sped up by the installation of ascp.
TCGA download requires installation of gdc-client from the Genomic Data Commons.
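A pipeline could check these requirements up front, before any task starts downloading. The sketch below is only an illustration: `check_remote_requirements` and its `secure` flag are hypothetical and not part of CGAT.Sra. It checks that the two environment variables are set without interpreting their values, and it assumes the gdc-user-token* file described under Security is only needed for secure data.

```python
import glob
import os
import shutil


def check_remote_requirements(repository, pipeline_dir=".", secure=False,
                              environ=None):
    """Return a list of problems; an empty list means nothing is missing.

    Hypothetical helper, not part of CGAT.Sra.
    """
    environ = os.environ if environ is None else environ
    problems = []
    if repository == "ENA":
        # ENA download needs ascp plus both environment variables.
        for name in ("ASCP_BIN_PATH", "ASCP_KEY_PATH"):
            if not environ.get(name):
                problems.append("environment variable %s is not set" % name)
    elif repository == "TCGA":
        if shutil.which("gdc-client", path=environ.get("PATH")) is None:
            problems.append("gdc-client not found on PATH")
        # ASSUMPTION: the token is only required for secure data.
        token_glob = os.path.join(pipeline_dir, "gdc-user-token*")
        if secure and not glob.glob(token_glob):
            problems.append("no file matching gdc-user-token* in %s"
                            % pipeline_dir)
    # SRA: ascp only speeds the download up, so there is nothing to enforce.
    return problems
```

Calling `check_remote_requirements("ENA")` at pipeline start and stopping if the list is non-empty would surface a missing ascp configuration before any cluster job is submitted.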