
Population Genetics on Google Cloud Platform

GCP-PopGen Processing Pipeline

This repository contains a number of Apache Beam pipeline configurations for processing populations of genomes. The populations added to this repository comprise thousands of individuals; internal tests indicate that scaling to tens of thousands of individuals should be feasible.

The core pipeline itself is implemented in the dataflow-genomics repository and covers the "secondary" and "tertiary" portions of genome sequence analysis, i.e. the pipeline processes DNA FASTQ files through to VCF files against provided FASTA references.

Processing includes the following steps (a rough single-machine sketch of steps 3-5 follows the list):

  1. Reading sample metadata from CSV files. See an example here: example-metadata.csv. Metadata is encoded using the SRA reads format; you can override this for your own sample metadata needs, see com.google.allenday.genomics.core.model.SampleMetaData.
  2. Downloading run FASTQ files and FASTA reference sequences.
  3. Aligning FASTQ files to create SAM files.
  4. Merging, sorting, and indexing same-sample run SAM files into sorted, merged BAM files.
  5. Calling variants on BAM files with DeepVariant, producing VCF files.
  6. Loading VCF files into BigQuery as the variants schema, using gcp-variant-transforms.
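
For orientation, steps 3-5 correspond roughly to the following single-machine commands. This is only an illustrative sketch with made-up file names, not how the pipeline invokes these stages (it runs them as distributed Beam transforms):

# Step 3: align paired-end FASTQ to a reference ("sr" = short-read preset).
minimap2 -ax sr reference.fa run_1.fastq run_2.fastq > run.sam

# Step 4: sort per-run alignments, then merge and index per sample.
samtools sort -o run1.sorted.bam run1.sam
samtools sort -o run2.sorted.bam run2.sam
samtools merge sample.bam run1.sorted.bam run2.sorted.bam
samtools index sample.bam

# Step 5: call variants with DeepVariant via its public Docker image
# (reference.fa must already be faidx-indexed).
docker run -v "$(pwd)":/data google/deepvariant:1.6.0 \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref=/data/reference.fa \
    --reads=/data/sample.bam \
    --output_vcf=/data/sample.vcf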

PopGen projects

You can find CSV files with project-, sample-, and run-level metadata in the following project directories:

SRA metadata example

Assay_Type  AvgSpotLen  BioSample   Center_Name DATASTORE_provider  DATASTORE_region    Experiment  InsertSize  Instrument  LibraryLayout   LibrarySelection    LibrarySource   Library_Name    LoadDate    MBases  MBytes  Organism    Platform    ReleaseDate Run SRA_Sample  Sample_Name cultivar    dev_stage   geo_loc_name    tissue  BioProject  BioSampleModel  Consent DATASTORE_filetype  SRA_Study   age
WGS 100 SAMN00738286    UNIVERSITY OF TORONTO   gs s3 sra-sos   gs.US s3.us-east-1 sra-sos.be-md sra-sos.st-va  SRX100361   0   Illumina HiSeq 2000 SINGLE  size fractionation  GENOMIC CS-US_SIL-1 2012-01-21  5205    3505    Cannabis sativa ILLUMINA    2011-10-12  SRR351492   SRS266493   CS-USO31-DNA    USO-31  missing missing young leaf  PRJNA73819  Plant   public  sra SRP008673   missing
WGS 100 SAMN00738286    UNIVERSITY OF TORONTO   gs s3 sra-sos   gs.US s3.us-east-1 sra-sos.be-md sra-sos.st-va  SRX100368   0   Illumina HiSeq 2000 SINGLE  size fractionation  GENOMIC CS-US_SIL-2 2012-01-21  2306    1499    Cannabis sativa ILLUMINA    2011-10-12  SRR351493   SRS266493   CS-USO31-DNA    USO-31  missing missing young leaf  PRJNA73819  Plant   public  sra SRP008673   missing
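
A metadata CSV like this can be produced from a BioProject accession with NCBI Entrez Direct. A sketch, assuming esearch/efetch are installed (note the runinfo export's columns differ somewhat from the Run Selector export shown above):

esearch -db sra -query PRJNA73819 | efetch -format runinfo > sra_project_1.csv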

Preparing data

PROJECT_NAME=(specify_project_name)
SRC_BUCKET_NAME=${PROJECT_NAME}-src
WORKING_BUCKET_NAME=${PROJECT_NAME}-working

The following steps must be performed before you can start data processing (a shell sketch follows the list):

  1. Create the SRC_BUCKET_NAME GCS bucket.
  2. Create the WORKING_BUCKET_NAME GCS bucket.
  3. Upload the annotation CSV to SRC_BUCKET_NAME.
  4. Retrieve the source FASTQ files (e.g. from the NCBI SRA archive) and upload them to SRC_BUCKET_NAME.
  5. Retrieve the reference genomes (.fasta or .fa), e.g. from the NCBI Assembly archive.
  6. Index the reference genomes with samtools and minimap2.
  7. Upload all reference genome files and indexes to SRC_BUCKET_NAME.
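
A sketch of steps 1-7 using gsutil, sra-tools, samtools, and minimap2 (the region, accessions, and file names are illustrative, and the exact set of index files the pipeline expects may differ from the ones produced here):

# Steps 1-2: create the source and working GCS buckets.
gsutil mb -l us-central1 gs://${SRC_BUCKET_NAME}
gsutil mb -l us-central1 gs://${WORKING_BUCKET_NAME}

# Step 3: upload the annotation CSV.
gsutil cp sra_project_1.csv gs://${SRC_BUCKET_NAME}/sra_csv/

# Step 4: fetch run FASTQ files from the SRA (sra-tools) and upload them.
prefetch SRR351492
fasterq-dump --split-files SRR351492
gsutil cp SRR351492_*.fastq gs://${SRC_BUCKET_NAME}/fastq/SRP008673/SRS266493/

# Steps 5-7: index the reference, then upload it together with its indexes.
samtools faidx reference.fa              # produces reference.fa.fai
minimap2 -d reference.mmi reference.fa   # produces a minimap2 index
gsutil cp reference.fa reference.fa.fai reference.mmi \
    gs://${SRC_BUCKET_NAME}/reference/(reference_name)/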

As a result, SRC_BUCKET_NAME will have the following structure:

+ ${SRC_BUCKET_NAME}/
    + sra_csv/
        - sra_project_1.csv
        - sra_project_2.csv
        - ...
    + fastq/
        + (sra_study_name)/
            + (sra_sample_name)/
                - (sra_read_name_1)_1.fastq
                - (sra_read_name_1)_2.fastq
                - (sra_read_name_2)_1.fastq
                - (sra_read_name_2)_2.fastq
                - (sra_read_name_3)_1.fastq
    + reference/
        + (reference_name)/
            - (reference_name).fa
            - (reference_name).fa.fai

And the following structure will be created in the WORKING_BUCKET_NAME GCS bucket after processing:

+ ${WORKING_BUCKET_NAME}/
    + processing_output/
        + (processing_date)/
            + result_aligned_bam/
                - (read_name_1)_(reference_name).bam
            + result_sorted_bam/
                - (read_name_1)_(reference_name).bam
            + result_merged_bam/
                - (sample_name_1)_(reference_name).bam
                - (sample_name_1)_(reference_name).bam.bai
            + result_dv/
                + (sample_name_1)_(reference_name)/
                    - (sample_name_1)_(reference_name).vcf
                    - (log_files)
    + dataflow/
        + temp/
            - (dataflow_job_temp_files)
        + staging/
            - (dataflow_job_staging_files)

Running example

Per-sample (fastq => bigquery) processing job:

mvn clean package
java -cp target/gcp-popgen-0.0.2.jar \
    com.google.allenday.popgen.NanostreamBatchApp \
        --project=cannabis-3k \
        --region="us-central1" \
        --workerMachineType=n1-standard-4 \
        --maxNumWorkers=20 \
        --numWorkers=5 \
        --runner=DataflowRunner \
        --srcBucket=cannabis-3k \
        --stagingLocation="gs://dataflow-temp-and-staging/staging/" \
        --tempLocation="gs://dataflow-temp-and-staging/temp/" \
        --inputCsvUri="gs://cannabis-3k/sra_csv_index/converted/*" \
        --sraSamplesToFilter="SRS1098403" \
        --outputGcsUri="gs://cs10-run1-results/processing_output/" \
        --referenceNamesList="AGQN03","MNPR01" \
        --allReferencesDirGcsUri="gs://cannabis-3k/reference/" \
        --memoryOutputLimit=0 \
        --controlPipelineWorkerRegion="us-west2" \
        --stepsWorkerRegion="us-central1" \
        --makeExamplesWorkers=4 \
        --makeExamplesCoresPerWorker=16 \
        --makeExamplesRamPerWorker=60 \
        --makeExamplesDiskPerWorker=90 \
        --callVariantsCoresPerWorker=16 \
        --callVariantsRamPerWorker=60 \
        --callVariantsDiskPerWorker=60 \
        --deepVariantShards=4 \
        --exportVcfToBq=True \
        --vcfBqDatasetAndTablePattern="cannabis_3k_results_test.GENOMICS_VARIATIONS_%s"
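
After the export finishes, the resulting tables can be inspected with the bq CLI. A minimal sketch, assuming the standard gcp-variant-transforms variants schema and that the %s in the table pattern is filled with the reference name:

bq query --use_legacy_sql=false \
    'SELECT reference_name, start_position, reference_bases
     FROM `cannabis_3k_results_test.GENOMICS_VARIATIONS_AGQN03`
     LIMIT 10'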

Per-sample (fastq => vcf) processing job:

# Simply set the parameter `exportVcfToBq` of the fastq=>bigquery pipeline to False:
--exportVcfToBq=False

General usage

Running

Use the following commands to process a dataset:

mvn clean package
java -cp target/gcp-popgen-(current_version).jar \
    com.google.allenday.popgen.NanostreamBatchApp \
        --project=(gcp_project_id) \
        --region=(gcp_worker_machines_region) \
        --workerMachineType=(gcp_worker_machines_type) \
        --maxNumWorkers=(max_number_of_workers) \
        --numWorkers=(start_number_of_workers) \
        --runner=(apache_beam_runner) \
        --stagingLocation=(dataflow_gcs_staging_location) \
        --tempLocation=(dataflow_gcs_temp_location) \
        --srcBucket=(all_source_data_gcs_bucket) \
        --inputCsvUri=(gcs_uri_of_sra_csv_file) \
        --outputGcsUri=(gcs_uri_for_writing_results) \
        --referenceNamesList=(list_of_reference_names_that_will_be_used) \
        --allReferencesDirGcsUri=(gcs_uri_of_dir_with_references)

The --inputCsvUri value may be a pattern with wildcards (*).
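For a quick local smoke test it should also be possible to set --runner=DirectRunner (Apache Beam's local runner) and drop the Dataflow-specific worker options; this is standard Beam behavior rather than a mode documented by this pipeline.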

If VCF results should be exported to BigQuery, add the following arguments:

        --exportVcfToBq=True \
        --vcfBqDatasetAndTablePattern=(bq_dataset_and_table_name_pattern)
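
The %s placeholder in the pattern is apparently substituted with the reference name, producing one table per reference, as in the GENOMICS_VARIATIONS_%s example above.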

The following optional parameters can also be added:

        # Use if only certain samples need to be processed:
        --sraSamplesToFilter=(list_of_sra_samples_to_process) \
        # Set 0 to store all intermediate files in GCS:
        --memoryOutputLimit=(max_file_size_to_pass_internally_between_transforms) \

Optional DeepVariant parameters:

        --controlPipelineWorkerRegion=(region_of_deep_variant_control_pipeline) \
        --stepsWorkerRegion=(region_of_deep_variant_steps_machines) \
        --makeExamplesWorkers=(number) \
        --makeExamplesCoresPerWorker=(number) \
        --makeExamplesRamPerWorker=(number) \
        --makeExamplesDiskPerWorker=(number) \
        --call_variants_workers=(number) \
        --callVariantsCoresPerWorker=(number) \
        --callVariantsRamPerWorker=(number) \
        --callVariantsDiskPerWorker=(number) \
        --postprocess_variants_cores=(number) \
        --postprocess_variants_ram_gb=(number) \
        --postprocess_variants_disk_gb=(number) \
        --preemptible=(bool) \
        --max_preemptible_tries=(number) \
        --max_non_preemptible_tries=(number) \
        --deepVariantShards=(number)