ENCODE-DCC / atac-seq-pipeline

ENCODE ATAC-seq pipeline
MIT License

Can replicate names be added to the QC reports? #203

Open nicolerg opened 4 years ago

nicolerg commented 4 years ago

Would it be possible to add replicate names (from input FASTQ file names) to the QC reports? Or is there already some setting to do this? At the moment, I cross-reference rep1, rep2, etc. in the QC report ("replicate" column of merged TSV file from qc2tsv) with the input JSON file to be able to link QC to sample names.

leepc12 commented 4 years ago

Either 1) rename your FASTQ to something like rep1.R1.fastq.gz, or 2) make a soft link to your original FASTQ under a different file name (e.g. rep1.R1.fastq.gz) and use the soft links in the input JSON.

nicolerg commented 4 years ago

So to be clear, using a soft link instead of a "real" file path results in the pipeline naming replicates by the file name? Is there any way to do this on Google Cloud, where I don't think it is possible to make symbolic links?

leepc12 commented 4 years ago

You cannot make a soft link on GCS. You need to rename it using gsutil mv.

nicolerg commented 4 years ago

I think there's some misunderstanding. My FASTQ files are already named how I would like them to appear in the QC report, e.g. 90045015504_R1.fastq.gz, where I would like the QC report to say 90045015504 instead of rep1. However, the QC JSONs always say rep1, rep2, etc., and do not reflect the FASTQ file name in any way. Renaming the files in GCS with gsutil mv so that they correspond to sample names does not make the JSON QC reports include sample names (instead of repN) either.

For example, part of the input JSON config file:

    "atac.fastqs_rep1_R1" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90045015504_R1.fastq.gz"
    ],

    "atac.fastqs_rep1_R2" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90045015504_R2.fastq.gz"
    ],

    "atac.fastqs_rep2_R1" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90027015504_R1.fastq.gz"
    ],

    "atac.fastqs_rep2_R2" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90027015504_R2.fastq.gz"
    ],

    "atac.fastqs_rep3_R1" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90135015504_R1.fastq.gz"
    ],

    "atac.fastqs_rep3_R2" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90135015504_R2.fastq.gz"
    ],

    "atac.fastqs_rep4_R1" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90117015504_R1.fastq.gz"
    ],

    "atac.fastqs_rep4_R2" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90117015504_R2.fastq.gz"
    ],

    "atac.fastqs_rep5_R1" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90010015504_R1.fastq.gz"
    ],

    "atac.fastqs_rep5_R2" : [
        "/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90010015504_R2.fastq.gz"
    ]

Part of the corresponding QC JSON files:

    "general": {
        "date": "2019-12-04 03:17:33",
        "title": "MoTrPAC PASS1A ATAC - gastroc",
        "description": "Rat-Gastrocnemius-Powder_phase1a_acute_female_0h",
        "pipeline_ver": "v1.5.4",
        "pipeline_type": "atac",
        "genome": "motrpac_rn6",
        "aligner": "bowtie2",
        "seq_endedness": {
            "rep1": {
                "paired_end": true
            },
            "rep2": {
                "paired_end": true
            },
            "rep3": {
                "paired_end": true
            },
            "rep4": {
                "paired_end": true
            },
            "rep5": {
                "paired_end": true
            }
        },
        "peak_caller": "macs2"
    },
    "align": {
        "samstat": {
            "rep1": {
                ...
            },
            "rep2": {
                ...
            },
            "rep3": {

As far as I can tell, the input FASTQ file names are not reflected anywhere in the output QC reports. This is true on both SCG and GCS. Right now I run a script that parses replicate names from the BAM files in the shard-? subdirectories of call-align and merges the resulting map with the QC report.

leepc12 commented 4 years ago

I see. This isn't possible with the current pipeline.

I'd recommend writing a simple Python script that converts the original qc.json to something like qc.named.json, replacing repN with whatever you want. You can parse the input JSON itself to get the mapping from repN to the corresponding sample name.
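
A minimal sketch of such a script, assuming the input JSON uses keys of the form atac.fastqs_repN_R1 as in the example above, and that the sample name is the part of the FASTQ basename before the first underscore. The function names (build_rep_map, rename_reps, convert) are hypothetical and not part of the pipeline.

```python
import json
import os
import re

# Matches input JSON keys such as "atac.fastqs_rep1_R1"
# (key pattern taken from the input JSON example in this thread).
REP_KEY = re.compile(r"^atac\.fastqs_(rep\d+)_R1$")


def build_rep_map(input_json):
    """Map repN -> sample name.

    Assumption: the sample name is the FASTQ basename up to the first
    underscore, e.g. 90045015504_R1.fastq.gz -> 90045015504.
    """
    rep_map = {}
    for key, fastqs in input_json.items():
        m = REP_KEY.match(key)
        if m and fastqs:
            basename = os.path.basename(fastqs[0])
            rep_map[m.group(1)] = basename.split("_")[0]
    return rep_map


def rename_reps(node, rep_map):
    """Recursively rename dict keys that are exactly a replicate label."""
    if isinstance(node, dict):
        return {rep_map.get(k, k): rename_reps(v, rep_map) for k, v in node.items()}
    if isinstance(node, list):
        return [rename_reps(v, rep_map) for v in node]
    return node


def convert(input_json_path, qc_json_path, out_path):
    """Write a copy of the QC JSON with repN keys replaced by sample names."""
    with open(input_json_path) as f:
        rep_map = build_rep_map(json.load(f))
    with open(qc_json_path) as f:
        qc = json.load(f)
    with open(out_path, "w") as f:
        json.dump(rename_reps(qc, rep_map), f, indent=4)
```

It could be called as convert("input.json", "qc.json", "qc.named.json"), where the three paths are placeholders. Only keys that are exactly a replicate label are renamed; keys that merely contain one, and string values, are left untouched.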

ljmills commented 3 years ago

Can you change the naming scheme for how the FASTQ files are specified in the input JSON? For example:

    atac.fastqs_rep1_R1 -> atac.fastqs_90045015504_R1
    atac.fastqs_rep1_R2 -> atac.fastqs_90045015504_R2

leepc12 commented 3 years ago

@ljmills Sorry, we can't change the naming scheme. You can manually replace strings in the QC reports (HTML and JSON) with a suitable regex (e.g. rep1 -> 90045015504).
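
One way to do that replacement is sketched below, assuming a repN-to-sample-name mapping is already at hand. The function name replace_rep_labels and the mapping are hypothetical. The pattern is written so that rep1 does not match inside rep10 or inside a longer word.

```python
import re

# A replicate label: "rep" plus digits, not preceded by a letter or digit
# and not followed by another digit.
REP_LABEL = re.compile(r"(?<![A-Za-z0-9])rep\d+(?!\d)")


def replace_rep_labels(text, rep_map):
    """Replace replicate labels found in rep_map; leave unknown labels as-is."""
    return REP_LABEL.sub(lambda m: rep_map.get(m.group(0), m.group(0)), text)


# Hypothetical mapping, using sample names from the example in this thread.
rep_map = {"rep1": "90045015504", "rep2": "90027015504"}
print(replace_rep_labels("<td>rep1</td><td>rep2</td>", rep_map))
```

This works on the raw text of either the HTML or the JSON report, so it also rewrites labels that appear inside longer strings. It is worth checking the output for unintended matches before relying on it.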