Open nicolerg opened 4 years ago
1) Rename your FASTQ to something like rep1.R1.fastq.gz.
2) Make a soft link (with a different file name, e.g. rep1.R1.fastq.gz) to your original FASTQ and use the soft links in the input JSON.
So to be clear, using a soft link instead of a "real" file path results in the pipeline naming replicates by the file name? Is there any way to do this on Google Cloud, where I don't think it is possible to make symbolic links?
You cannot make a soft link on GCS. You need to rename it using gsutil mv.
I think there's some misunderstanding. My FASTQ files are already named how I would like them to be named in the QC report, e.g. 90045015504_R1.fastq.gz, where I would like the QC report to say 9007458634 instead of rep1, but the QC JSONs always say rep1, rep2, etc., rather than reflecting the FASTQ file name in any way. Renaming files in GCS with gsutil mv so that they correspond to sample names does not result in the JSON QC reports including sample names (instead of repN) either.
For example, part of the input JSON config file:
"atac.fastqs_rep1_R1" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90045015504_R1.fastq.gz"
],
"atac.fastqs_rep1_R2" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90045015504_R2.fastq.gz"
],
"atac.fastqs_rep2_R1" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90027015504_R1.fastq.gz"
],
"atac.fastqs_rep2_R2" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90027015504_R2.fastq.gz"
],
"atac.fastqs_rep3_R1" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90135015504_R1.fastq.gz"
],
"atac.fastqs_rep3_R2" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90135015504_R2.fastq.gz"
],
"atac.fastqs_rep4_R1" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90117015504_R1.fastq.gz"
],
"atac.fastqs_rep4_R2" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90117015504_R2.fastq.gz"
],
"atac.fastqs_rep5_R1" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90010015504_R1.fastq.gz"
],
"atac.fastqs_rep5_R2" : [
"/projects/motrpac/PASS1A/ATAC/NOVASEQ_BATCH1/fastq_raw/90010015504_R2.fastq.gz"
]
Part of the corresponding QC JSON files:
"general": {
"date": "2019-12-04 03:17:33",
"title": "MoTrPAC PASS1A ATAC - gastroc",
"description": "Rat-Gastrocnemius-Powder_phase1a_acute_female_0h",
"pipeline_ver": "v1.5.4",
"pipeline_type": "atac",
"genome": "motrpac_rn6",
"aligner": "bowtie2",
"seq_endedness": {
"rep1": {
"paired_end": true
},
"rep2": {
"paired_end": true
},
"rep3": {
"paired_end": true
},
"rep4": {
"paired_end": true
},
"rep5": {
"paired_end": true
}
},
"peak_caller": "macs2"
},
"align": {
"samstat": {
"rep1": {
...
},
"rep2": {
...
},
"rep3": {
As far as I can tell, the input FASTQ file names are in no way reflected in the output QC reports.
This is true on both SCG and GCS. Right now I run a script to parse replicate names from the BAM file in the shard-? subdirectories of call-align and merge the resulting map with the QC report.
I see. This isn't possible with the current pipeline.
I'd recommend writing a simple Python script to convert the original qc.json to something like qc.named.json. Replace repN with whatever you want. You can parse your input JSON itself to get a mapping from repN to the corresponding sample name.
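A minimal sketch of such a script is below. It is not part of the pipeline; the function names (rep_to_sample, rename_keys, convert) are hypothetical, and it assumes the sample name is whatever precedes "_R1" in the first R1 FASTQ file name of each replicate, as in the input JSON shown earlier in this thread.

```python
import json
import os
import re

# Matches input JSON keys such as "atac.fastqs_rep1_R1" and captures "rep1".
REP_KEY = re.compile(r"^atac\.fastqs_(rep\d+)_R1$")


def rep_to_sample(input_json):
    """Build a {'repN': sample_name} map from the pipeline input JSON (a dict)."""
    mapping = {}
    for key, fastqs in input_json.items():
        match = REP_KEY.match(key)
        if match and fastqs:
            base = os.path.basename(fastqs[0])
            # Assumption: the sample name is the part of the file name before "_R1".
            mapping[match.group(1)] = base.split("_R1")[0]
    return mapping


def rename_keys(node, mapping):
    """Return a copy of a parsed QC JSON with keys equal to 'repN' renamed."""
    if isinstance(node, dict):
        return {mapping.get(k, k): rename_keys(v, mapping) for k, v in node.items()}
    if isinstance(node, list):
        return [rename_keys(v, mapping) for v in node]
    return node


def convert(input_json_path, qc_json_path, out_path):
    """Read the input JSON and qc.json, then write the renamed QC JSON."""
    with open(input_json_path) as f:
        mapping = rep_to_sample(json.load(f))
    with open(qc_json_path) as f:
        qc = json.load(f)
    with open(out_path, "w") as f:
        json.dump(rename_keys(qc, mapping), f, indent=4)
```

Calling convert("input.json", "qc.json", "qc.named.json") would write the renamed report (input.json is a placeholder file name). Only keys that are exactly repN are renamed; keys that merely contain a replicate label, if the report has any, are left untouched.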
Can you change the naming scheme for how you indicate the FASTQ files?
atac.fastqs_rep1_R1 -> atac.fastqs_90045015504_R1
atac.fastqs_rep1_R2 -> atac.fastqs_90045015504_R2
@ljmills Sorry, we can't change the naming scheme. You can manually replace strings in the QC reports (HTML and JSON) with a suitable regex (e.g. rep1 -> 90045015504).
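One way to do that string replacement without turning rep10 into a mangled rep1 match is sketched below. The function name rename_text and the pattern are illustrative assumptions, not something the pipeline provides; the mapping from repN to sample name has to be supplied by you.

```python
import re

# "rep" followed by digits, not preceded by a letter or digit, so that
# words like "prep1" are skipped. The digits are matched greedily, so
# "rep10" is looked up as "rep10" and never as "rep1" plus "0".
REP_PATTERN = re.compile(r"(?<![A-Za-z0-9])rep\d+(?![0-9])")


def rename_text(text, mapping):
    """Replace each repN label in a report's text using mapping.

    Labels that are not in the mapping are left unchanged.
    """
    return REP_PATTERN.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)
```

Because this works on raw text, it applies equally to the HTML and JSON reports, but it will also rewrite any repN that appears in free-text fields such as a title or description.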
Would it be possible to add replicate names (from input FASTQ file names) to the QC reports? Or is there already some setting to do this? At the moment, I cross-reference rep1, rep2, etc. in the QC report ("replicate" column of the merged TSV file from qc2tsv) with the input JSON file to be able to link QC to sample names.
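The cross-referencing step described above could be scripted roughly as follows. This is only a sketch: it assumes the merged TSV has a header row with a "replicate" column, as mentioned, and the function name add_sample_column and the new "sample" column are hypothetical. The mapping from repN to sample name is passed in as a plain dict.

```python
import csv
import io


def add_sample_column(tsv_text, mapping, rep_col="replicate", new_col="sample"):
    """Append a sample-name column to TSV text, looked up from the replicate column.

    Rows whose replicate label is missing from mapping get an empty value.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fields = list(reader.fieldnames or []) + [new_col]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[new_col] = mapping.get(row.get(rep_col, ""), "")
        writer.writerow(row)
    return out.getvalue()
```

Reading the TSV file into a string, passing it through add_sample_column, and writing the result back out keeps the original columns intact and adds the sample name alongside each repN.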