bsmn / bsmn-pipeline

BSMN common data processing pipeline

document how to generate sample list file #5

Open kdaily opened 6 years ago

bintriz commented 6 years ago

#!/bin/bash

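# Query all FASTQ files in syn7871084, drop the header row and the first three
# output columns, then reorder the remaining fields into:
# group, sample_id (fields 3-6 joined with "-"), file, synapse_id, assay, processingKit.
synapse query "select name, id, sample_id_biorepository, sample_id_original, experiment_id, grant, group, assay, processingKit from syn7871084 where fileFormat='fastq'" \
    |tail -n+2 \
    |cut -f4- \
    |awk -F"\t" '{print $7"\t"$3"-"$4"-"$5"-"$6"\t"$1"\t"$2"\t"$8"\t"$9}' \
    |sort > tmp.fastq.txt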

# Write the header once, then split the FASTQ rows into one sample list per assay type.
# Rows whose sample_id contains -535- or -797- go to the shallow WGS list instead of the regular one.
printf "group\tsample_id\tfile\tsynapse_id\tassay\tprocessingKit\n" > tmp.header.txt
{ cat tmp.header.txt; grep 10X tmp.fastq.txt; } > Samples.10X_WGS_fastq.txt
{ cat tmp.header.txt; grep wholeGenomeSeq tmp.fastq.txt |grep -v -e 10X -e '-535-' -e '-797-'; } > Samples.regular_WGS_fastq.txt
{ cat tmp.header.txt; grep -e '-535-' -e '-797-' tmp.fastq.txt; } > Samples.shallow_WGS_fastq.txt
{ cat tmp.header.txt; grep exomeSeq tmp.fastq.txt; } > Samples.WES_fastq.txt
{ cat tmp.header.txt; grep targetedSeq tmp.fastq.txt; } > Samples.Targeted_fastq.txt

rm tmp.fastq.txt

# Build the BAM list the same way, restricted to the Vaccarino group and excluding 10X rows.
{ cat tmp.header.txt
synapse query "select name, id, sample_id_biorepository, sample_id_original, experiment_id, grant, group, assay, processingKit from syn7871084 where group='Vaccarino' and fileFormat='bam'" \
    |tail -n+2 \
    |cut -f4- \
    |awk -F"\t" '{print $7"\t"$3"-"$4"-"$5"-"$6"\t"$1"\t"$2"\t"$8"\t"$9}' \
    |sort \
    |grep -v 10X
} > Samples.regular_WGS_bam.txt

rm tmp.header.txt

This shell script is what I used to generate the sample lists for the BSMN ref brain data. Of the columns, my pipeline uses only sample_id, file, and synapse_id, and the order of the columns doesn't matter. Of course, this script is pretty specific to the BSMN ref brain samples.
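
As a rough illustration of why column order doesn't matter, here is a minimal sketch of reading one of these lists by header name rather than by position. The function name read_sample_list is hypothetical and this is not the code bsmn-pipeline itself uses; it only assumes a tab-separated file whose first row is a header containing sample_id, file, and synapse_id.

```shell
#!/bin/bash

# Hypothetical helper (not part of bsmn-pipeline): print sample_id, file, and
# synapse_id from a tab-separated sample list, looking each column up by its
# header name so the column order in the file does not matter.
read_sample_list() {
    awk -F"\t" '
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) col[$i] = i
            # Stop with an error if a required column is missing.
            if (!("sample_id" in col) || !("file" in col) || !("synapse_id" in col)) {
                print "missing required column" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Emit only the three columns of interest, in a fixed order.
        { print $(col["sample_id"]) "\t" $(col["file"]) "\t" $(col["synapse_id"]) }
    ' "$1"
}
```

Something like `read_sample_list Samples.WES_fastq.txt` would then print the three columns for each row, regardless of how the other columns are arranged.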

kdaily commented 5 years ago

Can you put this in an executable script in this repository, and document it in the README? Then we can close.

kdaily commented 5 years ago

@attilagk it would be great if you can verify for @bintriz that this is sufficient.