Arcadia-Science / seqqc

A Nextflow pipeline to identify quality control issues with new sequencing data.
MIT License

draft of data upload instructions for cron job #28

Closed taylorreiter closed 1 year ago

taylorreiter commented 1 year ago

New sequencing data should be uploaded to the Arcadia Science S3 bucket arcadia-seqqc. Upload each data set into its own new folder in arcadia-seqqc/indir. Name the folder <year>-<initials>-<descriptor>, where descriptor describes your sequencing data in up to 10 characters. For example, if I created a folder, the full path would be s3://arcadia-seqqc/indir/2023-ter-timecheese.

New sequencing data must be in gzipped FASTQ format (*.fq.gz, *.fastq.gz) and can be single-end or paired-end. You also need to create and upload a CSV samplesheet that documents your sample names and the paths to the data in the S3 bucket. See these instructions for how to create and format a samplesheet.
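The naming rules above could be checked before uploading with something like the sketch below. It is only an illustration: the function names are hypothetical, and the restriction to letters and digits in the initials and descriptor is my assumption (the instructions only cap the descriptor at 10 characters).

```python
import re

BUCKET = "arcadia-seqqc"
# Suffixes accepted for gzipped FASTQ files, per the instructions above.
FASTQ_SUFFIXES = (".fq.gz", ".fastq.gz")


def upload_prefix(year, initials, descriptor):
    """Build the S3 folder for a new data set: <year>-<initials>-<descriptor>."""
    if not re.fullmatch(r"\d{4}", str(year)):
        raise ValueError(f"year must be four digits, got {year!r}")
    # Assumption: initials are letters only (not stated in the instructions).
    if not re.fullmatch(r"[A-Za-z]+", initials):
        raise ValueError(f"initials must be letters, got {initials!r}")
    # The 10-character limit is from the instructions; alphanumeric-only is assumed.
    if not re.fullmatch(r"[A-Za-z0-9]{1,10}", descriptor):
        raise ValueError(f"descriptor must be 1-10 letters or digits, got {descriptor!r}")
    return f"s3://{BUCKET}/indir/{year}-{initials}-{descriptor}"


def is_gzipped_fastq(filename):
    """Return True if the file name ends in .fq.gz or .fastq.gz."""
    return filename.endswith(FASTQ_SUFFIXES)
```

For the example in this comment, `upload_prefix(2023, "ter", "timecheese")` returns `s3://arcadia-seqqc/indir/2023-ter-timecheese`.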

Each (day, week?) the cron job will check whether there is new data in arcadia-seqqc/indir. If there is, the cron job will run the seqqc pipeline on the FASTQ files. All FASTQ files in one indir folder (e.g. s3://arcadia-seqqc/indir/2023-ter-timecheese) will be run together in a single pipeline run. When the pipeline has finished, the output files will be available in the matching outdir folder, s3://arcadia-seqqc/outdir/2023-ter-timecheese. You will also receive an email with the quality control report attached. This report contains inline documentation on how to interpret the results. If your data are good quality, you can then upload them to the European Nucleotide Archive.
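To make the indir-to-outdir mapping concrete, here is a rough sketch. The function names are hypothetical, and treating "new data" as "an indir folder with no matching outdir folder" is my assumption about how the cron job might decide what to run, not a description of the actual job.

```python
INDIR = "s3://arcadia-seqqc/indir/"
OUTDIR = "s3://arcadia-seqqc/outdir/"


def output_location(input_uri):
    """Map an indir data set folder to the outdir folder with the same name."""
    if not input_uri.startswith(INDIR):
        raise ValueError(f"not an indir path: {input_uri!r}")
    dataset = input_uri[len(INDIR):].strip("/")
    # Each data set lives in exactly one folder directly under indir.
    if not dataset or "/" in dataset:
        raise ValueError(f"expected a single data set folder, got {input_uri!r}")
    return OUTDIR + dataset


def datasets_to_run(indir_folders, outdir_folders):
    """Return indir folder names with no matching outdir folder.

    Assumption: a data set counts as new until its output folder exists.
    """
    finished = set(outdir_folders)
    return sorted(name for name in indir_folders if name not in finished)
```

With this, `output_location("s3://arcadia-seqqc/indir/2023-ter-timecheese")` returns `s3://arcadia-seqqc/outdir/2023-ter-timecheese`.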