Closed lokeshbio closed 4 months ago
It might be useful to keep the commands I use to run md5sum across multiple processes on LSENS:
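module load parallel/20220722
raw=$1
# Hash every file under Data with 8 jobs; -print0/-0 keeps paths with spaces intact
find "${raw}/Data" -type f -print0 | parallel -0 -j8 md5sum {} >> "${raw}/md5.txt"
# forgot that I also need to do this: replace the absolute upload prefix with "."
sed -i 's|/projects/.*/upload/\w\+|.|g' "${raw}/md5.txt"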
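A possible alternative, not something we run today: hashing from inside the run directory writes relative paths directly, and `md5sum -c` can then check the delivery from that same directory. This is only a sketch under assumptions: the temp directory and file names are made up, and a serial `xargs` stands in for `parallel` to keep it dependency-free.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a run directory with a Data folder
raw=$(mktemp -d)
mkdir -p "${raw}/Data"
printf 'ACGT\n' > "${raw}/Data/a.fastq"
printf 'TTGA\n' > "${raw}/Data/b.fastq"

# Hash from inside the run directory so md5.txt holds ./Data/... paths
(cd "${raw}" && find ./Data -type f -print0 | xargs -0 md5sum > md5.txt)

# Verify from the same directory; LC_ALL=C keeps the "OK" messages untranslated
check_output=$(cd "${raw}" && LC_ALL=C md5sum -c md5.txt)
echo "${check_output}"
```

Whoever receives the delivery would only need the last step, run from the top of the delivered folder.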
We should probably include the kind of delivery each project requires in the samplesheet! At the moment, the example samplesheet looks like this:
[Yggdrasil_Projects],,,
Project_ID,bcl,fastq,fastq_screen,fastq_screen_ref
2022_000,0,1,0,NA
If we keep adding more pipelines, adding a column for each one is not sustainable! I would rather have the end deliveries set as keys such as BCL, FASTQ, FASTQ_SCREEN, RNASEQ, METHYLSEQ and so on, all in a single column. Then we can map them to binary parameters in Yggdrasil!
#In samplesheet
[Yggdrasil_Projects],,,
Project_ID,Delivery
2022_000,FASTQ
2022_001,RNASEQ
# Then in Yggdrasil nextflow script: we can set these parameters as
params.rnaseq = true // specifically for 2022_001
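To make the idea concrete, here is a rough sketch of how the single Delivery column could be turned into per-project flags. It assumes the two-column layout proposed above; the function name `delivery_flags` and the `--<delivery> true` output format are just illustrations, not anything Yggdrasil has today. Nextflow reads double-dash options on the command line as `params`, so `--rnaseq true` would end up as `params.rnaseq`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print "<project> --<delivery> true" for each row of the [Yggdrasil_Projects] section.
# Assumes the proposed layout: a section line, a header row, then Project_ID,Delivery rows.
delivery_flags() {
  awk -F',' '
    /^\[/ { in_section = ($1 == "[Yggdrasil_Projects]"); header = 1; next }
    in_section && header { header = 0; next }   # skip the Project_ID,Delivery header row
    in_section && NF >= 2 && $1 != "" { print $1, "--" tolower($2), "true" }
  ' "$1"
}

# Throwaway copy of the example samplesheet
sheet=$(mktemp)
cat > "${sheet}" <<'EOF'
[Yggdrasil_Projects],,,
Project_ID,Delivery
2022_000,FASTQ
2022_001,RNASEQ
EOF

flags=$(delivery_flags "${sheet}")
echo "${flags}"
```

Adding a new pipeline would then only mean accepting a new key in the Delivery column, with no change to the samplesheet layout.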
BCL delivery process needs to be discussed
Hi @chaetognatha, the samplesheet example above is for a run that could contain different deliveries in the same run: one project with FASTQ and the other with RNASEQ. If we have a test run and a test samplesheet like this, then I can test running Yggdrasil all the way from raw data to the rnaseq output!
The samplesheet for rawdata deliveries will have the same name as every other samplesheet, but it will contain only the project ID: a single line, with no commas or anything else! (There should also be flags to specify both the project ID and rawdata delivery, in which case we shouldn't need a samplesheet at all!)
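Since the rawdata samplesheet shares its name with the regular one, the two would have to be told apart by content. A minimal sketch of such a check, assuming "one non-empty line and no commas" is the whole rule; `is_rawdata_sheet` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeeds only if the file holds exactly one non-empty line and no commas
is_rawdata_sheet() {
  local sheet=$1
  local nonempty with_commas
  # grep -c exits 1 when the count is 0, so "|| true" keeps set -e happy
  nonempty=$(grep -c -v '^[[:space:]]*$' "${sheet}" || true)
  with_commas=$(grep -c ',' "${sheet}" || true)
  [ "${nonempty}" -eq 1 ] && [ "${with_commas}" -eq 0 ]
}

# Throwaway examples: a rawdata sheet and a regular sheet
raw_sheet=$(mktemp)
printf '2022_000\n' > "${raw_sheet}"
full_sheet=$(mktemp)
printf '[Yggdrasil_Projects],,,\nProject_ID,Delivery\n2022_000,FASTQ\n' > "${full_sheet}"

raw_kind=$(if is_rawdata_sheet "${raw_sheet}"; then echo rawdata; else echo regular; fi)
full_kind=$(if is_rawdata_sheet "${full_sheet}"; then echo rawdata; else echo regular; fi)
echo "raw_sheet: ${raw_kind}, full_sheet: ${full_kind}"
```

A stricter version could also validate the project ID format, but that pattern is not defined in this thread.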