Closed moahaegglund closed 1 year ago
@karlnyr @emmser @barrystokman @b4ckm4n @talnor @henningonsbring Please comment if you have any suggestions for the naming of the fastq files.
I think the new suggestion is good. If we go any further, we should connect with the devs of all pipelines so that they can give input on whether the new format would impact their workflow.
@Clinical-Genomics/bioinfo Please provide some feedback on this issue
Right now the sample bundles could look like this:
Files table
| ID | File name | Tags |
|---|---|---|
| 146945 | /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t21_Undetermined_TCCGGAGA_L002_R1_001.fastq.gz | fastq, HNCF2CCXX |
| 146946 | /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t11_Undetermined_TCCGGAGA_L002_R1_001.fastq.gz | fastq, HNCF2CCXX |
| 146947 | /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t11_652212_TCCGGAGA_L002_R2_001.fastq.gz | fastq, HNCF2CCXX |
| 146948 | /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t21_652212_TCCGGAGA_L002_R2_001.fastq.gz | fastq, HNCF2CCXX |
| 146949 | /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t21_Undetermined_TCCGGAGA_L002_R2_001.fastq.gz | fastq, HNCF2CCXX |
| 146950 | /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t21_652212_TCCGGAGA_L002_R1_001.fastq.gz | fastq, HNCF2CCXX |
| 146951 | /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t11_Undetermined_TCCGGAGA_L002_R2_001.fastq.gz | fastq, HNCF2CCXX |
| 146952 | /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t11_652212_TCCGGAGA_L002_R1_001.fastq.gz | fastq, HNCF2CCXX |
I think using the same naming convention is a good idea, but then the l2t11 and l2t21 reads should be directed into the same fastq file during demultiplexing. Otherwise, using flowcellid_sampleid_etc... will produce non-unique file names. Also, the trailing 001 is found on all output files right now, and if we were to remove it we would need to validate both cg workflow start and every pipeline, as there might be some custom handling of file names that would be affected.
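To make the uniqueness concern concrete, here is a minimal sketch that drops the tile token from a few of the file names in the listing above and reports which new names collide. The regular expression and the helper names (`strip_tile`, `find_collisions`) are assumptions made for illustration; they are not taken from cg.

```python
import re
from collections import defaultdict

# Tile token such as "-l2t11" right after the flow cell ID.
# The pattern is an assumption based on the file names listed above.
TILE_TOKEN = re.compile(r"-l\dt\d{2}(?=_)")


def strip_tile(file_name: str) -> str:
    """Return the file name without the lane/tile token (hypothetical helper)."""
    return TILE_TOKEN.sub("", file_name, count=1)


def find_collisions(file_names):
    """Map each tile-free name to the original names, keeping only clashes."""
    groups = defaultdict(list)
    for name in file_names:
        groups[strip_tile(name)].append(name)
    return {new: old for new, old in groups.items() if len(old) > 1}


names = [
    "HNCF2CCXX-l2t11_652212_TCCGGAGA_L002_R1_001.fastq.gz",
    "HNCF2CCXX-l2t21_652212_TCCGGAGA_L002_R1_001.fastq.gz",
    "HNCF2CCXX-l2t11_652212_TCCGGAGA_L002_R2_001.fastq.gz",
    "HNCF2CCXX-l2t21_652212_TCCGGAGA_L002_R2_001.fastq.gz",
]

collisions = find_collisions(names)
for new_name, old_names in collisions.items():
    print(new_name, "<-", old_names)
```

Each R1 and R2 name ends up with two source files, which is why the reads from the two tiles would have to land in the same fastq file (or be concatenated) before the tile token can be dropped.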
Hi! I agree with @Mropat that we should definitely test all pipelines with their validation cases using the new naming conventions before we make any permanent changes. As I see it, the suggestion made by Moa should carry the information needed. In cases where tile information is stored, I believe it would be possible to simply concatenate the two together.
I want to rekindle this flame, so I am pinging: @Clinical-Genomics/bioinfo Do you now see any issues with the changes Moa suggested above?
We could also add this to the agenda for the next operations meeting for further discussions.
If we decide to change this, I believe a combination of the fastq handler's way of setting the name of the current fastq (https://github.com/Clinical-Genomics/cg/blob/master/cg/meta/workflow/fastq.py) and the file renaming function used in the demultiplexing API (https://github.com/Clinical-Genomics/cg/blob/master/cg/meta/demultiplex/files.py) could be used.
As far as I know, the rare disease nextflow doesn't parse the file name so we are good here. @rannick - confirming here that the read group snippet of the code doesn't parse the file name?
The read group takes the whole R1 fastq file name, so any naming convention should work
BALSAMIC, on the other hand, seems to require FASTQNAME_R_[1,2].fastq.gz. I don't know if there is any renaming from cg. @ashwini06 @ivadym ?
Indeed, BALSAMIC requires a specific naming format to differentiate read1 and read2.
The general input format required to start BALSAMIC is: [fastq_dir]/[samplename]_[1,2].fastq.gz. I am not sure if renaming is handled by cg while linking fastq files.
For both mutant and microSALT, cg gives the fastqs new names during linking for analysis. Changing the naming convention of demux files should not affect microSALT or mutant, unless cg linking is affected.
The fastq files from NovaSeq are now named in the following way:
<flowcellID>_<sample-name>_<line-number-in-samplesheet>_<lane-number>_<R1/R2>_001.fastq.gz
Sample name = the customer's name for the sample. I suggest that we change the naming to the following:
<flowcellID>_<sampleID>_<lane-number>_<R1/R2>.fastq.gz
We need the lane number and R1/R2 in order to separate the 8 fastq files, but I guess there is no need for the trailing _001?
Here are some examples of the names of fastq files from HiSeqX:
HJCGKALXX-l1t11_999226_S3_L001_R1_001.fastq.gz
HJCGKALXX-l1t11_Undetermined_S0_L001_R1_001.fastq.gz
I suggest changing it to the same naming as for NovaSeq, but with _Undetermined if needed.
Do you think the fastq files should be named in another way? Do we need any more information in the names?
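As a rough illustration of the proposal, the sketch below parses a HiSeqX-style name like the two examples above and builds a name in the suggested `<flowcellID>_<sampleID>_<lane-number>_<R1/R2>.fastq.gz` form. Several details are assumptions rather than decisions from this thread: the regular expression, the function name `proposed_name`, writing the lane as a plain integer, and keeping `Undetermined` in the sample ID position.

```python
import re

# Pattern for HiSeqX names such as
# HJCGKALXX-l1t11_999226_S3_L001_R1_001.fastq.gz
# (assumed from the two examples above, not taken from cg).
HISEQX_NAME = re.compile(
    r"^(?P<flowcell>[A-Z0-9]+)-l\dt\d{2}"  # flow cell ID plus lane/tile token
    r"_(?P<sample>[^_]+)"                  # sample ID or "Undetermined"
    r"_(?P<index>[^_]+)"                   # index or sample sheet field, dropped
    r"_L(?P<lane>\d{3})"                   # zero-padded lane number
    r"_R(?P<read>[12])"                    # read direction
    r"_001\.fastq\.gz$"                    # trailing _001, dropped
)


def proposed_name(old_name: str) -> str:
    """Build the suggested file name from a HiSeqX-style name (hypothetical helper)."""
    match = HISEQX_NAME.match(old_name)
    if match is None:
        raise ValueError(f"Unexpected fastq file name: {old_name}")
    lane = int(match["lane"])  # assumption: lane written without zero padding
    return f"{match['flowcell']}_{match['sample']}_{lane}_R{match['read']}.fastq.gz"


print(proposed_name("HJCGKALXX-l1t11_999226_S3_L001_R1_001.fastq.gz"))
print(proposed_name("HJCGKALXX-l1t11_Undetermined_S0_L001_R1_001.fastq.gz"))
```

Note that the tile token is discarded here, so files that differ only in tile would map to the same new name and would need to be merged before renaming.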