Standardise the fastq file naming convention

moahaegglund commented 3 years ago

The fastq files from NovaSeq are currently named as follows:

<flowcellID>_<sample-name>_<line-number-in-samplesheet>_<lane-number>_<R1/R2>_001.fastq.gz

Here the sample name is the customer's name for the sample. I suggest that we change the naming to:

<flowcellID>_<sampleID>_<lane-number>_<R1/R2>.fastq.gz

We need the lane number and R1/R2 to tell the 8 fastq files apart, but I guess there is no need for the trailing _001?

Here are some examples of fastq file names from HiSeqX:

HJCGKALXX-l1t11_999226_S3_L001_R1_001.fastq.gz
HJCGKALXX-l1t11_Undetermined_S0_L001_R1_001.fastq.gz

I suggest changing these to the same naming as for NovaSeq, with _Undetermined where needed.
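To make the proposal concrete, here is a minimal sketch of how a current HiSeqX-style name could be mapped to the suggested form. The regular expression is inferred only from the example names in this thread, the function name `proposed_name` is hypothetical, and keeping the lane token as `L001` is an assumption, since the exact format of `<lane-number>` has not been decided here.

```python
import re

# Hypothetical pattern, inferred only from the example names in this thread:
# <flowcell>[-<tile tag>]_<sample>_<index field>_L<lane>_<R1|R2>_001.fastq.gz
OLD_NAME = re.compile(
    r"^(?P<flowcell>[A-Z0-9]+)(?:-(?P<tile_tag>l\dt\d+))?"
    r"_(?P<sample>[^_]+)_(?P<index>[^_]+)"
    r"_(?P<lane>L\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def proposed_name(old_name: str) -> str:
    """Map an old-style name to <flowcellID>_<sampleID>_<lane>_<R1/R2>.fastq.gz."""
    match = OLD_NAME.match(old_name)
    if match is None:
        raise ValueError(f"Unrecognised fastq file name: {old_name}")
    # Assumption: the lane token is kept as-is (for example L001).
    return "{flowcell}_{sample}_{lane}_{read}.fastq.gz".format(**match.groupdict())


if __name__ == "__main__":
    for name in (
        "HJCGKALXX-l1t11_999226_S3_L001_R1_001.fastq.gz",
        "HJCGKALXX-l1t11_Undetermined_S0_L001_R1_001.fastq.gz",
    ):
        print(name, "->", proposed_name(name))
```

Note that this sketch rejects sample names containing underscores, which is one reason to prefer a sample ID over the customer's sample name in the file name.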

Do you think the fastq files should be named in another way? Do we need any more information in the names?

moahaegglund commented 3 years ago

@karlnyr @emmser @barrystokman @b4ckm4n @talnor @henningonsbring Please comment if you have any suggestions for the naming of the fastq files.

karlnyr commented 3 years ago

I think the new suggestion is good. 👍 If we go any further, we should involve the developers of all pipelines so that they can say whether the new format would affect their workflows.

henrikstranneheim commented 3 years ago

@Clinical-Genomics/bioinfo Please provide some feedback on this issue

Mropat commented 3 years ago

Right now the sample bundles could look like this:

                                                                                                📜 Files table 📜                                                                                                  
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ ID     ┃ File name                                                                                                                                                                            ┃ Tags             ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ 146945 │ /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t21_Undetermined_TCCGGAGA_L002_R1_001.fastq.gz │ fastq, HNCF2CCXX │
│ 146946 │ /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t11_Undetermined_TCCGGAGA_L002_R1_001.fastq.gz │ fastq, HNCF2CCXX │
│ 146947 │ /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t11_652212_TCCGGAGA_L002_R2_001.fastq.gz       │ fastq, HNCF2CCXX │
│ 146948 │ /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t21_652212_TCCGGAGA_L002_R2_001.fastq.gz       │ fastq, HNCF2CCXX │
│ 146949 │ /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t21_Undetermined_TCCGGAGA_L002_R2_001.fastq.gz │ fastq, HNCF2CCXX │
│ 146950 │ /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t21_652212_TCCGGAGA_L002_R1_001.fastq.gz       │ fastq, HNCF2CCXX │
│ 146951 │ /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t11_Undetermined_TCCGGAGA_L002_R2_001.fastq.gz │ fastq, HNCF2CCXX │
│ 146952 │ /home/proj/production/demultiplexed-runs/160623_ST-E00269_0094_AHNCF2CCXX/Unaligned/Project_652212/Sample_ADM1588A1_dual9/HNCF2CCXX-l2t11_652212_TCCGGAGA_L002_R1_001.fastq.gz       │ fastq, HNCF2CCXX │
└────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────┘

I think using the same naming convention is a good idea, but then the l2t11 and l2t21 reads should be directed into the same fastq file during demultiplexing. Otherwise, naming files flowcellid_sampleid_etc... will produce non-unique file names. Also, the trailing 001 is present on all output files right now, so if we removed it we would need to validate both cg workflow start and every pipeline, since any custom handling of file names might be affected.
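The uniqueness concern can be checked mechanically. The sketch below uses file names from the table above and a hypothetical `new_name` helper that drops the tile tag, the index field and the trailing _001; it then groups the old names by the name they would receive. The field layout is assumed from those examples only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

SUFFIX = "_001.fastq.gz"


def new_name(old_name: str) -> str:
    """Hypothetical renaming: drop the tile tag, the index field and _001."""
    if not old_name.endswith(SUFFIX):
        raise ValueError(f"Unexpected file name: {old_name}")
    stem = old_name[: -len(SUFFIX)]
    # Assumed layout: <flowcell>-<tile tag>_<sample>_<index>_<lane>_<read>
    flowcell_part, sample, _index, lane, read = stem.split("_")
    flowcell = flowcell_part.split("-")[0]
    return f"{flowcell}_{sample}_{lane}_{read}.fastq.gz"


def find_collisions(old_names: Iterable[str]) -> Dict[str, List[str]]:
    """Return each new name that more than one old file would map to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for old_name in old_names:
        grouped[new_name(old_name)].append(old_name)
    return {new: olds for new, olds in grouped.items() if len(olds) > 1}


if __name__ == "__main__":
    names = [
        "HNCF2CCXX-l2t11_652212_TCCGGAGA_L002_R1_001.fastq.gz",
        "HNCF2CCXX-l2t21_652212_TCCGGAGA_L002_R1_001.fastq.gz",
    ]
    for new, olds in find_collisions(names).items():
        print(new, "<-", olds)
```

Both tile files map to the same target name, so they would have to be merged or otherwise distinguished before renaming.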

karlnyr commented 2 years ago

Hi! I agree with @Mropat that we should definitely test all pipelines with their validation cases using the new naming convention before we make any permanent changes. As I see it, Moa's suggestion carries the information needed. In cases where tile information is stored, I believe it would be possible to simply concatenate the two files.
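As an illustration of the concatenation idea: gzip files can be joined byte-for-byte, because a sequence of gzip members is itself a valid gzip file. The following is only a sketch; the function name `concatenate_fastq_gz` is hypothetical and nothing here reflects how cg or the demultiplexing code currently handles tile files.

```python
import shutil
from pathlib import Path
from typing import Iterable


def concatenate_fastq_gz(parts: Iterable[Path], destination: Path) -> None:
    """Join gzipped fastq files byte-for-byte without recompressing them."""
    with open(destination, "wb") as out_handle:
        for part in parts:
            with open(part, "rb") as in_handle:
                # Copy the compressed bytes unchanged; the result is a
                # multi-member gzip file that standard tools can read.
                shutil.copyfileobj(in_handle, out_handle)
```

The parts must be given in the same tile order for R1 and R2, otherwise the read pairs in the two merged files will no longer line up.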

I want to rekindle this flame, so I am pinging: @Clinical-Genomics/bioinfo Do you now see any issues with the changes Moa suggested above?

henrikstranneheim commented 2 years ago

We could also add this to the agenda for the next operations meeting for further discussions.

karlnyr commented 2 years ago

If we decide to change this, I believe a combination of two existing pieces could be used: the fastq handler's way of setting the name of the current fastq (https://github.com/Clinical-Genomics/cg/blob/master/cg/meta/workflow/fastq.py) and the file-renaming function used in the demultiplexing API (https://github.com/Clinical-Genomics/cg/blob/master/cg/meta/demultiplex/files.py).

projectoriented commented 2 years ago

As far as I know, the rare disease Nextflow pipeline doesn't parse the file name, so we are good here. @rannick, can you confirm that the read group snippet of the code doesn't parse the file name?

rannick commented 2 years ago

The read group takes the whole R1 fastq file name, so any naming convention should work

rannick commented 2 years ago

BALSAMIC on the other hand seems to require FASTQNAME_R_[1,2].fastq.gz. I don't know if there is any renaming from cg. @ashwini06 @ivadym ?

ashwini06 commented 2 years ago

> BALSAMIC on the other hand seems to require FASTQNAME_R_[1,2].fastq.gz. I don't know if there is any renaming from cg. @ashwini06 @ivadym ?

Indeed, BALSAMIC requires a specific naming format to differentiate read 1 and read 2. The general input format required to start BALSAMIC is: [fastq_dir]/[samplename]_[1,2].fastq.gz. I am not sure whether cg handles the renaming when it links the fastq files.
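If renaming at link time turned out to be needed, it could look roughly like the sketch below. It assumes the proposed demultiplexing names end in _R1.fastq.gz or _R2.fastq.gz and produces the [samplename]_[1,2].fastq.gz form described above. The helper name `balsamic_style_name` is hypothetical, and this is not a description of what cg actually does.

```python
import re

# Assumption: names in the proposed convention end in _R1.fastq.gz or _R2.fastq.gz.
READ_TOKEN = re.compile(r"_R(?P<read>[12])\.fastq\.gz$")


def balsamic_style_name(sample_name: str, fastq_name: str) -> str:
    """Build <samplename>_<1|2>.fastq.gz from a name ending in _R1/_R2.fastq.gz."""
    match = READ_TOKEN.search(fastq_name)
    if match is None:
        raise ValueError(f"No R1/R2 token found in: {fastq_name}")
    return f"{sample_name}_{match['read']}.fastq.gz"
```

Because the lane is dropped from the target name, files from several lanes of the same sample would map to the same name and would need to be merged or kept in separate directories.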

talnor commented 2 years ago

For both mutant and microSALT, cg gives the fastqs new names during linking for analysis. Changing the naming convention of demux files should not affect microSALT or mutant, unless cg linking is affected.

Clinical-Genomics / demultiplexing

Standardise the fastq file naming convention #157