Clinical-Genomics / demultiplexing

To keep scripts associated with execution of the Illumina demultiplexing pipeline

added option where second index is missing #145

Closed karlnyr closed 3 years ago

karlnyr commented 3 years ago

This PR opts novaseq out from using basemask as an argument and lets bcl2fastq determine it from the runfolder. In addition, it adds the correction of basemasks used for hiseqx.
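
For readers unfamiliar with what "determine it from the runfolder" involves: the read structure of a run is recorded in the runfolder's RunInfo.xml as a list of Read elements with Number, NumCycles and IsIndexedRead attributes. The sketch below derives a plain base mask from such a file. It only illustrates the idea and is not bcl2fastq's own logic; the XML snippet is a trimmed-down, hypothetical example and the function name is made up for this note.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down RunInfo.xml. Real files carry more fields.
RUN_INFO = """<?xml version="1.0"?>
<RunInfo>
  <Run Id="example_run" Number="1">
    <Reads>
      <Read Number="1" NumCycles="51" IsIndexedRead="N" />
      <Read Number="2" NumCycles="6" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="51" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>
"""


def base_mask_from_run_info(run_info_xml: str) -> str:
    """Build a mask that uses every sequenced cycle (illustrative helper)."""
    root = ET.fromstring(run_info_xml)
    # Order the reads by their Number attribute, as sequenced.
    reads = sorted(root.iter("Read"), key=lambda read: int(read.get("Number")))
    parts = []
    for read in reads:
        # "I" marks an index read, "Y" a read that ends up in the FASTQ files.
        kind = "I" if read.get("IsIndexedRead") == "Y" else "Y"
        parts.append(f"{kind}{read.get('NumCycles')}")
    return ",".join(parts)


mask = base_mask_from_run_info(RUN_INFO)
print(mask)
```

When the sample sheet indexes have the same lengths as the index reads listed there, nothing needs to be masked and the option can be left out.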

How to prepare for test:

How to test:

Expected test outcome:

Review:

This version is a:

karlnyr commented 3 years ago

Installing on stage: image

Executing test: image

barrystokman commented 3 years ago

So I've been thinking/reading manuals about this.

First of all, the correct base mask for this particular run would be Y51,I6,Y51. We're not actually masking any bases here, and that's where the ns come in. Since this exactly matches the information in the run parameters, we don't even need to specify a base mask for bcl2fastq to use. If the flowcell was sequenced as dual 10, for instance, but also contains samples with single 6 indexes, we would need to demux those separately using the --use-bases-mask Y151,I6n4,n10,Y151 option for bcl2fastq.
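
To make the mask notation concrete, here is a minimal sketch that builds such a string from the sequenced cycles and the index lengths actually present on the samples. The function and parameter names are invented for this note and are not taken from the repository's scripts.

```python
def index_mask(run_cycles: int, index_length: int) -> str:
    """Mask for one index read: use index_length bases, ignore the rest."""
    if index_length > run_cycles:
        raise ValueError("sample index is longer than the sequenced index read")
    if index_length == 0:
        # No index in this read for these samples: mask every cycle.
        return f"n{run_cycles}"
    unused = run_cycles - index_length
    return f"I{index_length}" + (f"n{unused}" if unused else "")


def base_mask(read1, index1_cycles, index2_cycles, read2,
              sample_index1, sample_index2=0):
    """Join the per-read masks into a --use-bases-mask style string."""
    parts = [f"Y{read1}", index_mask(index1_cycles, sample_index1)]
    if index2_cycles:
        parts.append(index_mask(index2_cycles, sample_index2))
    if read2:
        parts.append(f"Y{read2}")
    return ",".join(parts)


# Dual 10 run, samples carrying a single index of length 6.
print(base_mask(151, 10, 10, 151, sample_index1=6))
# Single 6 run, samples carrying a single index of length 6.
print(base_mask(51, 6, 0, 51, sample_index1=6))
```

The first call yields the mask quoted above for the mixed case; the second yields a mask without any n, which is why it adds nothing over the run parameters.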

Second of all, I'd like to know how and when single indexes with length 6 will occur in future runs. If a flowcell has single indexes with length 6 only, and the run parameters reflect this, there's no need for a base mask at all. @AnnaZetterlund @AnnaLeinfelt Do you have any information about this? Can we expect flowcells with single index length 6 only, or are we going back to the days of mixed index flowcells?

Third, regarding flowcells that are sequenced using dual 10 indexes: if any dual length 8 indexes are present, we pad them to length 10 and they would again match the run parameters. A --use-bases-mask option is not even needed to demultiplex them correctly. Same goes for Fluffy flowcells that are dual index length 8, and have matching run parameters.
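
As a rough illustration of the padding step, the sketch below extends a length 8 index to length 10. The pad bases are passed in as an argument because this thread does not say which bases are used; the "AT" in the example call is purely a placeholder, and the function name is hypothetical.

```python
def pad_index(index: str, target_length: int, pad: str) -> str:
    """Extend index to target_length using leading bases of pad (illustrative)."""
    missing = target_length - len(index)
    if missing < 0:
        raise ValueError("index is already longer than the target length")
    if missing > len(pad):
        raise ValueError("pad sequence is too short to reach the target length")
    return index + pad[:missing]


# "AT" is a placeholder pad, not necessarily what the pipeline uses.
padded = pad_index("GCAGAATT", 10, "AT")
print(padded, len(padded))
```

Once every index on the sample sheet has the length of the sequenced index reads, the sample sheet and the run parameters agree.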

Long story short: if we can demultiplex all samples on a flowcell in one go, AND their indexes align with the run parameters, we do NOT need to supply bcl2fastq with a --use-bases-mask option.
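
The rule can be written down as a small check. This is only a sketch of the reasoning in this comment, with made-up function names and hypothetical sample names; index lengths are given per index read, after any padding.

```python
from collections import defaultdict


def needs_base_mask(run_index_cycles, sample_index_lengths):
    """True if any sample's index lengths differ from the run parameters."""
    run = tuple(run_index_cycles)
    return any(tuple(lengths) != run for lengths in sample_index_lengths)


def group_by_index_lengths(samples):
    """Group sample names by index lengths; more than one group means separate demux runs."""
    groups = defaultdict(list)
    for name, lengths in samples.items():
        groups[tuple(lengths)].append(name)
    return dict(groups)


# Single 6 run, all samples single 6: no mask needed.
print(needs_base_mask((6,), [(6,), (6,)]))
# Dual 10 run with one single 6 sample mixed in: a mask is needed for that sample.
print(needs_base_mask((10, 10), [(10, 10), (6,)]))
# Hypothetical sample names, grouped by index layout.
print(group_by_index_lengths({"sample_a": (10, 10), "sample_b": (6,)}))
```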

edit: @moonso you might be interested in this too.

barrystokman commented 3 years ago

For this PR, I would prefer that you remove the --use-bases-mask option from demux-novaseq.sh instead.

karlnyr commented 3 years ago

Agreed, if that is the case (that we can always deduce the basemask from the runinfo), I see no reason to use the basemask other than to name the Unaligned dir. I added an option to correctly name the Unaligned dir, and we are no longer using the basemask in bcl2fastq for novaseq here.
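
A minimal sketch of what the resulting invocation looks like when assembled programmatically. The flags and thread counts mirror the bcl2fastq call in the project log pasted further down this thread; the function name and the paths in the example are hypothetical, and the actual pipeline builds the command in a shell script rather than in Python.

```python
import shlex


def bcl2fastq_command(runfolder, output_dir, sample_sheet, base_mask=None):
    """Assemble the argument list; the mask is only added when explicitly given."""
    command = [
        "bcl2fastq",
        "--loading-threads", "3",
        "--processing-threads", "15",
        "--writing-threads", "3",
        "--runfolder-dir", runfolder,
        "--output-dir", output_dir,
        "--sample-sheet", sample_sheet,
        "--barcode-mismatches", "1",
    ]
    if base_mask is not None:
        command += ["--use-bases-mask", base_mask]
    return command


# Hypothetical paths, for illustration only.
command = bcl2fastq_command(
    runfolder="/path/to/runs/EXAMPLE_RUN",
    output_dir="/path/to/demultiplexed-runs/EXAMPLE_RUN/Unaligned",
    sample_sheet="/path/to/runs/EXAMPLE_RUN/SampleSheet.csv",
)
print(shlex.join(command))
```

Leaving base_mask at None corresponds to the novaseq case discussed here, where bcl2fastq works out the read structure from the runfolder.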

AnnaZetterlund commented 3 years ago

@barrystokman @karlnyr We would not run any new type of mixed index runs that I can think of. I am sorry for this run and the bad heads up. We thought the samples were prepped with dual 8 index until very late in the process, and when changing to single 6 we forgot to ask you for your input. I would not expect them to run these indexes again; they will run with another set of dual 8, but I will add them to OF and OP and LIMS and everything should work much better! :)

karlnyr commented 3 years ago

Excellent! If I understood @barrystokman correctly, this means that we could go forward with this PR for novaseq demux?

karlnyr commented 3 years ago

After update to script on 2021-03-23: image

Correct basemask for hiseqx flowcells: image

Starting demux of novaseq FC: image

Checking projectlog of HW72JDRXX:

[0|0|0] 1d [karl.nyren@hasta:/home/proj/stage] [S_main] 21s $ cat /home/proj/stage/demultiplexed-runs//201214_A00689_0204_AHW72JDRXX/projectlog.20210324160158.log
[20210324160159] On node: cg-dragen.scilifelab.se
[20210324160159] starting, will use /home/karl.nyren/tmp
[20210324160159] Run directory: /home/proj/stage/flowcells/novaseq/runs//201214_A00689_0204_AHW72JDRXX
[20210324160159] Demux directory: /home/proj/stage/demultiplexed-runs//201214_A00689_0204_AHW72JDRXX
[20210324160159] mkdir -p /home/proj/production/flowcells/novaseq/1045762
[20210324160159] start demultiplexing /home/proj/stage/flowcells/novaseq/runs//201214_A00689_0204_AHW72JDRXX
[20210324160159] singularity exec --bind /home/proj/production/demultiplexed-runs,/home/proj/production/flowcells/novaseq,/home/proj/production/flowcells/novaseq/1045762:/run/user/4407 /home/proj/production/demux-on-hasta/novaseq/container/bcl2fastq_v2-20-0.sif bcl2fastq --loading-threads 3 --processing-threads 15 --writing-threads 3 --runfolder-dir /home/proj/stage/flowcells/novaseq/runs//201214_A00689_0204_AHW72JDRXX --output-dir /home/proj/stage/demultiplexed-runs//201214_A00689_0204_AHW72JDRXX/Unaligned --sample-sheet /home/proj/stage/flowcells/novaseq/runs//201214_A00689_0204_AHW72JDRXX/SampleSheet.csv --barcode-mismatches 1
.......................................
[20210324161751] cgstats add --machine novaseq --unaligned Unaligned /home/proj/stage/demultiplexed-runs//201214_A00689_0204_AHW72JDRXX
[2021-03-24 16:17:59,534] INFO    : cgstats.db.cli           : Adding Unaligned.
[20210324161807] cgstats select --project 269990 HW72JDRXX &> /home/proj/stage/demultiplexed-runs//201214_A00689_0204_AHW72JDRXX/stats-269990-HW72JDRXX.txt
2021-03-24 16:18:10 hasta.scilifelab.se demux.utils.indexreport[153162] INFO Parsing file index report for SP FC HW72JDRXX_NIPT_269990, extracting top unknownbarcodes and samples with cluster counts lower than 1000000.
2021-03-24 16:18:10 hasta.scilifelab.se demux.utils.indexreport[153162] INFO Parsing complete!
2021-03-24 16:18:10 hasta.scilifelab.se demux.cli.indexreport[153162] INFO Creating summary of laneBarcode.html for FC: HW72JDRXX_NIPT_269990
2021-03-24 16:18:10 hasta.scilifelab.se demux.utils.indexreport[153162] INFO Validating report
2021-03-24 16:18:10 hasta.scilifelab.se demux.utils.indexreport[153162] INFO Number of report tables: Passed!
2021-03-24 16:18:10 hasta.scilifelab.se demux.utils.indexreport[153162] INFO Sample cluster count headers: Passed!
2021-03-24 16:18:10 hasta.scilifelab.se demux.utils.indexreport[153162] INFO Top Unknown Barcodes table: Passed!
2021-03-24 16:18:10 hasta.scilifelab.se demux.utils.indexreport[153162] INFO Validation passed
2021-03-24 16:18:10 hasta.scilifelab.se demux.utils.indexreport[153162] INFO Wrote indexcheck report summary to /home/proj/stage/demultiplexed-runs/201214_A00689_0204_AHW72JDRXX/laneBarcode_summary.html

Checking the results of HW72JDRXX:

[0|0|0] 1d [karl.nyren@hasta:/home/proj/stage] [S_main] $ ll -h demultiplexed-runs/201214_A00689_0204_AHW72JDRXX/Unaligned/Project_269990/*
demultiplexed-runs/201214_A00689_0204_AHW72JDRXX/Unaligned/Project_269990/Sample_2020-26636-05:
total 1.5G
-rw-rw----+ 1 karl.nyren hasta-development 358M Mar 24 16:09 HW72JDRXX_269990_S1_L001_R1_001.fastq.gz
-rw-rw----+ 1 karl.nyren hasta-development 376M Mar 24 16:09 HW72JDRXX_269990_S1_L001_R2_001.fastq.gz
-rw-rw----+ 1 karl.nyren hasta-development 358M Mar 24 16:17 HW72JDRXX_269990_S1_L002_R1_001.fastq.gz
-rw-rw----+ 1 karl.nyren hasta-development 376M Mar 24 16:17 HW72JDRXX_269990_S1_L002_R2_001.fastq.gz
..............................
demultiplexed-runs/201214_A00689_0204_AHW72JDRXX/Unaligned/Project_269990/Sample_2020-27726-05:
total 1.3G
-rw-rw----+ 1 karl.nyren hasta-development 315M Mar 24 16:09 HW72JDRXX_269990_S46_L001_R1_001.fastq.gz
-rw-rw----+ 1 karl.nyren hasta-development 333M Mar 24 16:09 HW72JDRXX_269990_S46_L001_R2_001.fastq.gz
-rw-rw----+ 1 karl.nyren hasta-development 315M Mar 24 16:17 HW72JDRXX_269990_S46_L002_R1_001.fastq.gz
-rw-rw----+ 1 karl.nyren hasta-development 332M Mar 24 16:17 HW72JDRXX_269990_S46_L002_R2_001.fastq.gz
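
A listing like this can also be checked mechanically. The sketch below parses the file names shown above and reports any sample/lane combination that lacks either R1 or R2. The regular expression is inferred from the names in this listing only, and the helper names are made up for this note.

```python
import re
from collections import defaultdict

# Pattern inferred from names such as HW72JDRXX_269990_S1_L001_R1_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<flowcell>[^_]+)_(?P<project>[^_]+)_S(?P<sample_number>\d+)"
    r"_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)


def parse_fastq_name(name):
    """Split a FASTQ file name into its fields."""
    match = FASTQ_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected FASTQ name: {name}")
    return match.groupdict()


def lanes_missing_a_mate(names):
    """Return (sample_number, lane) pairs that do not have both R1 and R2."""
    reads_seen = defaultdict(set)
    for name in names:
        fields = parse_fastq_name(name)
        reads_seen[(fields["sample_number"], fields["lane"])].add(fields["read"])
    return sorted(key for key, reads in reads_seen.items() if reads != {"1", "2"})


names = [
    "HW72JDRXX_269990_S1_L001_R1_001.fastq.gz",
    "HW72JDRXX_269990_S1_L001_R2_001.fastq.gz",
    "HW72JDRXX_269990_S1_L002_R1_001.fastq.gz",
    "HW72JDRXX_269990_S1_L002_R2_001.fastq.gz",
]
print(parse_fastq_name(names[0]))
print(lanes_missing_a_mate(names))
```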

Starting demux of hiseqx FC: image

Checking the projectlog for HVCFMCCXY:

[0|0|229] 1d [karl.nyren@hasta:/home/proj/stage] [S_main] $ cat /home/proj/stage/demultiplexed-runs/181130_ST-E00269_0322_AHVCFMCCXY/projectlog.20210323133906.log
...................................
[20210323133912] Submitted batch job 1043569
[20210323133912] submit postface
[20210323133912] Running 1043556 1043557 1043558 1043559 1043560 1043561 1043562 1043563 1043564 1043565 1043566 1043567 1043568 1043569 1039621 1039627 1039629 1043198 1043199 1043202 1043207 1043209 1043211 1043212 1043213 1043214 1040159 1040168 1040210 1040214 1040222 1040223 1040227 1040228 1040310 1040319 1043200 1043201 1043203 1043204 1043205 1043206 1043208 1043210 1043215 1043216 1043217 1043218 1043251 1043252 1043253 1043254 1043255 1043256 1043257 1043258 1043364 1043412 1043426 1043427 1043428 1043365 1043366 1043372 1043376 1043377 1043378 1043407 1043413 1043414 1043415 1043424 1043425 1043362 1043363 1043367 1043368 1043379 1043380 1043381 1043382 1043383 1043384 1043385 1043386 1043387 1043388 1043389 1043390 1043391 1043392 1043393 1043394 1043395 1043396 1043397 1043398 1043399 1043400 1043401 1043402 1043403 1043404 1043405 1043406 1043408 1043409 1043410 1043411 1043416 1043417 1043418 1043419 1043420 1043421 1043422 1043423 1043429 1043430 1043431 1043432 1043437 1043554 1043555 1039623 1039614 1040216 1040215 1043197 1040199 1040305 1040206 1040153 1043250 1043370 1043369 1043371 1043349 1043374 1043373 1043375 1043351 1043354 1043361 1043350 1043353 1043352 1043355
[20210323133912] Demux 1043554 1043555 1043556 1043557 1043558 1043559 1043560 1043561 1043562 1043563 1043564 1043565 1043566 1043567 1043568 1043569
[20210323133912] Remaining 1043554 1043555 1043556 1043557 1043558 1043559 1043560 1043561 1043562 1043563 1043564 1043565 1043566 1043567 1043568 1043569
[20210323133912] sbatch -A development -J 'Xdem-postface' --dependency='afterok:1043554:1043555:1043556:1043557:1043558:1043559:1043560:1043561:1043562:1043563:1043564:1043565:1043566:1043567:1043568:1043569' -o '/home/proj/stage/demultiplexed-runs/181130_ST-E00269_0322_AHVCFMCCXY//LOG/xdem-xpostface-HVCFMCCXY-%j.log' -e '/home/proj/stage/demultiplexed-runs/181130_ST-E00269_0322_AHVCFMCCXY//LOG/xdem-xpostface-HVCFMCCXY-%j.err' '/home/proj/stage/bin/git/demultiplexing/scripts/hiseqx/xpostface.batch' '/home/proj/stage/demultiplexed-runs/181130_ST-E00269_0322_AHVCFMCCXY//'
Submitted batch job 1043570
[20210323133912] Everything started

And the results for HVCFMCCXY:

[0|0|229] 1d [karl.nyren@hasta:/home/proj/stage] [S_main] $ ll -h /home/proj/stage/demultiplexed-runs/181130_ST-E00269_0322_AHVCFMCCXY/Unaligned/Project_*/*
...................................
/home/proj/stage/demultiplexed-runs/181130_ST-E00269_0322_AHVCFMCCXY/Unaligned/Project_462297/Sample_ACC5136A7_GCAGAATT:
total 79G
-rw-rw----+ 1 hiseq.clinical hasta-development   18G Mar 23 13:46 HVCFMCCXY-l7t11_462297_S7_L007_R1_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development   20G Mar 23 13:46 HVCFMCCXY-l7t11_462297_S7_L007_R2_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development  1.4G Mar 23 13:46 HVCFMCCXY-l7t11_Undetermined_S0_L007_R1_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development  1.5G Mar 23 13:46 HVCFMCCXY-l7t11_Undetermined_S0_L007_R2_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development   18G Mar 23 13:46 HVCFMCCXY-l7t21_462297_S7_L007_R1_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development   20G Mar 23 13:46 HVCFMCCXY-l7t21_462297_S7_L007_R2_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development  922M Mar 23 13:46 HVCFMCCXY-l7t21_Undetermined_S0_L007_R1_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development 1009M Mar 23 13:46 HVCFMCCXY-l7t21_Undetermined_S0_L007_R2_001.fastq.gz

/home/proj/stage/demultiplexed-runs/181130_ST-E00269_0322_AHVCFMCCXY/Unaligned/Project_462297/Sample_ACC5136A8_ATGAGGCC:
total 75G
-rw-rw----+ 1 hiseq.clinical hasta-development  17G Mar 23 13:46 HVCFMCCXY-l8t11_462297_S8_L008_R1_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development  19G Mar 23 13:46 HVCFMCCXY-l8t11_462297_S8_L008_R2_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development 1.2G Mar 23 13:46 HVCFMCCXY-l8t11_Undetermined_S0_L008_R1_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development 1.3G Mar 23 13:46 HVCFMCCXY-l8t11_Undetermined_S0_L008_R2_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development  17G Mar 23 13:46 HVCFMCCXY-l8t21_462297_S8_L008_R1_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development  19G Mar 23 13:46 HVCFMCCXY-l8t21_462297_S8_L008_R2_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development 708M Mar 23 13:46 HVCFMCCXY-l8t21_Undetermined_S0_L008_R1_001.fastq.gz
-rw-rw----+ 1 hiseq.clinical hasta-development 763M Mar 23 13:46 HVCFMCCXY-l8t21_Undetermined_S0_L008_R2_001.fastq.gz

karlnyr commented 3 years ago

@barrystokman is this okay to merge? It would need to go live with this PR as well to work properly

karlnyr commented 3 years ago

Installing on prod: image

image