Clinical-Genomics / demultiplexing

To keep scripts associated with execution of the Illumina demultiplexing pipeline
5 stars 0 forks source link

feat(linking of undetermined files) #153

Closed karlnyr closed 3 years ago

karlnyr commented 3 years ago

This PR fixes

How to prepare for test:

How to test:

Expected test outcome:

Review:

This version is a:

karlnyr commented 3 years ago

Executing test on 2 lanes: image Observing the results of 2 lanes: image Fastq files were generated, I am a bit concerned that all samples have Undetermined reads connected to them. Executing test on 8 lanes: image Observing the results of 8 lanes: image Fastq files were generated, still same situation were Undetermined are related to every sample.

karlnyr commented 3 years ago

@barrystokman I believe that this is ready now for review, I am also going to make some new tests on testing the workflow when we create an Undetermined directory

barrystokman commented 3 years ago

I'm having a hard time seeing which test is which, what the expected outcomes are and if they have passed or not. Can you make that a bit clearer? From your wording it seems that two tests failed (2 and 8 lanes) and one test (pooled) has not been executed?

karlnyr commented 3 years ago

I'm having a hard time seeing which test is which, what the expected outcomes are and if they have passed or not. Can you make that a bit clearer? From your wording it seems that two tests failed (2 and 8 lanes) and one test (pooled) has not been executed?

Yes you are correct. Both of the tests above were before I implemented the change of undetermined files. I noticed the undetermined files issue in the tests and decided to add a fix to it as well. I have not performed the new tests just yet, got caught up in some production stuff. I can @ you again when the new tests are performed 😄

moahaegglund commented 3 years ago

I see now that the Y was to be added here, but yes I think it's good to deploy the other PR with only that so that the demux of HISeqX works. :) A comment on this PR, 2 lanes = HiSeq 2500 and 8 lanes = HiSeqX.

karlnyr commented 3 years ago

@AnnaZetterlund @moahaegglund Is it correct that 2500 are 2 lanes and hiseqx are 8? In that case this PR was kind of a blunder since we could move the flowcells affected from the hiseqx directory to the 2500 and it would hopefully work.

@barrystokman Do you know what makes 2500's end up in hiseqx dir on hasta?

vwirta commented 3 years ago

@AnnaZetterlund @moahaegglund Is it correct that 2500 are 2 lanes and hiseqx are 8? In that case this PR was kind of a blunder since we could move the flowcells affected from the hiseqx directory to the 2500 and it would hopefully work.

@barrystokman Do you know what makes 2500's end up in hiseqx dir on hasta?

HiSeq X is always 8 lanes, while HiSeq 2500 is either 2 lanes (rapid mode flow cells) or 8 lanes (high output mode flow cells).

barrystokman commented 3 years ago

I need a bit more information if I'm going to investigate flowcells being in the wrong dir :)

What are the full run names of the involved flowcells?

moahaegglund commented 3 years ago

It's one flowcell that's in the wrong directory: /home/proj/production/flowcells/hiseqx/150119_D00450_0151_AHB1T6ADXX We have fetched 8 other 2500 flowcells and they all ended up in the correct location.

karlnyr commented 3 years ago

If that is the case, the part regarding lane counts can be scrapped in this PR. I do however believe that the undetermined files van be a solution to the previous issue you encountered @moahaegglund ?

@vwirta @henrikstranneheim I am also considering removing the undetermined files of pooled samples to save space, thoughts?

vwirta commented 3 years ago

The only reason for having the undetermined reads would be to be able to look for samples not included in the sample sheet. Otherwise I have no strong arguments for keeping them.

henrikstranneheim commented 3 years ago

@karlnyr I think it would be good to delete them

moahaegglund commented 3 years ago

I've put in a project suggestion to redo all demultiplexing of HiSeqX flowcells as there is errors in Housekeeper and cgstats too connected to these flowcells above all this with the undetermined fastq files. If we are to remove all folders and redo it eventually I don't think we should put time into cleaning up in the existing folders.

I also think that we should think this through before doing changes on the demultiplexing in order to get it correct. The fastq files was previously linked to Unaligned but this changed during 2019, a guess is that it was changed in order to match with Novaseq. This linked fastq files doesn't work with spring/Housekeeper. The check if it's pooled or not has to be per lane as this differs between lanes on some flowcells. I suggest we have a meeting to discuss how the demultiplexing should be changed when there is time to start working on this project.

karlnyr commented 3 years ago

I've put in a project suggestion to redo all demultiplexing of HiSeqX flowcells as there is errors in Housekeeper and cgstats too connected to these flowcells above all this with the undetermined fastq files. If we are to remove all folders and redo it eventually I don't think we should put time into cleaning up in the existing folders.

I also think that we should think this through before doing changes on the demultiplexing in order to get it correct. The fastq files was previously linked to Unaligned but this changed during 2019, a guess is that it was changed in order to match with Novaseq. This linked fastq files doesn't work with spring/Housekeeper. The check if it's pooled or not has to be per lane as this differs between lanes on some flowcells. I suggest we have a meeting to discuss how the demultiplexing should be changed when there is time to start working on this project.

I agree with this, and the changes would be minor to link on single sample lanes and not link on singe sample lanes.

karlnyr commented 3 years ago

New tests for linking and no linking of undetermined files:

Pooled - HFH5HALXX image

Non-Pooled - HVCFMCCXY image

Pooled lanes on flowcells gets their undetermined files correctly moved into the Undetermined folder. Non-pooled lanes get their undetermined files correctly linked to their respective sample.

karlnyr commented 3 years ago

@moahaegglund @barrystokman Despite no real request to work on this now I have fixed the linking of the undetermined files. This way we will not link pooled lanes, and if we decide to remove the Undetermined we could simply remove the final part of xdemuxtiles.batch in order to only link pooled lanes and scrap the rest of the files.

henrikstranneheim commented 3 years ago

@karlnyr Great work! Is this also connected to project: https://github.com/Clinical-Genomics/project-planning/issues/252?

karlnyr commented 3 years ago

I would say so yes!

karlnyr commented 3 years ago

@barrystokman I believe this one is now finished as well. I was finally able to satisfy the linting action as well, so many edits in anything but demux/utils/samplesheet and demux/cli/samplesheet/ and the tests are for linting reasons

karlnyr commented 3 years ago

deployed: image