Clinical-Genomics / demultiplexing

To keep scripts associated with execution of the Illumina demultiplexing pipeline

Incorrect status of flowcell and now fastq files are linked twice in HK and read count in cgstats is wrong #149

Closed moahaegglund closed 1 year ago

moahaegglund commented 3 years ago

Flowcell HV7VTCCXY had status "removed" even though it was in reality on disk (present in the demultiplexed folder but not in HK) from being demultiplexed in 2019. The structure of the demultiplexed folder and the naming of the files have changed, which led to the fastq files being included twice in Housekeeper.

$ housekeeper get file ACC5113A10
2021-04-15 11:06:57 hasta.scilifelab.se housekeeper.cli.core[246652] INFO Use database mysql+pymysql://housekeeper:X1yek09lW14jG7XU@localhost:3308/housekeeper
2021-04-15 11:06:57 hasta.scilifelab.se housekeeper.cli.core[246652] INFO Use root path /home/proj/production/housekeeper-bundles
2021-04-15 11:06:58 hasta.scilifelab.se housekeeper.store.api.find[246652] INFO Fetching files from bundle ACC5113A10
                                 📜 Files table 📜
┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ ID     ┃ File name                                            ┃ Tags             ┃
┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ 850728 │ HV7VTCCXY-l2t21_640692_AGATCGCA_L002_R2_001.fastq.gz │ fastq, HV7VTCCXY │
│ 850729 │ HV7VTCCXY-l2t11_640692_AGATCGCA_L002_R1_001.fastq.gz │ fastq, HV7VTCCXY │
│ 850730 │ HV7VTCCXY-l2t21_640692_AGATCGCA_L002_R1_001.fastq.gz │ fastq, HV7VTCCXY │
│ 850731 │ HV7VTCCXY-l2t21_640692_S2_L002_R2_001.fastq.gz       │ fastq, HV7VTCCXY │
│ 850732 │ HV7VTCCXY-l2t11_640692_AGATCGCA_L002_R2_001.fastq.gz │ fastq, HV7VTCCXY │
│ 850733 │ HV7VTCCXY-l2t21_640692_S2_L002_R1_001.fastq.gz       │ fastq, HV7VTCCXY │
│ 850734 │ HV7VTCCXY-l2t11_640692_S2_L002_R1_001.fastq.gz       │ fastq, HV7VTCCXY │
│ 850735 │ HV7VTCCXY-l2t11_640692_S2_L002_R2_001.fastq.gz       │ fastq, HV7VTCCXY │
└────────┴──────────────────────────────────────────────────────┴──────────────────┘

The renaming of files during linking to MIP prevents the files from being included twice in that analysis. (For information: I've solved this manually for sample ACC5113A11 while linking undetermined fastq files, as I had to start a prio case.)
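
The duplication can be read off the listing above: each lane part and read appears once with the index sequence (AGATCGCA) and once with what looks like a sample-sheet number (S2) in the fourth field. Below is a minimal Python sketch, assuming that reading of the file names is correct, that groups the names while ignoring that field. The regular expression and the function name `find_duplicates` are illustrative only and are not part of Housekeeper.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred only from the eight names in the listing above.
FASTQ_NAME = re.compile(
    r"^(?P<flowcell>[A-Z0-9]+)-(?P<part>l\dt\d+)_(?P<sample>\d+)_"
    r"(?P<token>[ACGT]+|S\d+)_(?P<lane>L\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def find_duplicates(names):
    """Group file names that differ only in the index/S-number field."""
    groups = defaultdict(list)
    for name in names:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # unknown naming scheme: skip rather than guess
        key = (
            match["flowcell"],
            match["part"],
            match["sample"],
            match["lane"],
            match["read"],
        )
        groups[key].append(name)
    # Keep only the keys that are represented by more than one file.
    return {key: sorted(files) for key, files in groups.items() if len(files) > 1}


names = [
    "HV7VTCCXY-l2t21_640692_AGATCGCA_L002_R2_001.fastq.gz",
    "HV7VTCCXY-l2t11_640692_AGATCGCA_L002_R1_001.fastq.gz",
    "HV7VTCCXY-l2t21_640692_AGATCGCA_L002_R1_001.fastq.gz",
    "HV7VTCCXY-l2t21_640692_S2_L002_R2_001.fastq.gz",
    "HV7VTCCXY-l2t11_640692_AGATCGCA_L002_R2_001.fastq.gz",
    "HV7VTCCXY-l2t21_640692_S2_L002_R1_001.fastq.gz",
    "HV7VTCCXY-l2t11_640692_S2_L002_R1_001.fastq.gz",
    "HV7VTCCXY-l2t11_640692_S2_L002_R2_001.fastq.gz",
]

duplicates = find_duplicates(names)
for key, files in sorted(duplicates.items()):
    print(key, files)
```

For the eight files in this bundle, this yields four groups of two files each, i.e. every fastq file is present twice under two different names.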

This has also affected cgstats, where samples now have the wrong number of reads:

$ cgstats sample ACC5113A10
1713518192

The sample has 857M reads according to LIMS.
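
The cgstats figure is almost exactly twice the LIMS figure, which fits the files being counted twice. A quick check of the arithmetic in Python, using only the two numbers quoted above:

```python
cgstats_reads = 1_713_518_192  # printed by `cgstats sample ACC5113A10` above
lims_reads_millions = 857      # figure from LIMS, in millions of reads

halved = cgstats_reads // 2
print(halved)                     # 856759096
print(round(halved / 1_000_000))  # 857, matching the LIMS figure
```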

How can we fix this in cgstats and HK? How can we prevent this from happening again?

barrystokman commented 3 years ago

Remove the flowcell (and samples) from cgstats and manually add it again. Remove the files you don't want MIP to use from HK.

As for the incorrect flowcell status:

  1. short term: check hasta before requesting a flowcell from PDC (action: team production)
  2. longer term: find all incorrect flowcell statuses and fix them (action: Team Trocadero)

emmser commented 3 years ago

How can we prevent that the files are being added twice to HK?

barrystokman commented 3 years ago

We need to make sure we don't add them a second time, so before fetching the flowcell from PDC and demuxing again we need to check whether the flowcell is actually on disk. Unfortunately, the flowcell status in statusdb is not always correct. You need to check manually (for now) whether the flowcell exists on disk. If it does, we don't need to fetch and demux. This is something the production team can do.
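
As an illustration of that manual check, here is a minimal Python sketch that looks for a run directory belonging to the flowcell. The root path is the one mentioned elsewhere in this thread. The assumption that run directory names contain the flowcell id, and the function name `flowcell_on_disk`, are hypothetical and not taken from cg or this repository.

```python
from pathlib import Path

# Root of the demultiplexed runs on hasta, as mentioned elsewhere in this thread.
DEMUX_ROOT = Path("/home/proj/production/demultiplexed-runs")


def flowcell_on_disk(flowcell_id: str, demux_root: Path = DEMUX_ROOT) -> bool:
    """Return True if a run directory under demux_root mentions the flowcell id.

    Assumption: run directory names contain the flowcell id somewhere.
    """
    if not demux_root.is_dir():
        return False
    return any(
        entry.is_dir() and flowcell_id in entry.name
        for entry in demux_root.iterdir()
    )


if __name__ == "__main__":
    flowcell = "HV7VTCCXY"
    if flowcell_on_disk(flowcell):
        print(f"{flowcell} is on disk: do not fetch from PDC or demultiplex again")
    else:
        print(f"{flowcell} not found on disk: fetching from PDC may be needed")
```

A directory that exists is not proof that all fastq files are still in it, so this only answers the first question (is there anything on disk at all).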

Of course doing things manually is not how we want to do things, so we need to check and if needed fix all flowcell statuses. This is something one of the developers will do.

moahaegglund commented 3 years ago

When fetching from PDC works, doesn't it start automatically if MIP tries to start a case where the flowcell of 1 of 3 samples has status removed?

barrystokman commented 3 years ago

That's a good point and I think you are correct.

moahaegglund commented 3 years ago

Yes, I found it here: function all_flowcells_on_disk in https://github.com/Clinical-Genomics/cg/blob/d3c802794ec3d6eb6c3fb327659c9f306dceb0d4/cg/meta/workflow/analysis.py.

moahaegglund commented 3 years ago

The files will not be linked twice to the MIP analysis, but in case of a fastq delivery it will be wrong (this has happened before). I don't know how the other pipelines would handle this.

emmser commented 3 years ago

Also, in some cases I've seen that the flowcells on disk are only partly kept: some fastq files have been removed while others are still there. In those cases we still want to be able to demux the whole flowcell without adding files twice. Is it impossible to do this check automatically?
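
One way such an automatic check could look, as a rough Python sketch: compare new files against the ones already registered after stripping the field that differs between the old and new naming. This assumes the only difference is index sequence versus S-number in front of the lane field, as in the file listing earlier in this issue. `normalise` and `files_to_add` are made-up names, not Housekeeper functions.

```python
import re

# Assumption: old and new names differ only in this field (index sequence vs. S-number).
_TOKEN = re.compile(r"_(?:[ACGT]+|S\d+)(?=_L\d{3}_R[12]_001\.fastq\.gz$)")


def normalise(name: str) -> str:
    """Drop the index/S-number field so both naming schemes compare equal."""
    return _TOKEN.sub("", name)


def files_to_add(already_in_bundle, candidates):
    """Return the candidates that are not yet represented in the bundle."""
    seen = {normalise(name) for name in already_in_bundle}
    new_files = []
    for name in candidates:
        key = normalise(name)
        if key not in seen:
            seen.add(key)
            new_files.append(name)
    return new_files


existing = ["HV7VTCCXY-l2t11_640692_AGATCGCA_L002_R1_001.fastq.gz"]
candidates = [
    "HV7VTCCXY-l2t11_640692_S2_L002_R1_001.fastq.gz",  # same data, new name
    "HV7VTCCXY-l2t11_640692_S2_L002_R2_001.fastq.gz",  # missing from the bundle
]
print(files_to_add(existing, candidates))
```

With a filter like this, a partly kept flowcell could be demultiplexed in full and only the missing files would be added. It compares names only, so it would not notice files whose content differs.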

emmser commented 3 years ago

> Remove the files you don't want MIP to use from HK.

Handling the files manually like this is not safe; it is very easy to get wrong.

moahaegglund commented 3 years ago

I think we should wait to do anything until the prio case fastfalcon is delivered, since the index sample is from this flowcell. But would it be possible to delete everything in HK connected to this flowcell, remove it from hasta (/home/proj/production/demultiplexed-runs/) and cgstats, and redo the demultiplexing? Would that solve this in the short term without too much manual work? Or maybe it's OK to remove the flowcell from disk? (I don't know if that is possible. I remember we were all confused about the removal of flowcells before summer.)

barrystokman commented 3 years ago

@moahaegglund that works