Closed: keyvanelhami closed this issue 3 years ago
This is an issue caused by cg concatenation, spring decompression, or topup. Fastq files are damaged, unsorted, or corrupted.
Transferred to balsamic
I have replied regarding spring decompression before
FastQ integrity is not a BALSAMIC issue. cg should send correct FastQ files, I think. Transferring to cg.
If you want to make this an issue in cg, I think you need to update the issue description and the issue title, which are currently very specific to BALSAMIC, @keyvanelhami @hassanfa. It might be better to begin by investigating what has happened and then open an issue in cg that describes what needs to be fixed. I'm closing this in the meantime.
I am confused about resistance ;-) Fastq files for toughdingo are corrupted and not sorted. I'd like to keep this open for investigation.
I edited the text body
Ultimately we should open a deviation for this, but that's too much trouble for now. Reanalysis is not possible in the current state of the FastQ files.
Reanalysis is possible manually if the files are sorted first, using easily accessible bioinformatics software. It's not a routine task, but perhaps it could be implemented as a temporary fix instead of losing the files.
Of course. It would be great to find the root cause before closing this issue. I am not worried about lost files or anything, I am more worried that it might happen again without us having a clear understanding of what is going on.
@henningonsbring and I tried to check which fastq file had the header "A00689:49:HKWN2DSXX:1:2124:10773:34648:UMI_CCTTT_AACTG" and we found it in sample ACC5821A1:
/home/proj/production/demultiplexed-runs/190716_A00689_0049_AHKWN2DSXX/Unaligned-Y151I10I10Y151/Project_568664/Sample_ACC5821A1/HKWN2DSXX_568664_S18_L001_R1_001.fastq.gz @A00689:49:HKWN2DSXX:1:2124:10773:34648 1:N:0:GCACGGACAT+GTCTCGCAAC
When looking at the number of reads in R1 and R2, there is a clear difference between them:
[0|0|328] 1d [keyvan.elhami@hasta:~] [P_main] 127 $ zcat /home/proj/production/demultiplexed-runs/190716_A00689_0049_AHKWN2DSXX/Unaligned-Y151I10I10Y151/Project_568664/Sample_ACC5821A1/HKWN2DSXX_568664_S18_L001_R1_001.fastq.gz | wc -l
204800000
[0|0|327] 1d [keyvan.elhami@hasta:~] [P_main] 1m36s $ zcat /home/proj/production/demultiplexed-runs/190716_A00689_0049_AHKWN2DSXX/Unaligned-Y151I10I10Y151/Project_568664/Sample_ACC5821A1/HKWN2DSXX_568664_S18_L001_R2_001.fastq.gz | wc -l
196608000
Could this explain this problem? Do we know why there is a difference between the amount of reads?
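For reference, here is a quick sketch converting those line counts to read counts (assuming plain 4-line FastQ records):

```python
# Line counts reported by `zcat ... | wc -l` above; 4 lines per read.
r1_lines = 204800000
r2_lines = 196608000
assert r1_lines % 4 == 0 and r2_lines % 4 == 0  # both are whole records

r1_reads = r1_lines // 4
r2_reads = r2_lines // 4
print(r1_reads, r2_reads, r1_reads - r2_reads)  # 51200000 49152000 2048000
```

So both files contain whole 4-line records, but R2 is short by exactly 2,048,000 reads, and both totals are suspiciously round numbers.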
Can you check whether the fastq IDs actually match this sample? Is there cross-sample contamination?
The number of lines in both files is evenly divisible by 4, and when looking at the end of the shorter file with "tail" it looks good, ending the way you expect a fastq file to end. Even though this quick integrity check is not complete, the file does not seem to be corrupted.
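For anyone who wants to go one step beyond line counts, a minimal structural check could look like this (a sketch only; quick_fastq_check is a hypothetical helper, it reads the whole file, and it does not pair-check R1 against R2):

```python
import gzip
from itertools import islice

def quick_fastq_check(path):
    """Cheap structural check of a gzipped FastQ: every record must have
    4 lines, a '@' header, a '+' separator, and matching seq/qual lengths."""
    with gzip.open(path, "rt") as fh:
        while True:
            record = list(islice(fh, 4))
            if not record:
                return True          # clean end of file
            if len(record) != 4:
                return False         # truncated final record
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                return False
            if len(seq.rstrip("\n")) != len(qual.rstrip("\n")):
                return False
```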
@hassanfa The flow-cell ID is written in the fastq path (190716_A00689_0049_AHKWN2DSXX) and they are the same.
@keyvanelhami Have you looked into the FastQC reports to see whether any overrepresented sequences or adapter contamination shows up?
Perhaps you could start by identifying the common denominator for the files when this problem occurs. It sounded like these were old cases that had been decompressed from spring, concatenated fastqs, and top-up.
This case is barely a year old.
Decompressed fastq files have fewer reads than the old ones.
My theory is that compression is not aware of some reads in read 2.
Måns, I think it is better if you take lead on this. You know the whole process in and out. For me it is a complete black box what is going on...
This is a whole-genome application, so it might affect MIP as well at some point. (Hopefully this statement will cause the issue to be taken seriously 😃)
So, just to summarize:
I am not sure how I can help to troubleshoot any more. @moonso @hassanfa let me know if I can do something to find out the root of this problem. Maybe we can discuss this together next week?
Sounded like @Mropat had some input on this as well. I don't know the process in and out, @hassanfa, but I'll be happy to help out on the compression/decompression part. All tests we have done (@moahaegglund, @henrikstranneheim, and I) have shown lossless results.
The process is like this:
I have a hard time seeing how the resulting files could be corrupted without that being caught in these checks, but if anyone has an idea of what might have gone wrong, we should test it.
I see. So the case of files being corrupted is closed.
We should then look into how concatenation works, and why a non-topped-up sample has a mismatching number of reads.
I can't stress enough that we need to get to the bottom of this. It is the result of an automated process; hopefully it is due to human error and not flawed logic in the code.
Could we try to recreate the situation with some smaller test files?
I think ultimately we should do the following:
All fastq headers start with "@A00689:49:HKWN2DSXX" in both forward and reverse. I did this check to see whether some data that was not supposed to be there had been concatenated into the bigger file. This suggests the shorter file has lost data.
Could someone list which commands are run during concatenation and top-up, and with what parameters? Then it would be fairly easy to construct small tests with edge cases, like duplicated reads, in some small fastq files.
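As a starting point, and assuming the concatenation is a plain byte-level cat of the gzipped lane files (the actual cg commands still need to be confirmed), a tiny edge-case test could be built like this:

```python
import gzip

def fastq_gz(read_ids):
    # One minimal 4-line record per read id, gzip-compressed.
    text = "".join(f"@{rid}\nACGT\n+\nIIII\n" for rid in read_ids)
    return gzip.compress(text.encode())

# Two "lanes" of R1; `cat lane1.gz lane2.gz > concat.gz` is byte-level
# concatenation of gzip members, which still decompresses as one stream.
lane1 = fastq_gz(["L1:1", "L1:2"])
lane2 = fastq_gz(["L2:1"])
concat = lane1 + lane2

headers = [l for l in gzip.decompress(concat).decode().splitlines()
           if l.startswith("@")]
print(headers)  # ['@L1:1', '@L1:2', '@L2:1']
```

If plain concatenation preserves order and count like this, the problem is more likely in which input files get selected than in the concatenation itself.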
A second case with a similar error and mismatching FastQ reads: hotoyster
BWA-MEM complained as follows:
[mem_sam_pe] paired reads have different names: "A00621:295:HCT7JDSXY:1:2431:23303:11443:UMI_GGAGT_CTCTT", "A00621:295:HCT7JDSXY:2:1101:18105:1000:UMI_GGAGT_CTCTT"
To make sure BALSAMIC is not the cause, I ran bwa mem on the original FastQ files and got the same error:
bwa mem -t 24 -R \
'@RG\tID:concatenated_${LIMS_ID}_XXXXXX_R\tSM:concatenated_${LIMS_ID}_XXXXXX_R\tPL:ILLUMINAi' -M -v 1 \
${PATH_TO_PROD}/cancer/reference/hg19/genome/human_g1k_v37.fasta \
${PATH_TO_PROD}/cancer/cases/hotoyster/fastq/concatenated_${LIMS_ID}_XXXXXX_R_1.fastq.gz \
${PATH_TO_PROD}/cancer/cases/hotoyster/fastq/concatenated_${LIMS_ID}_XXXXXX_R_2.fastq.gz > /dev/null
Output:
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (111, 177, 339)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 795)
[M::mem_pestat] mean and std.dev: (214.88, 155.62)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1023)
...
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1811)
[M::mem_pestat] mean and std.dev: (388.72, 321.88)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2354)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[mem_sam_pe] paired reads have different names: "A00621:295:HCT7JDSXY:1:2431:23303:11443", "A00621:295:HCT7JDSXY:2:1101:18105:1000"
This clearly shows there is a flaw in the logic cg uses to provide correct FastQ files to BALSAMIC.
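The check that bwa mem trips on here can also be reproduced without an aligner; a rough sketch (assuming 4-line records; first_name_mismatch is a hypothetical helper, not part of any pipeline):

```python
import gzip
from itertools import islice, zip_longest

def first_name_mismatch(r1_path, r2_path):
    """Walk R1/R2 in lockstep and return the first pair of headers whose
    read names differ, or None if everything matches."""
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2:
        # Every 4th line is a header; zip_longest pads the shorter file with None.
        headers = zip_longest(islice(r1, 0, None, 4), islice(r2, 0, None, 4))
        for h1, h2 in headers:
            n1 = h1.split()[0] if h1 else None  # name = header up to first space
            n2 = h2.split()[0] if h2 else None
            if n1 != n2:
                return n1, n2
    return None
```

This reports both the misordered-pair case and the one-file-shorter case (the second name comes back as None).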
We currently have several BALSAMIC samples affected by this issue, and I am unable to run the analysis.
Do we have an update regarding this issue?
@Mropat @patrikgrenfeldt I think you are best suited to look into this; I'm happy to help out if I can. When do you have time? Can you also participate, @henningonsbring?
Let me know if I can contribute to solving this problem as well.
I am actually looking at it right now @moonso
@hassanfa Since you have bwa ready (environment, index, reference, and whatever else you might need to run it), can you please test it on the non-concatenated data?
I checked the non-concatenated data, and it looked fine, same number of rows in R1 and R2 (which was not the case for the other analysis with sample ACC5821A1 mentioned above).
Number of rows in the fastq files for hotoyster:
for f in /home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/*.fastq.gz; do echo $f; zcat $f | wc -l; done
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L001_R1_001.fastq.gz
73850724
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L001_R2_001.fastq.gz
73850724
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L002_R1_001.fastq.gz
78860704
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L002_R2_001.fastq.gz
78860704
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L003_R1_001.fastq.gz
83361500
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L003_R2_001.fastq.gz
83361500
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L004_R1_001.fastq.gz
87142468
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L004_R2_001.fastq.gz
87142468
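Since R1 and R2 match for every lane, the concatenated files should each contain the sum of the lane line counts; quick arithmetic on the numbers above:

```python
# Per-lane line counts from the listing above (R1 == R2 for every lane).
lane_lines = [73850724, 78860704, 83361500, 87142468]
total_lines = sum(lane_lines)
assert total_lines % 4 == 0  # whole 4-line records
print(total_lines, total_lines // 4)  # 323215396 lines -> 80803849 reads per mate
```

Any deviation from that total in a concatenated file would point at the concatenation step rather than demultiplexing.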
Seems like the root of the problem is different for the two cases with the error discussed so far.
Sure. Can you please concatenate them somewhere outside the actual prod path? I don't feel comfortable working directly with production files.
maybe here: /home/proj/stage/cancer
I am currently testing whether user error plus the regex search in the file directory could be the reason for this. Hasta is currently full, so please be patient until tomorrow.
@henningonsbring queued the jobs. Now we wait :-)
@Mropat Great that you found the problem. It was compress & decompress & adding to HK.
Maybe we can use the same strategy that Henning is implementing in a PR for BALSAMIC as well? i.e. https://github.com/Clinical-Genomics/cg/pull/841
Tjolahopp! The mystery seems to be solved now!
What I did:
1. Decompress the files for case hotoyster again.
2. Check that all the decompression jobs were completed.
3. Store the files in Housekeeper.
4. Re-run the analysis from scratch (so concatenation and linking of fastq files is done from scratch).
The case hotoyster ran successfully again with all the reads being in place
I will repeat this for toughdingo again when decompression is done!
The reason for this bug is that we use a regex to find all the files to concatenate in each fastq file's parent path. If decompression is not completely finished and we try to link the files, cg will find all the matching files on disk (regardless of whether all of them were added to Housekeeper, or just one of them).
This should not happen if we carefully check that all decompression jobs have finished before storing and linking the files! However, I will be rewriting some of the code behind this in the coming days so that it cannot happen in the future.
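For illustration, here is a minimal (hypothetical, not the actual cg code) version of that failure mode: file selection happens by pattern match on whatever is on disk, so a half-finished decompression silently yields an unbalanced file set:

```python
import re

# Hypothetical directory listing mid-decompression: lane L001 finished
# for both mates, but for L002 only R1 has been written so far.
on_disk = [
    "HKWN2DSXX_S18_L001_R1_001.fastq.gz",
    "HKWN2DSXX_S18_L001_R2_001.fastq.gz",
    "HKWN2DSXX_S18_L002_R1_001.fastq.gz",
    # L002 R2 still decompressing -- not on disk yet
]

fastq_re = re.compile(r".+_L\d{3}_R[12]_\d{3}\.fastq\.gz$")
r1 = [f for f in on_disk if fastq_re.match(f) and "_R1_" in f]
r2 = [f for f in on_disk if fastq_re.match(f) and "_R2_" in f]

# The pattern match succeeds on whatever is present, so concatenation
# would produce an R1 with one more lane than R2 -- exactly the
# mismatched-read-names error bwa mem reported.
print(len(r1), len(r2))  # 2 1
```

Checking job completion (or comparing the selected set against Housekeeper) before linking closes this gap.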
super @Mropat
Case toughdingo was also restarted yesterday following the aforementioned method. Just as before, the new concatenated files had the correct number of reads, and there was no problem with alignment this time!
FastQ files for case toughdingo are corrupted OR they are unsorted. It seems like the concatenation step in cg is misbehaving, probably due to sequence top-up. This case has previously been analyzed through BALSAMIC, but after decompressing for a reanalysis the FastQ files are useless. The text of the original issue follows at the end of this issue.
Further investigation also shows that the read quality metrics differ between the old run and the new run:
Old run:
ACC5821A1
Read1 before filtering:
total reads: 376043577
total bases: 56782580127
Q20 bases: 55798454990(98.2669%)
Q30 bases: 54021335826(95.1372%)
Read2 before filtering:
total reads: 376043577
total bases: 56782580127
Q20 bases: 55298231467(97.3859%)
Q30 bases: 53007312076(93.3514%)
New run:
ACC5821A1
Read1 before filtering:
total reads: 331728085
total bases: 50090940835
Q20 bases: 49216780646(98.2549%)
Q30 bases: 47648562900(95.1241%)
Read2 before filtering:
total reads: 331728085
total bases: 50090940835
Q20 bases: 48784329729(97.3915%)
Q30 bases: 46776083597(93.3823%)
Original issue: