Closed: keyvanelhami closed this issue 3 years ago
This is an issue caused by cg concatenation, spring decompression, or topup. Fastq files are damaged, unsorted, or corrupted.
Transferred to balsamic
I have replied regarding spring decompression before
FastQ integrity is not a BALSAMIC issue. cg should send correct FastQ files, I think. Transferring to cg.
If you want to make this an issue in cg, I think you need to update the issue description and the issue title, which are currently very specific to BALSAMIC, @keyvanelhami @hassanfa. It might be better to begin by investigating what has happened and then open an issue in cg that describes what needs to be fixed. I'm closing this in the meantime.
I am confused about resistance ;-) Fastq files for toughdingo are corrupted and not sorted. I'd like to keep this open for investigation.
I edited the text body
Ultimately we should open a deviation for this, but that's too much trouble for now. Reanalysis is not possible in the current state of the FastQ files.
Reanalysis is possible manually if the files are sorted first, using easily accessible bioinformatics software. It's not a routine task, but perhaps it could be implemented as a temporary fix instead of losing the files.
Of course. It would be great to find the root cause before closing this issue. I am not worried about lost files or anything, I am more worried that it might happen again without us having a clear understanding of what is going on.
@henningonsbring and I tried to check which fastq file had the header "A00689:49:HKWN2DSXX:1:2124:10773:34648:UMI_CCTTT_AACTG" and we found it in sample ACC5821A1:
/home/proj/production/demultiplexed-runs/190716_A00689_0049_AHKWN2DSXX/Unaligned-Y151I10I10Y151/Project_568664/Sample_ACC5821A1/HKWN2DSXX_568664_S18_L001_R1_001.fastq.gz @A00689:49:HKWN2DSXX:1:2124:10773:34648 1:N:0:GCACGGACAT+GTCTCGCAAC
When looking at the number of reads in R1 and R2, there is a clear difference between them:
[0|0|328] 1d [keyvan.elhami@hasta:~] [P_main] 127 $ zcat /home/proj/production/demultiplexed-runs/190716_A00689_0049_AHKWN2DSXX/Unaligned-Y151I10I10Y151/Project_568664/Sample_ACC5821A1/HKWN2DSXX_568664_S18_L001_R1_001.fastq.gz | wc -l
204800000
[0|0|327] 1d [keyvan.elhami@hasta:~] [P_main] 1m36s $ zcat /home/proj/production/demultiplexed-runs/190716_A00689_0049_AHKWN2DSXX/Unaligned-Y151I10I10Y151/Project_568664/Sample_ACC5821A1/HKWN2DSXX_568664_S18_L001_R2_001.fastq.gz | wc -l
196608000
Could this explain this problem? Do we know why there is a difference between the amount of reads?
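For reference, here is a quick sketch converting those line counts to read counts (assuming plain 4-line FastQ records):

```python
# Line counts reported by `zcat ... | wc -l` above; 4 lines per read.
r1_lines = 204800000
r2_lines = 196608000
assert r1_lines % 4 == 0 and r2_lines % 4 == 0  # both are whole records

r1_reads = r1_lines // 4
r2_reads = r2_lines // 4
print(r1_reads, r2_reads, r1_reads - r2_reads)  # 51200000 49152000 2048000
```

So both files contain whole 4-line records, but R2 is short by exactly 2,048,000 reads, and both totals are suspiciously round numbers.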
Can you check whether the fastq IDs actually match this sample? Is there cross-sample contamination?
The number of lines in both files is evenly divisible by 4, and when looking at the end of the shorter file with "tail" it looks good, ending the way you expect a fastq file to end. Even though this quick integrity check is not complete, the file does not seem to be corrupted.
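For anyone who wants to go one step beyond line counts, a minimal structural check could look like this (a sketch only; quick_fastq_check is a hypothetical helper, it reads the whole file, and it does not pair-check R1 against R2):

```python
import gzip
from itertools import islice

def quick_fastq_check(path):
    """Cheap structural check of a gzipped FastQ: every record must have
    4 lines, a '@' header, a '+' separator, and matching seq/qual lengths."""
    with gzip.open(path, "rt") as fh:
        while True:
            record = list(islice(fh, 4))
            if not record:
                return True          # clean end of file
            if len(record) != 4:
                return False         # truncated final record
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                return False
            if len(seq.rstrip("\n")) != len(qual.rstrip("\n")):
                return False
```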
@hassanfa The flow-cell ID is written in the fastq path (190716_A00689_0049_AHKWN2DSXX) and they are the same.
@keyvanelhami Have you looked into the FastQC reports to see whether any overrepresented sequences or adapter contamination shows up?
Perhaps you could start by identifying the common denominator for the files when this problem occurs. It sounded like these were old cases that had been decompressed from spring, concatenated fastqs, and top-up.
This case is barely a year old.
Decompressed fastq files have fewer reads than the old ones.
My theory is that compression is not aware of some reads in read 2.
Måns, I think it is better if you take lead on this. You know the whole process in and out. For me it is a complete black box what is going on...
This is a whole-genome application, so it might affect MIP as well at some point. (Hopefully this statement will cause the issue to be taken seriously 😃)
So, just to summarize:
I am not sure how I can help to troubleshoot any more. @moonso @hassanfa let me know if I can do something to find out the root of this problem. Maybe we can discuss this together next week?
Sounded like @Mropat had some input on this as well. I don't know the process in and out, @hassanfa, but I'll be happy to help out on the compression/decompression part. All tests we have done (@moahaegglund, @henrikstranneheim, and I) have shown lossless results.
The process is like this:
I have a hard time seeing how the resulting files could be corrupted without that being caught in these checks, but if anyone has an idea of what might have gone wrong, we should test it.
I see. So the case of files being corrupted is closed.
We should then look into how concatenation works, and why a non-topped-up sample has a mismatching number of reads.
I can't stress enough that we need to get to the bottom of this. It is the result of an automated process; hopefully it is due to human error and not flawed logic in the code.
Could we try to recreate the situation with some smaller test files?
I think ultimately we should do the following:
All fastq headers start with "@A00689:49:HKWN2DSXX" in both forward and reverse. I did this check to see whether some data that was not supposed to be there had been concatenated into the bigger file. This suggests the shorter file has lost data.
Could someone list which commands are run during concatenation and top-up, and with what parameters? Then it would be fairly easy to construct small tests with edge cases, like duplicated reads, in some small fastq files.
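As a starting point, and assuming the concatenation is a plain byte-level cat of the gzipped lane files (the actual cg commands still need to be confirmed), a tiny edge-case test could be built like this:

```python
import gzip

def fastq_gz(read_ids):
    # One minimal 4-line record per read id, gzip-compressed.
    text = "".join(f"@{rid}\nACGT\n+\nIIII\n" for rid in read_ids)
    return gzip.compress(text.encode())

# Two "lanes" of R1; `cat lane1.gz lane2.gz > concat.gz` is byte-level
# concatenation of gzip members, which still decompresses as one stream.
lane1 = fastq_gz(["L1:1", "L1:2"])
lane2 = fastq_gz(["L2:1"])
concat = lane1 + lane2

headers = [l for l in gzip.decompress(concat).decode().splitlines()
           if l.startswith("@")]
print(headers)  # ['@L1:1', '@L1:2', '@L2:1']
```

If plain concatenation preserves order and count like this, the problem is more likely in which input files get selected than in the concatenation itself.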
A second case with a similar error and mismatching FastQ reads: hotoyster
BWA-MEM complained as follows:
[mem_sam_pe] paired reads have different names: "A00621:295:HCT7JDSXY:1:2431:23303:11443:UMI_GGAGT_CTCTT", "A00621:295:HCT7JDSXY:2:1101:18105:1000:UMI_GGAGT_CTCTT"
To make sure BALSAMIC is not the cause, I ran bwa mem on the original FastQ files and got the same error:
bwa mem -t 24 -R \
'@RG\tID:concatenated_${LIMS_ID}_XXXXXX_R\tSM:concatenated_${LIMS_ID}_XXXXXX_R\tPL:ILLUMINAi' -M -v 1 \
${PATH_TO_PROD}/cancer/reference/hg19/genome/human_g1k_v37.fasta \
${PATH_TO_PROD}/cancer/cases/hotoyster/fastq/concatenated_${LIMS_ID}_XXXXXX_R_1.fastq.gz \
${PATH_TO_PROD}/cancer/cases/hotoyster/fastq/concatenated_${LIMS_ID}_XXXXXX_R_2.fastq.gz > /dev/null
Output:
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (111, 177, 339)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 795)
[M::mem_pestat] mean and std.dev: (214.88, 155.62)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1023)
...
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1811)
[M::mem_pestat] mean and std.dev: (388.72, 321.88)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 2354)
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_pestat] skip orientation RR
[mem_sam_pe] paired reads have different names: "A00621:295:HCT7JDSXY:1:2431:23303:11443", "A00621:295:HCT7JDSXY:2:1101:18105:1000"
This clearly shows there is a flaw in the logic cg uses to provide correct FastQ files to BALSAMIC.
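The check that bwa mem trips on here can also be reproduced without an aligner; a rough sketch (assuming 4-line records; first_name_mismatch is a hypothetical helper, not part of any pipeline):

```python
import gzip
from itertools import islice, zip_longest

def first_name_mismatch(r1_path, r2_path):
    """Walk R1/R2 in lockstep and return the first pair of headers whose
    read names differ, or None if everything matches."""
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2:
        # Every 4th line is a header; zip_longest pads the shorter file with None.
        headers = zip_longest(islice(r1, 0, None, 4), islice(r2, 0, None, 4))
        for h1, h2 in headers:
            n1 = h1.split()[0] if h1 else None  # name = header up to first space
            n2 = h2.split()[0] if h2 else None
            if n1 != n2:
                return n1, n2
    return None
```

This reports both the misordered-pair case and the one-file-shorter case (the second name comes back as None).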
We currently have several BALSAMIC samples affected by this issue, and I am unable to run the analysis.
Do we have an update regarding this issue?
@Mropat @patrikgrenfeldt I think you are best suited to look into this; I'm happy to help out if I can. When do you have time? Can you also participate, @henningonsbring?
Let me know if I can contribute to solving this problem as well.
I am actually looking at it right now @moonso
@hassanfa Since you have bwa ready (environment, index, reference, and whatever else you might need to run it), can you please test it on the non-concatenated data?
I checked the non-concatenated data, and it looked fine, same number of rows in R1 and R2 (which was not the case for the other analysis with sample ACC5821A1 mentioned above).
Number of rows in the fastq files for hotoyster:
for f in /home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/*.fastq.gz; do echo $f; zcat $f | wc -l; done
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L001_R1_001.fastq.gz
73850724
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L001_R2_001.fastq.gz
73850724
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L002_R1_001.fastq.gz
78860704
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L002_R2_001.fastq.gz
78860704
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L003_R1_001.fastq.gz
83361500
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L003_R2_001.fastq.gz
83361500
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L004_R1_001.fastq.gz
87142468
/home/proj/production/demultiplexed-runs/201023_A00621_0295_AHCT7JDSXY/Unaligned-Y151I10I10Y151/Project_821450/Sample_ACC6968A49/HCT7JDSXY_821450_S83_L004_R2_001.fastq.gz
87142468
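Since R1 and R2 match for every lane, the concatenated files should each contain the sum of the lane line counts; quick arithmetic on the numbers above:

```python
# Per-lane line counts from the listing above (R1 == R2 for every lane).
lane_lines = [73850724, 78860704, 83361500, 87142468]
total_lines = sum(lane_lines)
assert total_lines % 4 == 0  # whole 4-line records
print(total_lines, total_lines // 4)  # 323215396 lines -> 80803849 reads per mate
```

Any deviation from that total in a concatenated file would point at the concatenation step rather than demultiplexing.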
Seems like the root of the problem is different for the two cases with the error discussed so far.
Sure. Can you please concatenate them somewhere outside the actual prod path? I don't feel comfortable working directly with production files.
maybe here: /home/proj/stage/cancer
I am currently testing whether user error plus the regex search in the file directory could be the reason for this. Hasta is currently full, so please be patient until tomorrow.
@henningonsbring queued the jobs. Now we wait :-)
@Mropat Great that you found the problem. It was compress & decompress & adding to HK.
Maybe we can use the same strategy that Henning is implementing in a PR for BALSAMIC as well? i.e. https://github.com/Clinical-Genomics/cg/pull/841
Tjolahopp! The mystery seems to be solved now!
What I did:
1. Decompress the files for case hotoyster again.
2. Check that all the decompression jobs were completed.
3. Store the files in Housekeeper.
4. Re-run the analysis from scratch (so concatenation and linking of fastq files is done from scratch).
The case hotoyster ran successfully again with all the reads being in place
I will repeat this for toughdingo again when decompression is done!
The reason for this bug is that we use a regex to find all the files to concatenate in each fastq file's parent path. If decompression is not completely finished and we try to link the files, cg will find all the matching files on disk (regardless of whether all of them were added to Housekeeper, or just one of them).
This should not happen if we carefully check that all decompression jobs have finished before storing and linking the files! However, I will be rewriting some of the code behind this in the coming days so that it cannot happen in the future.
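For illustration, here is a minimal (hypothetical, not the actual cg code) version of that failure mode: file selection happens by pattern match on whatever is on disk, so a half-finished decompression silently yields an unbalanced file set:

```python
import re

# Hypothetical directory listing mid-decompression: lane L001 finished
# for both mates, but for L002 only R1 has been written so far.
on_disk = [
    "HKWN2DSXX_S18_L001_R1_001.fastq.gz",
    "HKWN2DSXX_S18_L001_R2_001.fastq.gz",
    "HKWN2DSXX_S18_L002_R1_001.fastq.gz",
    # L002 R2 still decompressing -- not on disk yet
]

fastq_re = re.compile(r".+_L\d{3}_R[12]_\d{3}\.fastq\.gz$")
r1 = [f for f in on_disk if fastq_re.match(f) and "_R1_" in f]
r2 = [f for f in on_disk if fastq_re.match(f) and "_R2_" in f]

# The pattern match succeeds on whatever is present, so concatenation
# would produce an R1 with one more lane than R2 -- exactly the
# mismatched-read-names error bwa mem reported.
print(len(r1), len(r2))  # 2 1
```

Checking job completion (or comparing the selected set against Housekeeper) before linking closes this gap.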
super @Mropat
Case toughdingo was also restarted yesterday following the aforementioned method. Just as before, the new concatenated files had the correct number of reads, and there was no problem with alignment this time!
FastQ files for case toughdingo are corrupted OR they are unsorted. It seems like the concatenation step in cg is misbehaving, probably due to sequence top-up. This case has previously been analyzed through BALSAMIC, but after decompressing for a reanalysis the FastQ files are useless. The text of the original issue follows at the end of this issue.
Further investigation also shows that the read quality metrics differ between the old run and the new run:
Old run:
ACC5821A1
Read1 before filtering:
total reads: 376043577
total bases: 56782580127
Q20 bases: 55798454990(98.2669%)
Q30 bases: 54021335826(95.1372%)
Read2 before filtering:
total reads: 376043577
total bases: 56782580127
Q20 bases: 55298231467(97.3859%)
Q30 bases: 53007312076(93.3514%)
New run:
ACC5821A1
Read1 before filtering:
total reads: 331728085
total bases: 50090940835
Q20 bases: 49216780646(98.2549%)
Q30 bases: 47648562900(95.1241%)
Read2 before filtering:
total reads: 331728085
total bases: 50090940835
Q20 bases: 48784329729(97.3915%)
Q30 bases: 46776083597(93.3823%)
Original issue: