broadinstitute / CODECsuite

analysis pipeline for CODEC data

All "LOW_CONF" reads #3

Closed wclee47 closed 1 year ago

wclee47 commented 1 year ago

Hi, I ran the trim and obtained the following output:

LOST_BOTH: 2009756
LOST_READ1: 1310988
LOST_READ2: 1738581
HIGH_CONF: 0
LOW_CONF: 481239847
BOTH_UNTRIMMED: 0
READ1_UNTRIMMED: 0
READ2_UNTRIMMED: 0
SINGLE_INSERT: 0
SINGLE_INSERT_LOWCONF: 0
DOUBLE_LIGATION: 0
TOTAL: 486299172

It seems all of the retained reads are classified as "LOW_CONF". Does that mean they are low confidence? How is the confidence determined? Is it okay to use this result?

Please advise.

Thanks,

Won-Chul
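The trimming summary in this comment can be sanity-checked with a few lines of Python. This is only a sketch: it parses the posted text directly (the `parse_counts` helper is hypothetical, not part of CODECsuite) and reports each non-zero category as a share of TOTAL.

```python
# Trimming summary as posted in the comment above.
SUMMARY = """
LOST_BOTH: 2009756 LOST_READ1: 1310988 LOST_READ2: 1738581
HIGH_CONF: 0 LOW_CONF: 481239847 BOTH_UNTRIMMED: 0
READ1_UNTRIMMED: 0 READ2_UNTRIMMED: 0 SINGLE_INSERT: 0
SINGLE_INSERT_LOWCONF: 0 DOUBLE_LIGATION: 0 TOTAL: 486299172
"""


def parse_counts(text):
    """Turn whitespace-separated 'NAME: count' pairs into a dict (hypothetical helper)."""
    tokens = text.split()
    return {tokens[i].rstrip(":"): int(tokens[i + 1])
            for i in range(0, len(tokens), 2)}


counts = parse_counts(SUMMARY)
total = counts.pop("TOTAL")
fractions = {name: n / total for name, n in counts.items()}

# In this particular output the categories add up exactly to TOTAL.
assert sum(counts.values()) == total

# Print only the categories that actually received reads.
for name, n in counts.items():
    if n:
        print(f"{name:<12} {n:>12} {fractions[name]:.4%}")
```

With these numbers, LOW_CONF accounts for roughly 99% of the read pairs and the three LOST_* categories make up the rest; no pair was classified as HIGH_CONF.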

ruolin commented 1 year ago

Nice to see you have run CODEC. We should no longer have LOW_CONF in our current workflow. I need to gather some information. How were the samples multiplexed? What sample barcodes did you use? Can you show me the output of the demultiplexing step? Basically, the adapter trimming step uses the sample barcode to figure out whether a read pair has the correct structure.

wclee47 commented 1 year ago

Thanks ruolin.

The output of the demultiplexing step was like this:

sample_A,GAGCCTACTCAGTCAACG,GTGTCGAACACTTGACGG
sample_B,CTTGAACGGACTGTCCAC,CACCGAGCGTTAGACTAC

sample, matched, matched%: sample_A, 486299172, 0.999968

sample, matched, matched%: sample_B, 15539, 3.19526e-05

total, #PF, #matched, matched%: 1040775800, 1040775800, 486314711, 0.467262

The CODEC sample was pooled with other samples that have the regular Illumina adapter structure, and we believe the CODEC data ended up in the "Undetermined" reads at the bcl2fastq step because it could not be demultiplexed. We then used the "Undetermined" reads as input for CODECsuite.

Any hint from that?

Thank you.
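The percentages in the demultiplexing summary above can be reproduced from the raw counts. The sketch below uses only the numbers posted in this thread; the variable names are illustrative and are not CODECsuite identifiers.

```python
# Counts copied from the demultiplexing output in this comment.
total_pf = 1040775800                      # "#PF" reads
matched = {"sample_A": 486299172, "sample_B": 15539}

# TOTAL from the trimming summary in the first comment.
trim_total = 486299172

total_matched = sum(matched.values())

# Share of each sample among the matched reads.
sample_share = {name: n / total_matched for name, n in matched.items()}

# Share of all PF reads that matched any sample in the sheet.
overall_share = total_matched / total_pf

for name, share in sample_share.items():
    print(f"{name}: {matched[name]} reads, {share:.6g} of matched")
print(f"matched overall: {total_matched} of {total_pf} ({overall_share:.6f})")

# The trimmer processed exactly the read pairs assigned to sample_A.
print("trim input equals sample_A matches:", trim_total == matched["sample_A"])
```

The per-sample shares are relative to the matched reads, not to all PF reads, which is why sample_A shows 0.999968 even though less than half of the PF reads matched either sample.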

ruolin commented 1 year ago

No problem. Happy to help with your issues.

We have never pooled CODEC libraries with standard NGS libraries in one flowcell. CODEC has no sample index reads, and we just don't know what will happen during Illumina's index cycles. So when we run a CODEC-only flowcell, we disable the index cycles and use the extra cycles for sequencing the reads. For example, on an Illumina NovaSeq S4 we get 2x166bp reads. That being said, I think your approach might work, but I need to look at the reads you have. Would you mind sharing some of the demultiplexed reads from sample_A? I also wonder why sample_B has so few reads.

wclee47 commented 1 year ago

Sure, I've just sent the files (top 1000 reads) to your gmail address. I should have removed sample_B from the sample sheet. We only used sample_A in the run.

Could you please take a look at the reads? Many thanks.

Won-Chul

ruolin commented 1 year ago

Hi Won-Chul, thanks for sharing the data. I think I found the problem. An Illumina sample barcode was added to the read name before the CODEC sample barcode. In the FASTQ read names, you can see that the last 18 characters are the CODEC barcode. I will have a fix soon, but I wanted to give you an update.
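To illustrate the observation above, here is a minimal Python sketch that takes the last 18 characters of a read name as the CODEC barcode and treats everything before it as the prefix. It is not the fix referenced in this thread. The function name and the example read name are hypothetical, and the exact layout of the real read names is an assumption; only the 18-character length and the barcode sequences come from the thread.

```python
CODEC_BARCODE_LEN = 18  # length of the CODEC sample barcodes in the sample sheet

# All barcode sequences listed in the sample sheet earlier in the thread.
SHEET_BARCODES = {
    "GAGCCTACTCAGTCAACG", "GTGTCGAACACTTGACGG",  # sample_A
    "CTTGAACGGACTGTCCAC", "CACCGAGCGTTAGACTAC",  # sample_B
}


def split_trailing_barcode(read_name, length=CODEC_BARCODE_LEN):
    """Split a read name into (prefix, trailing barcode) by fixed length.

    Hypothetical helper: assumes the CODEC barcode is the final `length`
    characters of the name, whatever precedes it.
    """
    name = read_name.rstrip()
    if len(name) < length:
        raise ValueError(f"read name shorter than {length} characters: {name!r}")
    return name[:-length], name[-length:]


# Hypothetical read name: a placeholder Illumina index (NNNNNNNN) written by
# bcl2fastq, followed directly by a CODEC barcode from the sample sheet.
example = "@READ_NAME 1:N:0:NNNNNNNN" + "GAGCCTACTCAGTCAACG"

prefix, codec_barcode = split_trailing_barcode(example)
print("prefix:       ", prefix)
print("CODEC barcode:", codec_barcode)
print("in sample sheet:", codec_barcode in SHEET_BARCODES)
```

Slicing from the end of the name, rather than from a fixed offset after the Illumina comment field, keeps the extraction independent of how long the preceding Illumina barcode is.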

wclee47 commented 1 year ago

Oh, that's great. It totally makes sense. As you know, the data already went through the bcl2fastq step once, and that's why there are Illumina sample barcodes in the header line. I'll wait for your update to the code. Please let me know when it's done. Thanks a lot! :)

ruolin commented 1 year ago

I have a fix which is here. You need to apply it to the demux step, which I haven't actually tested (because I don't have the input for demux). But I think this will work. Let me know how it goes.

wclee47 commented 1 year ago

Ruolin, thanks so much for this quick update!!! Let me test it again and close the ticket if it works.

wclee47 commented 1 year ago

It worked. Thanks a lot for your help!