broadinstitute / chem-bio-dos-del

Initiated 2021Q4 for code related to the Broad Chemical Biology DNA-encoded library (DEL) analysis and visualization pipeline
MIT License

Step through generate_stats_table.R on HiSeq log file #5

Open remontoire-pac opened 2 years ago

codewarrior2000 commented 1 year ago

The generate_stats_table.R script cannot handle a sample that is spread over 8 lanes in HiSeq. It fails to tally "total reads", "total unique reads", "valid_reads", and "valid_barcodes" across all 8 lanes and consequently reports incorrect values for them.

On Mon, Aug 8, 2022 at 3:18 PM Shuang Liu <lius@broadinstitute.org> wrote:

Hi Larry,

In the latest SET Scripps HiSeq stats that Trevor helped generate, I noticed that the counts are greater than the total reads (and sometimes also greater than the total unique reads/valid reads), which does not make sense.

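[image: stats_table_HiSeq] https://user-images.githubusercontent.com/1629353/183512970-d4f0723d-8abc-44cb-8921-168bf5df0da7.png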

The trend in the same screen's MiSeq was still normal (counts < total reads/total unique reads/valid reads)

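[image: stats_table_MiSeq] https://user-images.githubusercontent.com/1629353/183513001-a3ab98c4-3afa-418b-9417-ce48fc480597.png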

I believe the stats for CDoT's HiSeq and Zher Yin's FKBP12 HiSeq also had the same problem. Do you know what might have gone wrong?

Thanks, Shuang
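One way to cross-check a "total reads" value independently of generate_stats_table.R is to count FASTQ records across all of a sample's lane files. The following is a minimal sketch, not part of the pipeline: the function name `count_reads` and the Illumina-style file names in the usage line are assumptions, and it assumes standard 4-line FASTQ records in gzip-compressed files.

```shell
# Hypothetical helper: print the total number of FASTQ records
# across all gzip-compressed files passed as arguments.
# Assumes every record occupies exactly 4 lines (no wrapped sequences).
count_reads() {
  # gzip -dc decompresses each file in turn to stdout, so awk sees
  # the lanes as one stream; NR is the total line count at the end.
  gzip -dc "$@" | awk 'END { print NR / 4 }'
}

# Usage (file names are illustrative only):
#   count_reads sampleA_L00[1-8]_R1_001.fastq.gz
```

If the value reported for a sample matches a single lane rather than the sum over all 8, the per-lane files are not being tallied together.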

remontoire-pac commented 1 year ago

As we learn about these issues, please make sure they are documented in GitHub (perhaps this is happening already?)!



codewarrior2000 commented 1 year ago

> As we learn about these issues, please make sure they are documented in GitHub (perhaps this is happening already?)!

Yes, new information about this issue was documented in GitHub an hour before your post.

codewarrior2000 commented 1 year ago

Follow-up comment with Zachary Severance's valuable contribution on the matter:

On Mon, Aug 8, 2022 at 7:00 PM Zachary Severance <zseveran@broadinstitute.org> wrote:

Hi all,

I think Bruce may have used an additional command to generate the HiSeq stats that concatenated the 8 lanes of HiSeq fastq files for each sample into 1 file for the HiSeq stats generation.

When we ran our DARPA screens the HiSeq stats were correct because I had all the fastq files for each sample concatenated prior to giving them to Larry for counts generation.

In the future, I think we can either cat all the HiSeq lanes for each sample prior to input in the app, or add some type of cat command to the existing MiSeq stats script.

https://www.biostars.org/p/136025/

Thanks, Zach
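The first option Zach mentions, concatenating the HiSeq lanes for each sample before input, could look roughly like the sketch below. It is not the command Bruce or Zach actually used: the function name `merge_lanes`, the directory arguments, and the Illumina-style naming (`<sample>_L001_R1_001.fastq.gz` through `<sample>_L008_R1_001.fastq.gz`) are assumptions for illustration. It relies on the fact that gzip files joined with `cat` form a valid multi-member gzip stream, so nothing needs to be decompressed.

```shell
# Hypothetical helper: merge per-lane FASTQ files into one file per sample.
# Assumes names like <sample>_L001_R1_001.fastq.gz ... <sample>_L008_R1_001.fastq.gz
# and that every sample has a lane-1 file.
merge_lanes() {
  local in_dir="$1" out_dir="$2"
  mkdir -p "$out_dir"
  for f in "$in_dir"/*_L001_R1_001.fastq.gz; do
    # Skip the literal pattern if no lane-1 files exist.
    [ -e "$f" ] || continue
    # Recover the sample prefix by stripping the lane-1 suffix.
    sample=$(basename "$f" _L001_R1_001.fastq.gz)
    # The glob expands in sorted order, so lanes are appended L001..L008.
    cat "$in_dir/${sample}"_L00[1-8]_R1_001.fastq.gz \
      > "$out_dir/${sample}_R1_001.fastq.gz"
  done
}

# Usage (paths are illustrative only):
#   merge_lanes raw_hiseq merged_hiseq
```

Writing the merged files to a separate directory keeps the original per-lane files untouched, so the stats can be regenerated either way for comparison.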