broadinstitute / chem-bio-dos-del

Initiated 2021Q4 for code related to the Broad Chemical Biology DNA-encoded library (DEL) analysis and visualization pipeline
MIT License

handling poor quality LibraryID sequencing: NovaSeq #15

Open tzandi2 opened 1 year ago

tzandi2 commented 1 year ago

As we transition to NovaSeq, we should be aware that sequencing quality for low-diversity sequences will deteriorate compared to HiSeq. This is problematic for the libraryID portion of the insert, especially when libraries are not pooled, or when the pool of libraries is skewed in population towards one library. The usual compensation is to add up to 20% PhiX DNA (a high-diversity standard sequencing library). Unfortunately, the reads spent on PhiX are lost to us: fewer reads means lower counts, and lower counts mean wider enrichment confidence intervals.
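To put a rough number on the cost of the PhiX spike-in, here is a back-of-the-envelope sketch. It assumes that confidence-interval width scales as 1/sqrt(number of usable reads), which is the usual behaviour for count-based (Poisson-like) estimates; the pipeline's actual interval calculation may differ, and the function name is purely illustrative.

```python
import math


def relative_ci_width(phix_fraction: float) -> float:
    """Ratio of CI width with a PhiX spike-in to CI width without it.

    Assumption: CI width is proportional to 1 / sqrt(usable reads),
    and the PhiX fraction of reads carries no DEL information.
    """
    if not 0.0 <= phix_fraction < 1.0:
        raise ValueError("phix_fraction must be in [0, 1)")
    usable_fraction = 1.0 - phix_fraction
    return 1.0 / math.sqrt(usable_fraction)


# Compare a few spike-in levels, up to the 20% mentioned above.
for fraction in (0.01, 0.05, 0.20):
    widening = relative_ci_width(fraction)
    print(f"{fraction:.0%} PhiX: {1 - fraction:.0%} usable reads, "
          f"CI roughly {widening:.3f}x as wide")
```

Under that assumption, a 20% spike-in costs a fifth of the reads and widens intervals by roughly 12%.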

Would it be feasible to modify the count generation code with an option to ignore library ID mismatches?

That way, for the cases where we do not pool libraries within a given indexed sample, we can use the index sequences themselves to determine barcodes. And if we do pool libraries, we will have high diversity and be able to sequence the library ID. In both cases, we will avoid the need for high concentrations of PhiX during sequencing.
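A minimal sketch of what such an option could look like. This is not the pipeline's actual count code: `count_barcodes`, `ignore_lib_id_mismatch`, `assumed_lib_id`, and the fixed-position read layout are all hypothetical names and assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Set, Tuple


def count_barcodes(
    reads: Iterable[str],
    lib_id_slice: slice,
    barcode_slice: slice,
    known_lib_ids: Set[str],
    ignore_lib_id_mismatch: bool = False,
    assumed_lib_id: Optional[str] = None,
) -> Counter:
    """Count (library ID, barcode) pairs from read sequences.

    Strict mode (default): a read is counted only if its library ID
    region exactly matches a known library ID.
    Lenient mode: the sequenced library ID region is ignored and every
    read is assigned `assumed_lib_id`, e.g. the single library known
    (from the sample index) to be present in an un-pooled sample.
    """
    if ignore_lib_id_mismatch and assumed_lib_id is None:
        raise ValueError("assumed_lib_id is required when ignoring mismatches")

    counts: Counter = Counter()
    for seq in reads:
        if ignore_lib_id_mismatch:
            lib_id = assumed_lib_id
        else:
            lib_id = seq[lib_id_slice]
            if lib_id not in known_lib_ids:
                continue  # unreadable or unknown library ID: drop the read
        key: Tuple[str, str] = (lib_id, seq[barcode_slice])
        counts[key] += 1
    return counts


# Toy reads: 4-nt library ID followed by a 6-nt barcode (made-up layout).
reads = ["ACGTAAACCC", "NNNNAAACCC", "ACGTGGGTTT"]
strict = count_barcodes(reads, slice(0, 4), slice(4, 10), {"ACGT"})
lenient = count_barcodes(
    reads, slice(0, 4), slice(4, 10), {"ACGT"},
    ignore_lib_id_mismatch=True, assumed_lib_id="ACGT",
)
print(strict)
print(lenient)
```

In strict mode the read with the unreadable library ID is dropped; in lenient mode it is rescued and counted towards its barcode. For pooled samples the option would stay off, since the library ID is then the only way to tell libraries apart.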

tzandi2 commented 1 year ago

@remontoire-pac @codewarrior2000 @lius-broad @zheryin

See above FYI regarding NovaSeq.

For now, I am going to test hardcoding the known lib_id sequences into my NovaSeq run's fastq files (which have garbage data where the lib_id seq should be) and compare results to a HiSeq run of the same dataset.

tzandi2 commented 1 year ago

The libraryID and the connector sequences between each cycle all suffer from low-diversity issues in sequencing. I have written a script that hardcodes the known libraryID, as well as the connector nucleotides between each cycle, into each sequence in a fastq file. Running these processed fastq files through the pipeline gives valid reads and a similar enrichment rank order to a previous HiSeq run of the same dataset.
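For anyone wanting to reproduce the idea, the following is a rough sketch of such a patching step, not the script referred to above. The positions and sequences in `FIXED_REGIONS` are made up for illustration; the real offsets depend on the library design. It assumes standard four-line FASTQ records.

```python
import io
from typing import Dict, TextIO


def patch_sequence(seq: str, fixed_regions: Dict[int, str]) -> str:
    """Overwrite known constant regions (start offset -> known bases)."""
    bases = list(seq)
    for start, known in fixed_regions.items():
        end = start + len(known)
        if start < 0 or end > len(bases):
            raise ValueError(f"region {start}:{end} outside read of length {len(bases)}")
        bases[start:end] = known  # same length in and out, so read length is kept
    return "".join(bases)


def patch_fastq(src: TextIO, dst: TextIO, fixed_regions: Dict[int, str]) -> int:
    """Copy four-line FASTQ records from src to dst, patching each sequence line.

    Quality strings are copied unchanged. Returns the number of records written.
    """
    n_records = 0
    while True:
        header = src.readline()
        if not header:
            break  # end of file
        seq = src.readline().rstrip("\n")
        plus = src.readline()
        qual = src.readline()
        dst.write(header)
        dst.write(patch_sequence(seq, fixed_regions) + "\n")
        dst.write(plus)
        dst.write(qual)
        n_records += 1
    return n_records


# Hypothetical layout: 4-nt library ID at offset 0, 2-nt connector at offset 10.
FIXED_REGIONS = {0: "ACGT", 10: "GA"}

src = io.StringIO("@read1\nNNNNAAACCCNNTTT\n+\nIIIIIIIIIIIIIII\n")
dst = io.StringIO()
n_written = patch_fastq(src, dst, FIXED_REGIONS)
print(dst.getvalue())
```

Note that this sketch leaves the quality scores at the patched positions as they were. If a downstream step filters on base quality, those positions may need a placeholder score as well.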