Open alessandro-vai opened 1 month ago
Hi Alessandro,
Thank you, I am happy to hear that you find SpikeFlow useful.
Regarding the read split, SpikeFlow currently does not handle the situation you describe.
While testing several paired-end samples with Spiker (see here: https://github.com/liguowang/spiker/blob/main/bin/split_bam.py), I noticed that in the quality control table it generates (columns: unmapped_reads, qcfail_reads, duplicate_reads, secondary_reads, low_maq, diff_genome), all the values were zero (or negligible). diff_genome is the number of discordant mates, meaning pairs whose mates align to different genomes. The only exception was low_maq, which had many reads and could affect the normalization factor calculation. Consequently, I only implemented the MAPQ filter in SpikeFlow.
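For illustration, here is a minimal sketch of how a table with these columns could be tallied from alignment records. It is not Spiker's or SpikeFlow's actual code: the `EXO_` chromosome prefix, the MAPQ threshold of 30, the order of the checks, and the function names are all assumptions made for the example. Only the SAM flag bits come from the SAM specification.

```python
from collections import Counter

# SAM flag bits, as defined in the SAM specification.
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8
SECONDARY, QCFAIL, DUPLICATE = 0x100, 0x200, 0x400

EXO_PREFIX = "EXO_"  # hypothetical tag marking spike-in chromosomes
MIN_MAPQ = 30        # hypothetical mapping-quality threshold


def is_exogenous(chrom):
    """True if the chromosome name carries the (assumed) spike-in prefix."""
    return chrom.startswith(EXO_PREFIX)


def same_genome(chrom, mate_chrom):
    """In SAM, an RNEXT of '=' means the mate is on the same chromosome."""
    if mate_chrom == "=":
        return True
    return is_exogenous(chrom) == is_exogenous(mate_chrom)


def classify(flag, chrom, mapq, mate_chrom):
    """Assign one alignment record to the first QC category it matches."""
    if flag & UNMAPPED:
        return "unmapped_reads"
    if flag & QCFAIL:
        return "qcfail_reads"
    if flag & DUPLICATE:
        return "duplicate_reads"
    if flag & SECONDARY:
        return "secondary_reads"
    if mapq < MIN_MAPQ:
        return "low_maq"
    mate_mapped = (flag & PAIRED) and not (flag & MATE_UNMAPPED)
    if mate_mapped and not same_genome(chrom, mate_chrom):
        return "diff_genome"
    return "kept"


def qc_table(records):
    """Count records per category; each record is (flag, chrom, mapq, mate_chrom)."""
    return Counter(classify(*record) for record in records)
```

Called on a list of `(flag, chrom, mapq, mate_chrom)` tuples, `qc_table` returns a counter keyed by the column names above, plus a `kept` entry for records that pass every check.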
Anyway, it might be helpful to keep those metrics in case an experiment goes wrong and you find high levels of discordant mates.
I am working on a new version of SpikeFlow, which will generate the signal tracks for the spike-in for QC purposes (as recently pointed out in this Nature commentary: https://www.nature.com/articles/s41587-024-02377-y). I will also introduce a count and removal of the discordant mates as you suggested, although, as I said, it should not have much impact on the final outcome of the analysis for most ChIP-Rx experiments.
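As a rough sketch of what counting and removing discordant mates could look like (this is not the planned SpikeFlow implementation), the snippet below groups alignments by read name and separates pairs whose mates land on different genomes. The `EXO_` prefix and the function names are hypothetical, and the input is assumed to contain only primary, mapped alignments.

```python
from collections import defaultdict

EXO_PREFIX = "EXO_"  # hypothetical tag marking spike-in chromosomes


def is_exogenous(chrom):
    """True if the chromosome name carries the (assumed) spike-in prefix."""
    return chrom.startswith(EXO_PREFIX)


def split_pairs(alignments):
    """Split read pairs into reference, exogenous, and discordant.

    alignments: iterable of (read_name, chrom) tuples, one per mate.
    Returns three lists of read names, in order of first appearance.
    """
    chroms_by_name = defaultdict(list)
    for name, chrom in alignments:
        chroms_by_name[name].append(chrom)

    reference, exogenous, discordant = [], [], []
    for name, chroms in chroms_by_name.items():
        genomes = {is_exogenous(chrom) for chrom in chroms}
        if len(genomes) > 1:
            # Mates map to different genomes: count them and drop the pair.
            discordant.append(name)
        elif True in genomes:
            exogenous.append(name)
        else:
            reference.append(name)
    return reference, exogenous, discordant
```

With this grouping, a pair where read1 maps to a reference chromosome and read2 to an exogenous one ends up in `discordant`, so neither mate stays in the reference or spike-in set, and `len(discordant)` gives the count for the QC report.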
Best,
Davide
Cool, the number of those reads is indeed small. I just wanted to let you know.
Good to hear that you are working on new features!
All the best, Alessandro
From: Davide Bressan
Sent: Wednesday, October 2, 2024 11:33:25 AM
To: DavideBrex/SpikeFlow
Cc: alessandro-vai; Author
Subject: Re: [DavideBrex/SpikeFlow] Exogenous Reads Not Being Discarded After Splitting (Issue #22)
Hi!
First, I want to say that I really enjoy using SpikeFlow!
I’ve encountered an issue while working with paired-end data. I noticed that some exogenous reads remain in the reference BAM file after the splitting step. Specifically, this seems to happen when read1 maps to a reference chromosome but read2 maps to an exogenous chromosome.
Shouldn't these reads be discarded as well?
Thank you in advance.
Alessandro