IHEC / ihec-assay-standards

This repo is for code and documentation associated with the ihec-assay-standards working group
Apache License 2.0
5 stars 5 forks source link

ChIP metrics total_reads count #6

Open paulstretenowich opened 5 years ago

paulstretenowich commented 5 years ago

@sitag @dbujold

The _totalreads count is done by parsing the flagstat file and corresponds to the "in total" value but if we don't substract the "secondary" value we can end up with more reads than what are in the fastq files. I faced that issue in one sample I'm analyzing. I suggest to substract the "secondary" value to the "in total" value to have the "real" _totalreads count?

Paul

sitag commented 5 years ago

@paulstretenowich Which flagstats? I though alignment in ENCODE pipeline was aln mode?

paulstretenowich commented 5 years ago

It's in calculate_ChIP-seq_QC_metrics.bash lines 100 and 103.

sitag commented 5 years ago

Okay. This is not encode pipeline. How are the bams aligned? If I remember right then encode uses bwa aln to align data. aln mode does not have supplementary i think. So I am confused where your bams are coming from.

sitag commented 5 years ago

if you are using bams aligned with a different pipeline then you need to subtract supplementary alignments. supplementary reads may not even be included in flagstats if you use and old samtools

paulstretenowich commented 5 years ago

In our Pipelines (GenPipes), we are calculating IHEC metrics based on calculate_ChIP-seq_QC_metrics.bash for ChIP-Seq. We are aligning using bwa-mem and have some "secondary" reads but none "supplementary". As it's the exact same metrics, to know the "real" _totalreads count don't we need to substract the "secondary" reads from flagstat?

sitag commented 5 years ago

@paulstretenowich yes... we should subtract both secondary and supplementary counts for both total and mapped

paulstretenowich commented 5 years ago

@sitag For mapped too? I'm modifying the code for our pipelines so I can do the same modification in the IHEC repo if you want?

sitag commented 5 years ago

@paulstretenowich thanks! send a pull request... but first we will need test cases. are you tests though. do you have tests in mind?

i think we need it for mapped too (secondary alignments should not count? are they included in mapped reads count? i thought they might be based on last time i looked at samtools) @dbujold thoughts?

sitag commented 5 years ago

@paulstretenowich do we have a fix?

dbujold commented 5 years ago

@sitag I'd say they should not count. Paul will submit a PR shortly, thanks!

paulstretenowich commented 5 years ago

@sitag Yes they're included in mapped reads counts. I fixed it in GenPipes but I was waiting for @dbujold to confirm the correction, sorry for the late reply. The PR is coming.

paulstretenowich commented 4 years ago

@sitag @dbujold, what is your mind about either using sambamba for the metrics or upgrade samtools version to use multi-threading to speed up the metrics calculation?

sitag commented 4 years ago

@paulstretenowich @dbujold These scripts are deprecated since either we need to use the values ENCODE pipeline is computing for all these things - e.g. FrIP, SJ and bamstats, or we need to roll these into the integrative chip pipeline and use whatever - sambamba / samtools is on the ENCODE image instead of relying on whatever is present locally (older samtools has different flagstat fields). I don't think it make sense to maintain these - the ChIP pipeline is already doing all this work, no need to replicate it. The ENCODE computations for SJ is using the same bin size; I doubt that's making too much difference.. I will need to look.

Either way, we need to decide. pinging @martinhirst @dzerbino

sitag commented 4 years ago

Note I don't think anyone is reporting the metrics from their own pipeline. Not sure anyone is looking at these stats. If you still want to compute them on your local analysis, we should still run this computation off the encode container image.

martinhirst commented 4 years ago

I am fine to replace samtools with sambamba to allow for multithreading for the metrics calculation.

Martin

On Nov 11, 2019, at 10:21 PM, sitag notifications@github.com<mailto:notifications@github.com> wrote:

Note I don't think anyone is reporting the metrics from their own pipeline. Not sure anyone is looking at these stats. If you still want to compute them on your local analysis, we should still run this computation off the encode container image.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/IHEC/ihec-assay-standards/issues/6?email_source=notifications&email_token=ACK66UG6COQHFVSFC2G5VWDQTJDOFA5CNFSM4I2RYAY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDZFMCI#issuecomment-552752649, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ACK66UH4MGIXRFD4VGTN4WTQTJDOFANCNFSM4I2RYAYQ.