ENCODE-DCC / atac-seq-pipeline

ENCODE ATAC-seq pipeline
MIT License
385 stars 172 forks source link

incorrect computations for NFR and "Presence of NFR peak', 'Presence of Mono-Nuc peak','Presence of Di-Nuc peak'. #113

Closed yongchao closed 5 years ago

yongchao commented 5 years ago

Describe the bug Incorrect computations of NFR and "Presence of NFR peak', 'Presence of Mono-Nuc peak','Presence of Di-Nuc peak'.

Details description

The source code for defining NFR is at https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/841867d3e188e0e3be9b8e3f45caf3ba4f1df041/src/run_ataqc.py#L895 NFR_UPPER_LIMIT = 150 percent_nfr = data[:NFR_UPPER_LIMIT].sum() / data.sum()

"data" variable is a 2D matrix obtained from the histogram with the picard command CollectInsertSizeMetrics. The first column insert_size is in bp and the second column All_Reads.fr_count is the counts of the fragment with the specified insert_size.

So the computation data[:150].sum is NOT to count the number of fragments less than 150bp, but to take the first 150 rows of data, and then sum the two columns altogether.

This programming has two flaws, a) the first column (size) should NOT be in the computation, and second, the first 150 rows may be corresponding to different insert sizes with different bam file. For the particular data I have looked at, it is between 20bp-170bp, which may include some of the mono-nucleasome regions.

There are similar flaws in determining "Presence of NFR peak', 'Presence of Mono-Nuc peak','Presence of Di-Nuc peak'.

OS/Platform and dependencies

Attach error logs For Cromwell users only. 1) Move to your working directory where you ran a pipeline. You should be able to find a directory named cromwell-executions/ which includes all outputs and logs for debugging.

2) Run the following command line to print all non-empty STDERR outputs. This will be greatly helpful for developers to figure out the problem. Copy-paste its output to the issue page.

$ find -name stderr -not -empty | xargs tail -n +1

Picked up _JAVA_OPTIONS: -Xms256M -Xmx15000M -XX:ParallelGCThreads=1 -Djava.io.tmpdir=/sc/orga/projects/sealfs01a/sealfonlab_seq_projects/hiseq_139/atac_seq/encode_pipeline/cromwell-executions/atac/074c8620-d6d6-47ab-801b-7ed8ec44d21b/call-ataqc/shard-0/tmp.5c4484a3 Picked up _JAVA_OPTIONS: -Xms256M -Xmx15000M -XX:ParallelGCThreads=1 -Djava.io.tmpdir=/sc/orga/projects/sealfs01a/sealfonlab_seq_projects/hiseq_139/atac_seq/encode_pipeline/cromwell-executions/atac/074c8620-d6d6-47ab-801b-7ed8ec44d21b/call-ataqc/shard-0/tmp.5c4484a3 [bam_sort_core] merging from 64 files... Picked up _JAVA_OPTIONS: -Xms256M -Xmx15000M -XX:ParallelGCThreads=1 -Djava.io.tmpdir=/sc/orga/projects/sealfs01a/sealfonlab_seq_projects/hiseq_139/atac_seq/encode_pipeline/cromwell-executions/atac/074c8620-d6d6-47ab-801b-7ed8ec44d21b/call-ataqc/shard-0/tmp.5c4484a3 processing chromosomes

3) (OPTIONAL) Run the following command to collect all logs. For developer's convenience, please add [ISSUE_ID] to the name of the tar ball file. This command will generate a tar ball including all debugging information. Post an issue with the tar ball (.tar.gz) attached.

$ find . -type f -name 'stdout' -or -name 'stderr' -or -name 'script' -or \
-name '*.qc' -or -name '*.txt' -or -name '*.log' -or -name '*.png' -or -name '*.pdf' \
| xargs tar -zcvf debug_[ISSUE_ID].tar.gz

debug.tar.gz

akundaje commented 5 years ago

Thanks for pointing this out. We'll look into it and fix ASAP

vervacity commented 5 years ago

Sorry I understand the peak presence issue - will change the PR request to include that fix

yongchao commented 5 years ago

Thanks for the fix. The code is now doing what is supposed to be done.

On Thu, Apr 18, 2019 at 9:36 PM Jin Lee notifications@github.com wrote:

Closed #113 https://github.com/ENCODE-DCC/atac-seq-pipeline/issues/113 via #114 https://github.com/ENCODE-DCC/atac-seq-pipeline/pull/114.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ENCODE-DCC/atac-seq-pipeline/issues/113#event-2287270298, or mute the thread https://github.com/notifications/unsubscribe-auth/AALYMN526O5QPRDFHASM5XLPREOYTANCNFSM4HFIZ5ZQ .