Closed friederhadlich closed 4 years ago
Thanks for creating an issue. Do you have the possibility to share the files with me?
The error is here:
https://github.com/biocore-ntnu/epic2/blob/master/epic2/src/reads_to_bins.pyx#L61
The error message likely means that either one of your chromsizes or one of the positions in the alignment files are above the uint32 max... Or that they were a negative number and got misinterpreted as a large int.
If you cannot share the files I suggest you go through the starts and ends in your alignment files and see what the max and min value of all the read positions are. Sometimes nonaligned reads are given a start and end of -1 for example.
I used Bos_taurus.ARS-UCD1.2.dna.toplevel.fa as genomic reference for the BWA mapping. Real data cannot be uploaded ... I will check the mapping results for erroneous positions and let you know. Thanks a lot!
Try converting to bed files with bedtools and see if it works then? Good luck and feel free to ping me for anything. Without the files it is impossible for me to debug as I am sure you understand :)
On Mon, Jul 13, 2020 at 2:51 PM friederhadlich notifications@github.com wrote:
I used Bos_taurus.ARS-UCD1.2.dna.toplevel.fa as genomic reference for the BWA mapping. Real data cannot be uploaded ... I will check the mapping results for erroneous positions and let you know. Thanks a lot!
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic2/issues/36#issuecomment-657541892, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEHURUXFBZS5Z56PIDPDQXLR3L7OZANCNFSM4OYOFEEQ .
Unfortunately, bed files aren't the solution for me! I tried control and treatment inputs with bam and bed format without any effect.
The problem must be in parameter "fs" ... please see these calls:
epic2 -t ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam -bin 200 -g 3 -fdr 0.05 -fs 300 -egf 0.8 -c ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > H3K27Ac_MACT_p59_E.txt
Found a median readlength of 50.0
Using chromosome sizes found in Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.
Using an effective genome length of ~2172 * 1e6
Parsing ChIP file(s): ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam Traceback (most recent call last): File "/home/hadlich/miniconda/bin/epic2", line 248, in
_main(args) File "/home/hadlich/miniconda/lib/python3.7/site-packages/epic2/main.py", line 40, in _main args["treatment"], args, "ChIP") File "epic2/src/reads_to_bins.pyx", line 183, in epic2.src.reads_to_bins.files_to_bin_counts File "epic2/src/reads_to_bins.pyx", line 309, in epic2.src.reads_to_bins.files_to_bin_counts File "epic2/src/reads_to_bins.pyx", line 61, in epic2.src.reads_to_bins.remove_out_of_bounds_bins OverflowError: Python int too large to convert to C long
epic2 -t ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam -bin 200 -g 3 -fdr 0.05 -egf 0.8 -c ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > H3K27Ac_MACT_p59_E.txt
Found a median readlength of 50.0
Using chromosome sizes found in Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.
Using an effective genome length of ~2172 * 1e6
Parsing ChIP file(s): ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam
Valid ChIP reads: 11245276 (11245276 before out of bounds removal)
Score threshold: 19.652
Number of tags in a window: 3
Number of islands found: 39544
Parsing Input file(s): ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed
Valid Background reads: 23502103 (23502128 before out of bounds removal)
Chromosome positions of the mapping file are fine for treatment and control, as you can see here for control:
cat ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed | cut -f 3 | awk 'BEGIN{a=10000;b=0};{if ($0<a) a=$0; if ($0>b) b=$0}; END{print a,b;}'
37 158533984
cat ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed | cut -f 2 | awk 'BEGIN{a=10000;b=0};{if ($0<a) a=$0; if ($0>b) b=$0}; END{print a,b;}'
0 158533934
One more running example:
epic2 -t ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam -bin 500 -g 3 -fdr 0.05 -fs 300 -egf 0.8 -c ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > H3K27Ac_MACT_p59_E.txt
I will look at it in a few weeks. Taking a vacation now :)
Thanks for providing more details :)
On Wednesday, July 15, 2020, friederhadlich notifications@github.com wrote:
One more running example: epic2 -t ../3_bwa/H3K27Ac_MACT_p59_E.sorted.bam -bin 500 -g 3 -fdr 0.05 -fs 300 -egf 0.8 -c ../3_bwa/INPUT_pool_MACT_E.sorted.bam.bed -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > H3K27Ac_MACT_p59_E.txt
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic2/issues/36#issuecomment-658719246, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEHURURJZXZ7WV7FYTLFOQTR3WIYPANCNFSM4OYOFEEQ .
I am also seeing this error -- but I figured out a temporary workaround. I am using hg38 chrom.sizes from UCSC link and a paired-end fastq file which I aligned with bwa and converted to a bed file.
Here is the epic2 command:
epic2 -t RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed -e 100 -fdr .01 --effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed
And the output:
Found a median fragment size of 183.5
Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.
Using an effective genome length of ~2167 * 1e6
Parsing ChIP file(s):
RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed
Traceback (most recent call last):
File "/home/UTHSCSA/millerh1/miniconda3/envs/RSeqEnv/bin/epic2", line 248, in <module>
_main(args)
File "/home/UTHSCSA/millerh1/miniconda3/envs/RSeqEnv/lib/python3.7/site-packages/epic2/main.py", line 40, in _main
args["treatment"], args, "ChIP")
File "epic2/src/reads_to_bins.pyx", line 183, in epic2.src.reads_to_bins.files_to_bin_counts
File "epic2/src/reads_to_bins.pyx", line 309, in epic2.src.reads_to_bins.files_to_bin_counts
File "epic2/src/reads_to_bins.pyx", line 61, in epic2.src.reads_to_bins.remove_out_of_bounds_bins
OverflowError: Python int too large to convert to C long
The error goes away when I remove all the non-standard chromosomes from the chrom.sizes
file, though I don't understand why. Regardless most people will have these non-standard chromosomes so it would be great if using them was supported.
Here's chrom.sizes
modified by me:
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chrX 156040895
chr8 145138636
chr9 138394717
chr11 135086622
chr10 133797422
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr20 64444167
chr19 58617616
chrY 57227415
chr22 50818468
chr21 46709983
The output of the same epic2 command as before, but with the updated chrom.sizes
:
effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed
Found a median fragment size of 183.5
Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.
Using an effective genome length of ~2085 * 1e6
Parsing ChIP file(s):
RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed
Chromosome chr10_GL383545v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
Chromosome chr10_GL383545v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
...
Chromosome chrUn_KI270757v1 not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
Chromosome chrX_KI270880v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
Chromosome chrX_KI270913v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
Valid ChIP reads: 24012551 (24017420 before out of bounds removal)
Score threshold: 27.632
Number of tags in a window: 5
Number of islands found: 6736
Empirical estimate of FDR is: 0.014845605700712588 (100/6736)
Hope this is helpful!
Best, Henry
I will definitely try to fix this soon. Will test with non-standard chromosomes and see what happens.
On Tue, Jul 28, 2020 at 11:16 PM Henry Miller notifications@github.com wrote:
I am also seeing this error -- but I figured out a temporary workaround. I am using hg38 chrom.sizes from UCSC link https://hgdownload-test.gi.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes and a paired-end fastq file which I aligned with bwa and converted to a bed file.
Here is the epic2 command:
epic2 -t RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed -e 100 -fdr .01 --effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed
And the output:
Found a median fragment size of 183.5
Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.
Using an effective genome length of ~2167 * 1e6
Parsing ChIP file(s): RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed Traceback (most recent call last): File "/home/UTHSCSA/millerh1/miniconda3/envs/RSeqEnv/bin/epic2", line 248, in
_main(args) File "/home/UTHSCSA/millerh1/miniconda3/envs/RSeqEnv/lib/python3.7/site-packages/epic2/main.py", line 40, in _main args["treatment"], args, "ChIP") File "epic2/src/reads_to_bins.pyx", line 183, in epic2.src.reads_to_bins.files_to_bin_counts File "epic2/src/reads_to_bins.pyx", line 309, in epic2.src.reads_to_bins.files_to_bin_counts File "epic2/src/reads_to_bins.pyx", line 61, in epic2.src.reads_to_bins.remove_out_of_bounds_bins OverflowError: Python int too large to convert to C long The error goes away when I remove all the non-standard chromosomes from the chrom.sizes file, though I don't understand why. Regardless most people will have these non-standard chromosomes so it would be great if using them was supported.
Here's chrom.sizes modified by me:
chr1 248956422 chr2 242193529 chr3 198295559 chr4 190214555 chr5 181538259 chr6 170805979 chr7 159345973 chrX 156040895 chr8 145138636 chr9 138394717 chr11 135086622 chr10 133797422 chr12 133275309 chr13 114364328 chr14 107043718 chr15 101991189 chr16 90338345 chr17 83257441 chr18 80373285 chr20 64444167 chr19 58617616 chrY 57227415 chr22 50818468 chr21 46709983
The output of the same epic2 command as before, but with the updated chrom.sizes:
effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed Found a median fragment size of 183.5
Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.
Using an effective genome length of ~2085 * 1e6
Parsing ChIP file(s): RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed Chromosome chr10_GL383545v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY Chromosome chr10_GL383545v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY ... Chromosome chrUn_KI270757v1 not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY Chromosome chrX_KI270880v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY Chromosome chrX_KI270913v1_alt not in the chromosome sizes: chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY
Valid ChIP reads: 24012551 (24017420 before out of bounds removal)
Score threshold: 27.632
Number of tags in a window: 5
Number of islands found: 6736
Empirical estimate of FDR is: 0.014845605700712588 (100/6736)
Hope this is helpful!
Best, Henry
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic2/issues/36#issuecomment-665287584, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEHURUULS44QJPLXKM365ZLR5452HANCNFSM4OYOFEEQ .
I have now started my sleuthing to fix this problem.
There is one thing which I did not think of communicating: if you include chromosomes in the chromsizes file that are not in the reads, the algorithm becomes more conservative. It computes a Poisson parameter based on the number of reads in the genome. With more chromosomes, the genome becomes longer, and hence less sparsely populated.
I have tried:
but have been unable to break epic2. If anyone can upload a breaking example I will do my best to fix this problem. But I have not been able to reproduce. Version: 0.0.41
. If the data is semi-sensitive you can 7zip with a password and send the file by e-mail.
epic2 -bin 200 -g 3 -fdr 0.05 -t /mnt/cargo/endrebak/control.sorted.filtered.bam -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > /dev/null
Found a median readlength of 50.0
Using chromosome sizes found in Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.
Using an effective genome length of ~2344 * 1e6
Parsing ChIP file(s):
/mnt/cargo/endrebak/control.sorted.filtered.bam
Valid ChIP reads: 1949548 (1949548 before out of bounds removal)
Score threshold: 22.615000000000002
Number of tags in a window: 1
Number of islands found: 16028
Empirical estimate of FDR is: 0.062390816071874224 (1000/16028)
I.e. was not able to reproduce. Tomorrow I will wrap the affected code in a try/catch and print the error with detailed information. Then you can report that info back to me and I will try my best to understand and fix it :)
Hi Endre,
please use treatment and control argument to reproduce the error ... i.e.
epic2 -t treatment.sorted.bam -bin 200 -g 3 -fdr 0.05 -fs 300 -egf 0.8 -c control.sorted.filtered.bam -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes
Good luck, Frieder
epic2 -fs 300 -bin 200 -g 3 -fdr 0.05 -t /mnt/cargo/endrebak/treatment.sorted.filtered.bam -c /mnt/cargo/endrebak/control.sorted.filtered.bam -cs Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes > /dev/null
Found a median readlength of 50.0
Using chromosome sizes found in Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes.
Using an effective genome length of ~2344 * 1e6
Parsing ChIP file(s):
/mnt/cargo/endrebak/treatment.sorted.filtered.bam
Valid ChIP reads: 1185813 (1185813 before out of bounds removal)
Score threshold: 17.337
Number of tags in a window: 1
Number of islands found: 9343
Parsing Input file(s):
/mnt/cargo/endrebak/control.sorted.filtered.bam
Valid Background reads: 1949548 (1949548 before out of bounds removal)
I still am unable to reproduce :) So I will stick to my plan of implementing good reporting in the case of errors.
I am using version 0.0.41 and linux. What about you people?
Now a 0.0.42 is out which should give a better error message when it happens :).
Strange! All my samples pass epic2 with version 0.0.42! Thx a lot for your support!
Indexplicable weirdness is common 🙈 Thanks for reporting.
It might have been that I forgot to raise an exception if I caught one. Were any error messages?
Look for an error message like this:
except OverflowError as e:
print(e)
print("\nAdditional info:\n")
print("Chromosome:", chromosome)
print("Chromsize:", chromsize)
0.0.43
(hotfix) should raise an error if OverflowErrors happen again. These were noisily ignored in 0.0.42
I am still getting this error with epic2 0.0.42... though it finishes and gives some additional info:
(base) millerh1@cbbi16:~/Bishop.lab/Projects/RSeq$ epic2 -t RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed -e 100 -fdr .01 --effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed
Found a median fragment size of 183.5
Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.
Using an effective genome length of ~2167 * 1e6
Parsing ChIP file(s):
RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed
Python int too large to convert to C long
Additional info:
Chromosome: chr2_GL582966v2_alt
Chromsize: 96131
Valid ChIP reads: 24120376 (24125321 before out of bounds removal)
Score threshold: 45.687
Number of tags in a window: 4
Number of islands found: 5165
Empirical estimate of FDR is: 0.01936108422071636 (100/5165)
The chrom.sizes are here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
I'm hosting the bed file here: https://hem-misc.s3-us-west-2.amazonaws.com/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed
Thank you!
Henry
I will look at it. Thanks!
On Fri, Aug 14, 2020 at 10:47 PM Henry Miller notifications@github.com wrote:
I am still getting this error with epic2 0.0.42... though it finishes and gives some additional info:
(base) millerh1@cbbi16:~/Bishop.lab/Projects/RSeq$ epic2 -t RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed -e 100 -fdr .01 --effective-genome-fraction 0.6753009149366569 --mapq 30 --chromsizes RSeq_out/tmp/hg38.chrom.sizes --output RSeq_out/peaks_epic_unstranded/SRX2481503_S96_DRIP_Seq_1_hg38.bed Found a median fragment size of 183.5
Using chromosome sizes found in RSeq_out/tmp/hg38.chrom.sizes.
Using an effective genome length of ~2167 * 1e6
Parsing ChIP file(s): RSeq_out/tmp/pe_beds/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed Python int too large to convert to C long
Additional info:
Chromosome: chr2_GL582966v2_alt Chromsize: 96131
Valid ChIP reads: 24120376 (24125321 before out of bounds removal)
Score threshold: 45.687
Number of tags in a window: 4
Number of islands found: 5165
Empirical estimate of FDR is: 0.01936108422071636 (100/5165)
The chrom.sizes are here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
I'm hosting the bed file here: https://hem-misc.s3-us-west-2.amazonaws.com/SRX2481503_S96_DRIP_Seq_1.hg38.experiment.bed
Thank you!
Henry
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biocore-ntnu/epic2/issues/36#issuecomment-674265922, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEHURUWDMIUJYM2YXMY5GBLSAWPGDANCNFSM4OYOFEEQ .
What about now? 0.0.44 is out :)
Well the number of reads and peaks does not change, but the error disappeared! I'm assuming that this is alright now. Thanks!
No problem. There must have been some weirdness with the Cython and C++ interop is my guess :)
Hi Endre,
I sequentially used this call
epic2 -bin 200 -g 3 -fdr 0.05 -t $TREATMENT -c $CONTROL -cs ../4_epic2/Bos_taurus.ARS-UCD1.2.dna.toplevel.chromsizes
and epic2 stops with an error message for some of my treatments.Do you have an idea of how to continue?
This is the typical error (console output):
I checked the chromosome sizes and they are all fine! Maybe the BWA mappings ahead cause the error ... but why this error occurs only for a few treatments?
Hope you can help ... or give an idea to continue.
Greetings from Germany,
Frieder