CollasLab / edd

Enriched Domain Detector for ChIP-seq data
https://pypi.python.org/pypi/edd
MIT License

error probably during reading of input bam #3

Closed: steffenheyne closed 9 years ago

steffenheyne commented 9 years ago

With some input bam files EDD runs fine, but with others (all similar bwa mappings) I get the error below. Any idea? I don't know whether the input bam or the signal bam is really the cause, but it looks like the input bam.

...
[2015-07-08 09:11:17.311074] NOTICE: edd: output dir: test_neu2
[2015-07-08 09:11:17.311432] NOTICE: edd: number of monte carlo trials: 10000
[2015-07-08 09:11:17.311785] NOTICE: edd: number of processes: 16
[2015-07-08 09:11:17.312134] NOTICE: edd: fdr lim: 0.050
[2015-07-08 09:11:17.312480] NOTICE: edd: gap penalty: 2.00
[2015-07-08 09:11:17.312815] NOTICE: edd: bin_size: 1000 KB
[2015-07-08 09:11:17.313142] NOTICE: edd: unalignable regions file : uar.bed
[2015-07-08 09:11:17.342770] NOTICE: edd: Writing log ratios: True
[2015-07-08 09:11:17.343178] NOTICE: edd: EDD configuration file parameters:
[2015-07-08 09:11:17.343536] NOTICE: edd: ci_method:agresti_coull
[2015-07-08 09:11:17.343892] NOTICE: edd: fraq_ibins:0.99
[2015-07-08 09:11:17.344249] NOTICE: edd: log_ratio_bin_size:10000
[2015-07-08 09:11:17.344595] NOTICE: edd: ci_lim:0.25
[2015-07-08 09:11:17.435599] NOTICE: eddlib.experiment: loading bam files
Traceback (most recent call last):
  File "/package/edd-1.1.13/bin/edd", line 146, in <module>
    main(args, config)
  File "/package/edd-1.1.13/bin/edd", line 60, in main
    loader.load_single_experiment(args.ip_bam, args.input_bam)
  File "/package/edd-1.1.13/lib/python2.7/site-packages/eddlib/experiment.py", line 165, in load_single_experiment
    self.exp = self.load_bam(ip_name, ctrl_name)
  File "/package/edd-1.1.13/lib/python2.7/site-packages/eddlib/experiment.py", line 135, in load_bam
    use_multiprocessing=True)
  File "/package/edd-1.1.13/lib/python2.7/site-packages/eddlib/experiment.py", line 48, in load_experiment
    ipd, inputd = fmap(f, [ip_bam_path, input_bam_path])
  File "/package/edd-1.1.13/lib/python2.7/site-packages/eddlib/experiment.py", line 44, in <lambda>
    fmap = lambda g, xs: pool.map_async(g, xs).get(99999999)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
AssertionError

The input bam files are big, but the working ones are in the same size range. Flagstat output for the failing one:

170700080 + 0 in total (QC-passed reads + QC-failed reads)
20352591 + 0 duplicates
167494222 + 0 mapped (98.12%:-nan%)
170700080 + 0 paired in sequencing
85350040 + 0 read1
85350040 + 0 read2
159439464 + 0 properly paired (93.40%:-nan%)
165837944 + 0 with itself and mate mapped
1656278 + 0 singletons (0.97%:-nan%)
5338790 + 0 with mate mapped to a different chr
3845489 + 0 with mate mapped to a different chr (mapQ>=5)

eivindgl commented 9 years ago

Dear Steffen,

I think you are right: something unexpected about the input file causes the program to crash, and that should not happen. Sadly, the stack trace does not help much here. The bam-reading function is written in C (Cython), and the IP and input files are read in parallel using the multiprocessing module. One or both of the read operations crash, but the stack trace only reports up to the multiprocessing fork.
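As an illustration of why the trace stops at the fork, and one common way around it: the sketch below formats the traceback inside the worker and returns it as text, so the parent sees where the failure happened. This is not EDD's code; `safe_call`, `load_reads`, and `run_parallel` are hypothetical names, and `load_reads` is only a stand-in for a reader that raises an assertion on unexpected input.

```python
import multiprocessing
import traceback


def safe_call(task):
    """Run func(arg); return ("ok", result) or ("error", traceback text)."""
    func, arg = task
    try:
        return ("ok", func(arg))
    except Exception:
        # Format the traceback here, in the worker, where the frames still exist.
        return ("error", traceback.format_exc())


def load_reads(path):
    """Hypothetical stand-in for a bam reader that asserts on bad input."""
    if path.endswith("bad.bam"):
        raise AssertionError("read outside chromosome bounds in " + path)
    return len(path)


def run_parallel(func, items, processes=2):
    """Map func over items in worker processes, keeping worker tracebacks."""
    pool = multiprocessing.Pool(processes)
    try:
        return pool.map(safe_call, [(func, item) for item in items])
    finally:
        pool.close()
        pool.join()
```

Calling `run_parallel(load_reads, ["ip.bam", "bad.bam"])` from a script's main guard would return one `("ok", ...)` tuple and one `("error", ...)` tuple whose text names the failing function, rather than a bare `AssertionError` raised from `pool.py`.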

Could you try the following? Using samtools or something similar, copy the first/last million reads into a new smaller bam file. If we are lucky, then EDD also crashes with this smaller subset. If you could send me these smaller files, then I am sure I can figure out what's going on.
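One possible way to cut such subsets, sketched under the assumption that the reads are streamed as SAM text (for example from `samtools view -h`): keep every header line plus the first or last N alignment records. The function names `head_sam` and `tail_sam` are made up for this example.

```python
from collections import deque


def head_sam(lines, n_reads):
    """Yield all header lines plus the first n_reads alignment records."""
    kept = 0
    for line in lines:
        if line.startswith("@"):
            # SAM header lines start with "@" and precede all records.
            yield line
        elif kept < n_reads:
            kept += 1
            yield line
        else:
            break  # enough records; stop reading the stream


def tail_sam(lines, n_reads):
    """Return all header lines plus the last n_reads alignment records."""
    header = []
    last = deque(maxlen=n_reads)  # keeps only the most recent n_reads records
    for line in lines:
        if line.startswith("@"):
            header.append(line)
        else:
            last.append(line)
    return header + list(last)
```

If a script (say a hypothetical `head_sam.py`) ends with `sys.stdout.writelines(head_sam(sys.stdin, 1000000))`, it can sit in a pipe such as `samtools view -h input.bam | python head_sam.py | samtools view -b -o head.bam -`. Note that a subset only reproduces the crash if the offending read happens to fall inside it.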

steffenheyne commented 9 years ago

So far all downsampled bam files work. I would really like to see the bug solved, and I can make the large files available to you. Please let me know if you are interested and I will write you an email. Thanks, steffen

eivindgl commented 9 years ago

Great, just email me the download details and I'll fix it: gardlund at gmail. Sorry about the delay, but I am currently on holiday.

eivindgl commented 9 years ago

Hi Steffen, I think I solved it. I had added a requirement that a read must lie within the chromosome boundaries, but I guess some aligners allow reads to hang over either end of the chromosome if the match is good enough. In your case it was the read 39V34V1:131:C4HCTACXX:2:2214:2047:61825, aligned to the end of scaffold JH584295.1. This requirement was something I thought was useful during development, but all it did was trigger this bug report ;)
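For anyone who wants to look for such reads in their own files, here is a rough sketch that scans SAM text for alignments ending past the end of their reference sequence. It is not the check that was in EDD; `overhanging_reads` is a hypothetical helper, and it relies only on the SAM column layout, the `@SQ` `SN`/`LN` header tags, and the CIGAR operations that consume reference bases (M, D, N, =, X).

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_OPS = set("MDN=X")


def overhanging_reads(sam_lines):
    """Yield (read name, reference, 1-based end, reference length) for
    mapped reads whose alignment ends beyond the reference sequence."""
    lengths = {}
    for line in sam_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            if cols[0] == "@SQ":
                tags = dict(c.split(":", 1) for c in cols[1:])
                lengths[tags["SN"]] = int(tags["LN"])
            continue
        name, flag, ref, pos, cigar = (
            cols[0], int(cols[1]), cols[2], int(cols[3]), cols[5])
        if flag & 4 or cigar == "*" or ref not in lengths:
            continue  # unmapped read, or reference missing from the header
        span = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_OPS)
        end = pos + span - 1  # POS is 1-based, so this is the last covered base
        if end > lengths[ref]:
            yield name, ref, end, lengths[ref]
```

Fed with the output of `samtools view -h input.bam`, this streams through the file once and reports any read that hangs over the end of a chromosome or scaffold.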

I have now removed this check, and hopefully everything works if you update EDD to version 1.1.14:

pip install --upgrade edd

steffenheyne commented 9 years ago

Hi Eivind,

Great! Now it seems to work for our input files. Thanks a lot for looking into this issue! Best, steffen

eivindgl commented 9 years ago

Great!