dieterich-lab / circtools

circtools: a modular, python-based framework for circRNA-related tools that unifies several functionalities in a single, command line driven software.
http://circ.tools
GNU General Public License v3.0
25 stars 20 forks source link

Enrichment module produces external bedtools error when using -I *_utr #30

Closed tjakobi closed 7 years ago

tjakobi commented 7 years ago

The feature selection for exon and gene works without any errors. However, the exact same work flow fails after quite some time (~30 minutes) with a bedtools error:

/biosw/bedtools/git_unstable/bin/bedtools
Parsing annotation...

Processed 133938 entries
Done parsing annotation
Parsing annotation...

Processed 142387 entries
Done parsing annotation
Parsing BED input file...
Done parsing BED input file:
=> 294976 peaks, 45 nt average width
Parsing annotation...

Processed 58051 entries
Done parsing annotation
Parsing circular RNA input file...
Done parsing circular RNA input file:
=> 1956 circular RNAs, 11218 nt average (theoretical unspliced) length
Starting random shuffling of input peaks
Starting data acquisition from samplings
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 761, in random_sample_step
    circular_intersect = self.do_intersection(shuffled_peaks[iteration], circ_rna_bed)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 480, in do_intersection
    intersect_return = base_bed.intersect(query_bed, c=True)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/bedtool.py", line 806, in decorated
    result = method(self, *args, **kwargs)
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/bedtool.py", line 337, in wrapped
    decode_output=decode_output,
  File "/home/tjakobi/.local/lib/python3.5/site-packages/pybedtools/helpers.py", line 356, in call_bedtools
    raise BEDToolsError(subprocess.list2cmdline(cmds), stderr)
pybedtools.helpers.BEDToolsError:
Command was:

        bedtools intersect -c -b /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp -a /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.m5g3e86b.tmp

Error message was:
Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tjakobi//.local/bin/circtools", line 18, in <module>
    import circtools
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/__init__.py", line 2, in <module>
    main()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 30, in main
    CircTools()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 66, in __init__
    getattr(self, args.command)()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/circtools.py", line 187, in enrich
    enrich.run_module()
  File "/home/tjakobi/.local/lib/python3.5/site-packages/circtools/enrichment/enrichment_check.py", line 180, in run_module
    ), range(self.cli_params.num_iterations + 1))
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/biosw/python3/3.5.1/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
pybedtools.helpers.BEDToolsError:
Command was:

        bedtools intersect -c -b /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp -a /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.m5g3e86b.tmp

Error message was:
Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.

Command exited with non-zero status 1
        Command being timed: "circtools enrich -c /home/tjakobi/work/projects/circRNA/encode3/dcc_k562_hepg2_total_vs_cytosol//circTest/circRNA_HepG2_RNaseR_P_signif_1percFDR.csv -b /home/tjakobi/work/data/circtools/encode_hg38_clip_peaks/HepG2/combined/RBFOX2_HepG2_combined.bed -a /home/tjakobi/work/data/circtools/input/GRCh38.85.gtf -g /home/tjakobi/work/data/circtools/input/hg38.chrom.sizes -i 10000 -I three_prime_utr -I five_prime_utr -p 40 -P 1 -T 1 -o /home/tjakobi/work/data/circtools/output/encode/hepg2/utr// -F RBFOX2_HepG2_combined -t /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/"
        User time (seconds): 56499.98
        System time (seconds): 2763.36
        Percent of CPU this job got: 3341%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 29:33.64
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 148304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3629391
        Minor (reclaiming a frame) page faults: 352230918
        Voluntary context switches: 96145491
        Involuntary context switches: 4247851
        Swaps: 0
        File system inputs: 553612394
        File system outputs: 294311389
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1

Interestingly, while the first temporary bed files exists, the second one cannot be recovered after the crash. Also the number of columns is different for the different runs:

tjakobi@porta:[run]{0}> grep "Error: line number" slurm-90*
slurm-90457.out:Error: line number 210067 of file /scratch/tjakobi/circtools/EIF3D_HepG2_combined/pybedtools.5fffa2o0.tmp has 4 fields, but 6 were expected.
slurm-90457.out:Error: line number 210067 of file /scratch/tjakobi/circtools/EIF3D_HepG2_combined/pybedtools.5fffa2o0.tmp has 4 fields, but 6 were expected.
slurm-90458.out:Error: line number 216897 of file /scratch/tjakobi/circtools/HNRNPC_HepG2_combined/pybedtools.0f_wvxui.tmp has 4 fields, but 6 were expected.
slurm-90458.out:Error: line number 216897 of file /scratch/tjakobi/circtools/HNRNPC_HepG2_combined/pybedtools.0f_wvxui.tmp has 4 fields, but 6 were expected.
slurm-90459.out:Error: line number 280584 of file /scratch/tjakobi/circtools/QKI_HepG2_combined/pybedtools.1j18rnyf.tmp has 5 fields, but 6 were expected.
slurm-90459.out:Error: line number 280584 of file /scratch/tjakobi/circtools/QKI_HepG2_combined/pybedtools.1j18rnyf.tmp has 5 fields, but 6 were expected.
slurm-90460.out:Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.
slurm-90460.out:Error: line number 216606 of file /scratch/tjakobi/circtools/RBFOX2_HepG2_combined/pybedtools.604gh2qr.tmp has 3 fields, but 6 were expected.
tjakobi commented 7 years ago

I was able to recover one of the temporary files. Indeed, the file is missing a column as suggested by bedtools:

20      35705295        35705333        QKI_HepG2_rep02 200     +
1       27407577        27407615        QKI_HepG2_rep02 200     +
15      84869291        84869329        QKI_HepG2_rep02 200     +
5       149632806       149632843       QKI_HepG2_rep02 20

It seems the file is only partially complete since it is ending in the middle of a value.

tjakobi commented 7 years ago

A cluster wide search for possible segfaults of bedtools showed segfaults on 4 machines, which correlates with the number of failed jobs (also 4):

[2109293.051118] traps: bedtools[38431] trap divide error ip:4fba16 sp:7ffce107dda0 error:0 in bedtools[400000+231000]
[2110512.508267] traps: bedtools[32043] trap divide error ip:4fba16 sp:7fff75ec9ce0 error:0 in bedtools[400000+231000]
[2111499.498531] traps: bedtools[36229] trap divide error ip:4fba16 sp:7fff9685b8c0 error:0 in bedtools[400000+231000]
[2111904.482610] traps: bedtools[37884] trap divide error ip:4fba16 sp:7ffdb865bca0 error:0 in bedtools[400000+231000]
[2109686.312481] traps: bedtools[37998] trap divide error ip:4fba16 sp:7ffcd48e1b70 error:0 in bedtools[400000+231000]
[2110037.443168] traps: bedtools[25038] trap divide error ip:4fba16 sp:7ffe1e124be0 error:0 in bedtools[400000+231000]
tjakobi commented 7 years ago

A second run with the same input files shows that even the one job that finished successfully in the first try now fails with a segfault, too. Additionally the line numbers in the temporary files are different between the runs.

tjakobi commented 7 years ago

The error in bedtools is caused in void BedShuffle::ChooseLocusFromInclusionFile(BED &bedEntry):

#ifdef USE_RAND
    CHRPOS randomStart = includeInterval->start + (rand() % includeInterval->size());
#else
    CHRPOS randomStart = includeInterval->start + ((long) mt_rand() % includeInterval->size());
#endif

The interval size is not checked to be > 0. Therefore, if the input bed file has intervals with 0-length, bedtools with fail with a division by 0.

I'll submit a bug report and probably a pull request to get that fixed.