YeoLab / outrigger

Create a *de novo* alternative splicing database, validate splicing events, and quantify percent spliced-in (Psi) from RNA-seq data
http://yeolab.github.io/outrigger/
BSD 3-Clause "New" or "Revised" License

Segmentation fault on large datasets #64

Open olgabot opened 7 years ago

olgabot commented 7 years ago

Description

Using an 11 GB reads.csv file of junction reads, there's a segmentation fault. This is for ~300 samples, and I expect datasets to be much bigger (10,000s of samples), so failing at this scale is unacceptable.

Traceback (most recent call last):
  File "/home/obotvinnik/anaconda/bin/outrigger", line 9, in <module>
    load_entry_point('outrigger', 'console_scripts', 'outrigger')()
  File "/home/obotvinnik/workspace-git/outrigger/outrigger/commandline.py", line 1033, in main
    cl = CommandLine(sys.argv[1:])
  File "/home/obotvinnik/workspace-git/outrigger/outrigger/commandline.py", line 334, in __init__
    self.args.func()
  File "/home/obotvinnik/workspace-git/outrigger/outrigger/commandline.py", line 346, in psi
    psi.execute()
  File "/home/obotvinnik/workspace-git/outrigger/outrigger/commandline.py", line 948, in execute
    junction_reads = self.csv()
  File "/home/obotvinnik/workspace-git/outrigger/outrigger/commandline.py", line 471, in csv
    low_memory=self.low_memory)
  File "/home/obotvinnik/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/obotvinnik/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 400, in _read
    data = parser.read()
  File "/home/obotvinnik/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/home/obotvinnik/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 849, in pandas.parser.TextReader.read (pandas/parser.c:10387)
  File "pandas/parser.pyx", line 937, in pandas.parser.TextReader._read_rows (pandas/parser.c:11556)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:26979)
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
/var/spool/torque/mom_priv/jobs/7217194.tscc-mgr.local.SC: line 15: 17589 Segmentation fault      outrigger psi

real    0m41.556s
user    0m9.573s
sys     0m2.458s

Steps to Reproduce

  1. Use a deeply sequenced dataset of ~300 samples
  2. outrigger index runs fine
  3. outrigger psi dies when reading in the junction reads.csv produced by outrigger index (see the sketch below, which isolates the failing read).
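
Since the traceback bottoms out in pandas, the failure can probably be reproduced without outrigger at all. A minimal sketch, assuming the reads table lives at outrigger's default output location (the path here is an assumption; adjust to your output directory):

```python
import pandas as pd

# Assumed location of the junction reads table written by `outrigger index`;
# adjust to your actual output directory.
reads_csv = "outrigger_output/junctions/reads.csv"

# Mirrors the read_csv call in outrigger/commandline.py that dies in the
# traceback above; on an ~11 GB file the C parser exhausts memory here.
junction_reads = pd.read_csv(reads_csv)
```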

Expected behavior: No segmentation fault; the junction reads should load without running out of memory.

Actual behavior: Ran out of memory on a compute node with 64 GB of RAM.
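
For context on the numbers: pandas can need several times the on-disk CSV size once columns are materialized in memory, so an 11 GB file plausibly overruns 64 GB. A rough way to estimate the in-memory footprint from a small sample (a sketch, using the same assumed path as above):

```python
import pandas as pd

# Read only the first 100k rows to estimate the per-row memory cost
# (path is an assumption, as above).
sample = pd.read_csv("outrigger_output/junctions/reads.csv", nrows=100000)

bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print("~%.0f bytes per row in memory" % bytes_per_row)
```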

Versions

Linux x64, outrigger version 1.0.0rc1

olgabot commented 7 years ago

I believe this can be fixed with the --low-memory flag, which is smarter about memory usage. I'm not sure how to handle the general case, because --low-memory is much slower and it isn't obvious to users that they need it. One option is to make --low-memory the default, and users who find it too slow can opt out with a --high-memory flag.
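
For what it's worth, here is a rough sketch of what a low-memory code path could look like: stream the CSV in chunks and fold each chunk into a running aggregate, so only one chunk is ever resident at a time. The path, column names, and chunk size below are illustrative assumptions, not outrigger's actual schema:

```python
import pandas as pd

# Sketch of a chunked, low-memory read. Path, column names, and chunk
# size are assumptions for illustration only.
reads_csv = "outrigger_output/junctions/reads.csv"

total = None
for chunk in pd.read_csv(reads_csv, chunksize=1000000,
                         usecols=["junction_id", "sample_id", "reads"],
                         dtype={"reads": "uint32"}):
    # Aggregate within the chunk, then merge into the running total,
    # so peak memory is bounded by roughly one chunk.
    part = chunk.groupby(["junction_id", "sample_id"])["reads"].sum()
    total = part if total is None else total.add(part, fill_value=0)
```

Chunking like this is the classic time-for-memory trade, which is consistent with --low-memory being noticeably slower; defaulting to it and letting users opt into the fast path seems like the safer failure mode.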