YeoLab / outrigger

Create a *de novo* alternative splicing database, validate splicing events, and quantify percent spliced-in (Psi) from RNA seq data
http://yeolab.github.io/outrigger/
BSD 3-Clause "New" or "Revised" License

UNIQUE id error on tasic2016_big dataset #69

Closed olgabot closed 7 years ago

olgabot commented 7 years ago

Description

An error occurs when adding novel exons to the gffutils.FeatureDB during outrigger index: sqlite3 raises a UNIQUE constraint failure on features.id.

Steps to Reproduce

The v1.0.0rc1 branch includes additional SJ.out.tab files for testing. With these files, the error occurs when updating the gffutils database with novel exons on chromosome 4:

$ python -m pdb outrigger/commandline.py index --sj-out-tab outrigger/tests/data/tasic2016/unprocessed/sj_out_tab/originals/CAV_LP_Ipsi_tdTpos_cell_1*SJ.out.tab --gtf outrigger/tests/data/tasic2016/unprocessed/gtf/gencode.vM10.annotation.subset.gtf --output $OUTPUT  
... lots of output ...
2017-01-04 12:34:53     Finding novel exons that are <=100nt between two junctions on chromosome chr4 ...
2017-01-04 12:35:03         Done.
2017-01-04 12:35:03         Filtering for only novel exons on chromosome chr4 ...
2017-01-04 12:35:03             Done.
2017-01-04 12:35:03         Creating gffutils.Feature objects for each novel exon, plus potentially its overlapping gene
2017-01-04 12:35:04             Done.
2017-01-04 12:35:04         Updating gffutils database with 1300 novel exons on chromosome chr4 ...
Traceback (most recent call last):
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/site-packages/gffutils/create.py", line 991, in _update_relations
    self._insert(f, c)
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/site-packages/gffutils/create.py", line 520, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/pdb.py", line 1661, in main
    pdb._runscript(mainpyfile)
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/pdb.py", line 1542, in _runscript
    self.run(statement)
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/bdb.py", line 431, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 3, in <module>
    import argparse
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 1057, in main
    cl = CommandLine(sys.argv[1:])
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 343, in __init__
    self.args.func()
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 347, in index
    index.execute()
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 754, in execute
    metadata, db)
  File "/Users/olga/workspace-git/outrigger/outrigger/commandline.py", line 618, in make_exon_junction_adjacencies
    exon_junction_adjacencies.detect_exons_from_junctions()
  File "/Users/olga/workspace-git/outrigger/outrigger/index/adjacencies.py", line 324, in detect_exons_from_junctions
    transform=transform)
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/site-packages/gffutils/interface.py", line 853, in update
    db._update_relations()
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/site-packages/gffutils/create.py", line 993, in _update_relations
    fixed, final_strategy = self._do_merge(f, 'merge')
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/site-packages/gffutils/create.py", line 288, in _do_merge
    self._add_duplicate(orig_id, uniqued_feature.id)
  File "/Users/olga/anaconda3/envs/outrigger/lib/python3.5/site-packages/gffutils/create.py", line 360, in _add_duplicate
    (idspecid, newid))
sqlite3.IntegrityError: UNIQUE constraint failed: duplicates.newid
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /Users/olga/anaconda3/envs/

Expected behavior: outrigger index completes without error.

Actual behavior: outrigger index aborts with the sqlite3.IntegrityError shown above.
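
The first exception in the traceback is SQLite rejecting a second row that reuses an existing primary key. Below is a minimal sketch of that class of failure, using only the standard-library sqlite3 module. The two-column features table and the exon ID are simplified, hypothetical stand-ins, not the real gffutils schema or outrigger's ID format:

```python
import sqlite3

# Simplified stand-in for the gffutils "features" table (assumption:
# the real schema has more columns; only the primary key matters here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")


def insert_feature(conn, feature_id, featuretype="exon"):
    """Insert one row; return None on success or the error message on failure."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO features (id, featuretype) VALUES (?, ?)",
                (feature_id, featuretype),
            )
        return None
    except sqlite3.IntegrityError as error:
        return str(error)


# Hypothetical novel-exon ID, inserted twice.
first = insert_feature(conn, "exon:chr4:100-200:+")
second = insert_feature(conn, "exon:chr4:100-200:+")
print(first)
print(second)
```

The first insert succeeds and the second is rejected by the uniqueness check on the id column, which is the same constraint named in the traceback.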

Versions

$ git log
commit bc9d7505313bf49f59eb5b4a045678ded3ca001f
Author: Olga Botvinnik <olga.botvinnik@gmail.com>
Date:   Wed Jan 4 12:34:12 2017 -0800

    Fix typo in log message

commit 03b2e8682b73a02bd9a74094ea57b24b727671af
Author: Olga Botvinnik <olga.botvinnik@gmail.com>
Date:   Wed Jan 4 12:33:55 2017 -0800

    Add notes about un-parallelizing novel exon finding

$ outrigger --version
outrigger 1.0.0rc1

ghost commented 7 years ago

Description

Tested outrigger index in a directory with 18 SJ.out.tab files and got the same error, along with some dtype warnings. Run locally on an Ubuntu 14.04 system with 32 GB of RAM.

Version

$ outrigger --version
outrigger 1.0.0

Terminal output

(outrigger-env) /.../star_sjout$ outrigger index --sj-out-tab *SJ.out.tab --gtf /.../Mus_musculus.GRCm38.84.gtf
2017-04-13 12:32:41 Creating folder ./outrigger_output ...
2017-04-13 12:32:41     Done.
2017-04-13 12:32:41 Creating folder ./outrigger_output/index ...
2017-04-13 12:32:41     Done.
2017-04-13 12:32:41 Creating folder ./outrigger_output/index/gtf ...
2017-04-13 12:32:41     Done.
2017-04-13 12:32:41 Creating folder ./outrigger_output/junctions ...
2017-04-13 12:32:41     Done.
2017-04-13 12:32:41 Reading SJ.out.files and creating a big splice junction table of reads spanning exon-exon junctions...
/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/joblib/parallel.py:131: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  return [func(*args, **kwargs) for func, args, kwargs in self.items]
/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/joblib/parallel.py:131: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  return [func(*args, **kwargs) for func, args, kwargs in self.items]
/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/joblib/parallel.py:131: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  return [func(*args, **kwargs) for func, args, kwargs in self.items]
/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/joblib/parallel.py:131: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  return [func(*args, **kwargs) for func, args, kwargs in self.items]
/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/joblib/parallel.py:131: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  return [func(*args, **kwargs) for func, args, kwargs in self.items]
/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/joblib/parallel.py:131: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  return [func(*args, **kwargs) for func, args, kwargs in self.items]
/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/joblib/parallel.py:131: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  return [func(*args, **kwargs) for func, args, kwargs in self.items]
/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/joblib/parallel.py:131: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  return [func(*args, **kwargs) for func, args, kwargs in self.items]
2017-04-13 12:33:19 Writing ./outrigger_output/junctions/reads.csv ...

2017-04-13 12:33:37     Done.
2017-04-13 12:33:37 Filtering for only junctions with minimum 10 reads ...
2017-04-13 12:33:44     134088/451740 junctions remain after filtering out 317652 junctions with < 10 reads.
2017-04-13 12:33:44     Done.
2017-04-13 12:33:44 Creating splice junction metadata of merely where junctions start and stop
2017-04-13 12:33:45     Done.
2017-04-13 12:33:45 Writing metadata of junctions to ./outrigger_output/junctions/metadata.csv ...
2017-04-13 12:33:46     Done.
2017-04-13 12:33:46 Found GTF file in /home/hnasko-lab/Documents/genomes/Mus_musculus.GRCm38.84.gtf
2017-04-13 12:33:46 Creating a "gffutils" database ./outrigger_output/index/gtf/Mus_musculus.GRCm38.84.gtf.db ...
2017-04-13 12:42:11,733 - INFO - Committing changes: 1589000 features
INFO:gffutils.create:Committing changes
2017-04-13 12:42:20,565 - INFO - Populating features table and first-order relations: 1589641 features
INFO:gffutils.create:Populating features table and first-order relations: 1589641 features
2017-04-13 12:42:20,566 - INFO - Creating relations(parent) index
INFO:gffutils.create:Creating relations(parent) index
2017-04-13 12:42:22,698 - INFO - Creating relations(child) index
INFO:gffutils.create:Creating relations(child) index
2017-04-13 12:42:25,457 - INFO - Creating features(featuretype) index
INFO:gffutils.create:Creating features(featuretype) index
2017-04-13 12:42:26     Done.
2017-04-13 12:42:26     Looking up which exons are already defined ...
2017-04-13 12:42:27         Done.
2017-04-13 12:42:27 Detecting de novo exons based on gaps between junctions ...
2017-04-13 12:42:27     Finding all exons on chromosome 1 ...
2017-04-13 12:43:22         Done.
2017-04-13 12:43:22         Filtering for only novel exons on chromosome 1 ...
2017-04-13 12:43:22             Done.
2017-04-13 12:43:22         Creating gffutils.Feature objects for each novel exon, plus potentially its overlapping gene
2017-04-13 12:43:25             Done.
2017-04-13 12:43:25         Updating gffutils database with 57 novel exons on chromosome 1 ...
2017-04-13 12:44:20             Done.
2017-04-13 12:44:20     Finding all exons on chromosome 10 ...
2017-04-13 12:44:59         Done.
2017-04-13 12:44:59         Filtering for only novel exons on chromosome 10 ...
2017-04-13 12:44:59             Done.
2017-04-13 12:44:59         Creating gffutils.Feature objects for each novel exon, plus potentially its overlapping gene
2017-04-13 12:45:09             Done.
2017-04-13 12:45:09         Updating gffutils database with 86 novel exons on chromosome 10 ...
2017-04-13 15:01:10             Done.
2017-04-13 15:01:10     Finding all exons on chromosome 11 ...
2017-04-13 15:04:05         Done.
2017-04-13 15:04:05         Filtering for only novel exons on chromosome 11 ...
2017-04-13 15:04:05             Done.
2017-04-13 15:04:05         Creating gffutils.Feature objects for each novel exon, plus potentially its overlapping gene
2017-04-13 15:04:20             Done.
2017-04-13 15:04:20         Updating gffutils database with 130 novel exons on chromosome 11 ...
Traceback (most recent call last):
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/gffutils/create.py", line 981, in _update_relations
    self._insert(f, c)
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/gffutils/create.py", line 510, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/bin/outrigger", line 11, in <module>
    sys.exit(main())
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/outrigger/commandline.py", line 980, in main
    cl = CommandLine(sys.argv[1:])
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/outrigger/commandline.py", line 307, in __init__
    self.args.func()
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/outrigger/commandline.py", line 311, in index
    index.execute()
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/outrigger/commandline.py", line 705, in execute
    metadata, db)
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/outrigger/commandline.py", line 576, in make_exon_junction_adjacencies
    exon_junction_adjacencies.detect_exons_from_junctions()
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/outrigger/index/adjacencies.py", line 227, in detect_exons_from_junctions
    transform=transform)
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/gffutils/interface.py", line 827, in update
    db._update_relations()
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/gffutils/create.py", line 983, in _update_relations
    fixed, final_strategy = self._do_merge(f, 'merge')
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/gffutils/create.py", line 288, in _do_merge
    self._add_duplicate(orig_id, uniqued_feature.id)
  File "/home/hnasko-lab/anaconda2/envs/outrigger-env/lib/python3.5/site-packages/gffutils/create.py", line 360, in _add_duplicate
    (idspecid, newid))
sqlite3.IntegrityError: UNIQUE constraint failed: duplicates.newid
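
The DtypeWarning lines earlier in this output come from pandas and concern column 0, the chromosome column, which mixes numeric names such as 1 and 10 with non-numeric ones. The sketch below shows one way to silence that warning by reading the column as strings. It is not necessarily how outrigger reads these files, and the function name and column labels are illustrative choices, not outrigger's own:

```python
import io

import pandas as pd

# Illustrative labels for the nine columns of a STAR SJ.out.tab file.
SJ_COLUMNS = [
    "chrom", "intron_start", "intron_stop", "strand", "intron_motif",
    "annotated", "unique_junction_reads", "multimap_junction_reads",
    "max_overhang",
]


def read_sj_out_tab(path_or_buffer):
    """Read a tab-separated junction table, forcing chromosome names to str."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=SJ_COLUMNS,
        dtype={"chrom": str},  # "1" and "X" are both kept as strings
    )


# Two made-up rows standing in for a real SJ.out.tab file.
example = io.StringIO(
    "1\t100\t200\t1\t1\t1\t12\t0\t30\n"
    "X\t300\t400\t2\t1\t0\t15\t1\t25\n"
)
table = read_sj_out_tab(example)
print(table[["chrom", "intron_start", "intron_stop"]])
```

The warnings are separate from the IntegrityError: they concern how the junction files are parsed, while the crash happens later during the database update.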

olgabot commented 7 years ago

Here's more information on the commands and output:

https://gist.github.com/olgabot/f51b795b62c71f2b2cdb8cd586bdaef4

I'm working on a fix; we'll see whether it works. If not, I think this will be resolved by revamping the command line inputs to be more explicit (https://github.com/YeoLab/outrigger/issues/78), which should avoid clashes between databases.
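
One generic way to guard against this kind of ID clash is to compare candidate feature IDs against those already stored before attempting the update. The sketch below illustrates that idea only and is not the fix adopted in outrigger. The helper name and the exon ID strings are hypothetical:

```python
def split_new_and_clashing(existing_ids, candidate_ids):
    """Partition candidate IDs into (new, clashing), preserving order.

    An ID counts as clashing if it is already in existing_ids or if it
    appeared earlier in candidate_ids.
    """
    seen = set(existing_ids)
    new, clashing = [], []
    for feature_id in candidate_ids:
        if feature_id in seen:
            clashing.append(feature_id)
        else:
            new.append(feature_id)
            seen.add(feature_id)
    return new, clashing


# Hypothetical IDs: one already stored, one novel but listed twice.
existing = {"exon:chr11:100-200:+"}
candidates = [
    "exon:chr11:100-200:+",
    "exon:chr11:300-400:-",
    "exon:chr11:300-400:-",
]
new_ids, clashing_ids = split_new_and_clashing(existing, candidates)
print(new_ids)
print(clashing_ids)
```

Only the IDs in new_ids would then be passed on to the database update, and the clashing ones could be logged for inspection.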