dhimmel / integrate

Scripts and resources to create Hetionet v1.0, a heterogeneous network for drug repurposing
https://doi.org/10.15363/thinklab.4

Duplicate edges in Hetionet #13

Open · veleritas opened this issue 7 years ago

veleritas commented 7 years ago

Hi Daniel,

Just wanted to note that there are still duplicate edges in Hetionet with the newest integrate.ipynb. Specifically, the following two relationship types raise duplicate-edge errors when the notebook is run:

Disease-gene differential expression edges

# pandas and the rawgit() URL helper are defined earlier in integrate.ipynb
commit = '1a11633b5e0095454453335be82012a9f0f482e4'
url = rawgit('dhimmel', 'stargeo', commit, 'data/diffex.tsv')
stargeo_df = pandas.read_table(url)
# Filter to at most 250 up and 250 down-regulated genes per disease
stargeo_df = stargeo_df.groupby(['slim_id', 'direction']).apply(
    lambda df: df.nsmallest(250, 'p_adjusted')).reset_index(drop=True)
stargeo_df.head(2)

# Add one Disease-Gene regulation edge per differentially expressed gene
for row in stargeo_df.itertuples():
    source_id = 'Disease', row.slim_id
    target_id = 'Gene', row.entrez_gene_id
    kind = row.direction + 'regulates'  # 'up'/'down' -> 'upregulates'/'downregulates'
    data = {
        'source': 'STARGEO',
        'log2_fold_change': round(row.log2_fold_change, 5),
        'unbiased': True,
        'license': 'CC0 1.0'
    }
    graph.add_edge(source_id, target_id, kind, 'both', data)

LINCS Compound-gene dysregulation edges

url = rawgit('dhimmel', 'lincs', commit, 'data/consensi/signif/dysreg-drugbank.tsv')
l1000_df = pandas.read_table(url)
l1000_df = l1000_df.query("perturbagen in @compound_df.drugbank_id and entrez_gene_id in @coding_genes")
l1000_df = filter_l1000_df(l1000_df, n=125)
l1000_df.tail(2)

# Map each direction to its edge kind, then add one Compound-Gene edge per row
mapper = {'up': 'upregulates', 'down': 'downregulates'}
for row in l1000_df.itertuples():
    source_id = 'Compound', row.perturbagen
    target_id = 'Gene', row.entrez_gene_id
    data = {
        'source': 'LINCS L1000',
        'z_score': round(row.z_score, 3),
        'method': row.status,
        'unbiased': True,
    }
    kind = mapper[row.direction]
    graph.add_edge(source_id, target_id, kind, 'both', data)

Also, is the number of metapaths supposed to grow exponentially with the number of metaedges in the network? I noticed that if I leave the following metaedges out of the network but include everything else, the number of metapaths drops from 1200 to only 130:

['Compound', 'Disease', 'palliates', 'both']
['Compound', 'Gene', 'downregulates', 'both']
['Compound', 'Gene', 'upregulates', 'both']
['Disease', 'Gene', 'downregulates', 'both']
['Disease', 'Gene', 'upregulates', 'both']

The four regulation metaedges were excluded because of the edge import errors, and the palliates one because I excluded it for testing purposes.

veleritas commented 7 years ago

At the moment I'm bypassing the error by enclosing the add_edge() call in a try/except block, and it seems to work fine. Including the up/down-regulation edges increased the number of metapaths to ~900, so the growth does seem to be exponential.
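
For concreteness, here is roughly what my workaround looks like for the LINCS loop (the same pattern applies to the STARGEO loop); it skips duplicates instead of halting, but doesn't address the root cause:

mapper = {'up': 'upregulates', 'down': 'downregulates'}
for row in l1000_df.itertuples():
    source_id = 'Compound', row.perturbagen
    target_id = 'Gene', row.entrez_gene_id
    data = {
        'source': 'LINCS L1000',
        'z_score': round(row.z_score, 3),
        'method': row.status,
        'unbiased': True,
    }
    kind = mapper[row.direction]
    try:
        graph.add_edge(source_id, target_id, kind, 'both', data)
    except AssertionError:
        # edge already exists; skip the duplicate row
        pass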

dhimmel commented 7 years ago

Specifically, the following two relationship types raise duplicate-edge errors

@veleritas you're getting the AssertionError: edge already exists? I just reinstalled my integrate conda environment and tried out the two metaedges that were giving you trouble. I didn't get any errors. One possibility is that you ran those notebook cells multiple times? Every repeat execution of a cell containing graph.add_edge will now cause an error.
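
For reference, this is the failure mode in isolation (a minimal sketch with example identifiers, assuming the hetio API as used in the notebook; not code from the notebook):

import hetio.hetnet

metaedge_tuples = [('Disease', 'Gene', 'upregulates', 'both')]
metagraph = hetio.hetnet.MetaGraph.from_edge_tuples(metaedge_tuples)
graph = hetio.hetnet.Graph(metagraph)
graph.add_node('Disease', 'DOID:14330')
graph.add_node('Gene', 5071)

# The first call succeeds; re-executing the cell repeats it and trips
# the duplicate-edge assertion.
graph.add_edge(('Disease', 'DOID:14330'), ('Gene', 5071), 'upregulates', 'both')
graph.add_edge(('Disease', 'DOID:14330'), ('Gene', 5071), 'upregulates', 'both')  # AssertionError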

At the moment I'm bypassing the error by enclosing the add_edge() call in a try/except block, and it seems to work fine.

Hopefully we can diagnose your issue, so you can remove the error handling here.

Also, is the number of metapaths supposed to grow exponentially with the number of metaedges in the network? I noticed that if I leave the following metaedges out of the network but include everything else, the number of metapaths drops from 1200 to only 130.

It's a combinatorial explosion! Not sure if that counts as exponential. The reason the 5 metaedges you mention have such a huge effect on the total number of possible metapaths is that they connect genes, compounds, and diseases, which also have lots of other metaedges. In the future, I could see some heuristic method that only computed DWPCs for metapaths that were likely to provide novel information.
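
To see the combinatorial growth concretely, here is a toy sketch (made-up metaedges; assumes hetio's MetaGraph.from_edge_tuples and extract_metapaths behave as in the notebook) that counts Compound-Disease metapaths with and without the regulation metaedges:

import hetio.hetnet

base = [
    ('Compound', 'Disease', 'treats', 'both'),
    ('Compound', 'Gene', 'binds', 'both'),
    ('Disease', 'Gene', 'associates', 'both'),
]
regulation = [
    ('Compound', 'Gene', 'upregulates', 'both'),
    ('Compound', 'Gene', 'downregulates', 'both'),
    ('Disease', 'Gene', 'upregulates', 'both'),
    ('Disease', 'Gene', 'downregulates', 'both'),
]

def n_metapaths(metaedge_tuples, max_length=3):
    # Count all Compound-to-Disease metapaths up to the given length
    metagraph = hetio.hetnet.MetaGraph.from_edge_tuples(metaedge_tuples)
    return len(metagraph.extract_metapaths('Compound', 'Disease', max_length))

print(n_metapaths(base))               # few metapaths
print(n_metapaths(base + regulation))  # many more: every Compound-Gene metaedge
                                       # now pairs with every Gene-Disease metaedge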

veleritas commented 7 years ago

So I went back to try to pin down why we're getting different results. On a fresh Ubuntu 16.04 instance, I confirmed that integrate.ipynb runs fine, without the edge already exists AssertionError, using the conda environment specified by https://github.com/dhimmel/integrate/blob/master/environment.yml

(I am using Anaconda 4.3.1 for these tests).

However, if you update the packages in the integrate environment with conda update --all, the notebook breaks on the two edge types I mentioned in the first comment. It seems strange to me that updating Python dependencies would break the integrate code at this point, but this should probably be classified as a bug?

Here's the environment.yml file dump after the conda update command:

name: integrate
channels:
- defaults
dependencies:
- bleach=1.5.0=py35_0
- cycler=0.10.0=py35_0
- dbus=1.10.10=0
- decorator=4.0.11=py35_0
- entrypoints=0.2.2=py35_1
- et_xmlfile=1.0.1=py35_0
- expat=2.1.0=0
- fontconfig=2.12.1=3
- freetype=2.5.5=2
- glib=2.50.2=1
- gst-plugins-base=1.8.0=0
- gstreamer=1.8.0=0
- html5lib=0.999=py35_0
- icu=54.1=0
- ipykernel=4.5.2=py35_0
- ipython=5.3.0=py35_0
- ipython_genutils=0.1.0=py35_0
- ipywidgets=6.0.0=py35_0
- jdcal=1.3=py35_0
- jinja2=2.9.5=py35_0
- jpeg=9b=0
- jsonschema=2.5.1=py35_0
- jupyter=1.0.0=py35_1
- jupyter_client=5.0.0=py35_0
- jupyter_console=5.1.0=py35_0
- jupyter_core=4.3.0=py35_0
- libffi=3.2.1=1
- libgcc=5.2.0=0
- libgfortran=3.0.0=1
- libiconv=1.14=0
- libpng=1.6.27=0
- libsodium=1.0.10=0
- libxcb=1.12=1
- libxml2=2.9.4=0
- markupsafe=0.23=py35_2
- matplotlib=2.0.0=np112py35_0
- mistune=0.7.4=py35_0
- mkl=2017.0.1=0
- nbconvert=5.1.1=py35_0
- nbformat=4.3.0=py35_0
- notebook=4.4.1=py35_0
- numexpr=2.6.2=np112py35_0
- numpy=1.12.1=py35_0
- openssl=1.0.2k=1
- pandas=0.19.2=np112py35_1
- pandocfilters=1.4.1=py35_0
- path.py=10.1=py35_0
- pcre=8.39=1
- pexpect=4.2.1=py35_0
- pickleshare=0.7.4=py35_0
- pip=9.0.1=py35_1
- prompt_toolkit=1.0.13=py35_0
- ptyprocess=0.5.1=py35_0
- pygments=2.2.0=py35_0
- pyparsing=2.1.4=py35_0
- pyqt=5.6.0=py35_2
- python=3.5.3=1
- python-dateutil=2.6.0=py35_0
- pytz=2016.10=py35_0
- pyzmq=16.0.2=py35_0
- qt=5.6.2=3
- qtconsole=4.2.1=py35_1
- readline=6.2=2
- requests=2.13.0=py35_0
- scipy=0.19.0=np112py35_0
- seaborn=0.7.1=py35_0
- setuptools=27.2.0=py35_0
- simplegeneric=0.8.1=py35_1
- sip=4.18=py35_0
- six=1.10.0=py35_0
- sqlite=3.13.0=0
- terminado=0.6=py35_0
- testpath=0.3=py35_0
- tk=8.5.18=0
- tornado=4.4.2=py35_0
- traitlets=4.3.2=py35_0
- wcwidth=0.1.7=py35_0
- wheel=0.29.0=py35_0
- widgetsnbextension=2.0.0=py35_0
- xlsxwriter=0.9.6=py35_0
- xz=5.2.2=1
- zeromq=4.1.5=0
- zlib=1.2.8=3
- pip:
  - et-xmlfile==1.0.1
  - hetio==0.2.3
  - ipython-genutils==0.1.0
  - jupyter-client==5.0.0
  - jupyter-console==5.1.0
  - jupyter-core==4.3.0
  - prompt-toolkit==1.0.13
  - py2neo==2.0.8
  - tqdm==4.11.2
prefix: /home/ubuntu/anaconda3/envs/integrate

dhimmel commented 7 years ago

My guess is that some pandas behavior has changed.

Can you see which rows are duplicated using the following:

l1000_df[l1000_df.duplicated(['perturbagen', 'entrez_gene_id'], keep=False)]
stargeo_df[stargeo_df.duplicated(['slim_id', 'entrez_gene_id'], keep=False)]
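
If duplicates do turn up there, one stopgap that's cleaner than error handling is to drop them before the add_edge loops (a sketch; the key columns are my assumption based on the edge definitions above):

stargeo_df = stargeo_df.drop_duplicates(['slim_id', 'entrez_gene_id', 'direction'])
l1000_df = l1000_df.drop_duplicates(['perturbagen', 'entrez_gene_id', 'direction'])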

It seems strange to me that updating Python dependencies would break the integrate code at this point, but this should probably be classified as a bug?

Version changes frequently break things! If you want to update a dependency for an existing codebase, I'd do it one package at a time, and carefully. I wouldn't recommend conda update --all in these instances. Different codebases have different compatibility needs. For example, dhimmel/hetio targets Python 3.4+, but for a scripted analysis like dhimmel/integrate it usually makes sense to pick a single environment and stick with it.

That being said, I'm happy to implement a forward-compatible syntax if we can figure out what the bug is.

veleritas commented 7 years ago

I can try to figure out what changed to cause these duplicate edges, but that will probably take a few days as I work through other priorities.

dhimmel commented 7 years ago

I can try to figure out what changed to cause these duplicate edges, but that will probably take a few days as I work through other priorities.

Up to you. The motivation to diagnose it rather than rely on error handling is the possibility that it's part of a bigger problem... but if you're getting the expected number of edges, it's probably not a huge issue.