dgl-ke predict errors #232

aridf closed 2 years ago

aridf commented 2 years ago

I am trying to use dglke_predict. I first ran into the error reported in #209 and tried to implement the suggested fix, which was to install from source. To do this I created a new conda environment and installed from source using the instructions on the installation page. Once I had it installed from source, I tried to run my predict command again, which looks like this:

DGLBACKEND=pytorch dglke_predict \
    --model_path ckpts/RotatE_filename_4/ \
    --format 'h_r_t' \
    --data_files results/path/head.list results/path/rel.list results/path/tail.list \
    --raw_data \
    --entity_mfile results/path/entities.tsv \
    --rel_mfile results/path/relations.tsv \
    --topK 5 \
    --exec_mode 'batch_head'

Now I get this error:

Traceback (most recent call last):
  File "/opt/conda/bin/dglke_predict", line 33, in <module>
    sys.exit(load_entry_point('dglke==0.1.0.dev0', 'console_scripts', 'dglke_predict')())
  File "/opt/conda/lib/python3.9/site-packages/dglke-0.1.0.dev0-py3.9.egg/dglke/", line 215, in main
  File "/opt/conda/lib/python3.9/site-packages/dglke-0.1.0.dev0-py3.9.egg/dglke/models/", line 87, in load_model
KeyError: 'model_name'

Indeed, when I look into the config.json file I see a field "model": "RotatE". I presume this problem has emerged because I trained the model using dglke=0.1.2 and now I'm trying to predict using dglke=0.1.0... I guess this field name was changed?
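To make the mismatch concrete, here is a minimal sketch of checking which of the two keys a saved config carries. The helper names `find_model_name` and `load_config` are hypothetical and not part of dglke; the only facts taken from above are that the loader raised `KeyError: 'model_name'` and that the saved config.json holds `"model": "RotatE"`.

```python
import json
from pathlib import Path


def find_model_name(config):
    """Return (key, value) for whichever model-name key the config holds.

    'model_name' is the key named in the KeyError above; 'model' is the
    key actually present in the saved config.json. Both are checked so
    the same helper works on configs written with either key.
    """
    for key in ("model_name", "model"):
        if key in config:
            return key, config[key]
    raise KeyError("neither 'model_name' nor 'model' found in config")


def load_config(config_path):
    """Read a JSON config file into a dict (the path is supplied by the caller)."""
    return json.loads(Path(config_path).read_text())


# Example with the field reported above:
key, name = find_model_name({"model": "RotatE"})
print(f"model name stored under {key!r}: {name}")
```

This only reports which key is present. Whether renaming the key is enough for the older loader to accept the checkpoint is not something the traceback settles.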

Then I tried to go back and retrain the model using the same command that worked on dglke-0.1.2:

DGLBACKEND=pytorch dglke_train \
    --data_path results/path --dataset data \
    --data_files entities.tsv relations.tsv all_ctups_30.tsv --format udd_hrt \
    --model_name RotatE \
    --max_step 20000 --batch_size 512 --neg_sample_size 128 --neg_deg_sample --log_interval 100 \
    --hidden_dim 512 --gamma 175 --lr 0.1 -adv --regularization_coef 1.00E-9 \
    --gpu 0 -de

And I get this traceback:

/opt/conda/lib/python3.9/site-packages/dgl/ DGLWarning: Recommend creating graphs by `dgl.graph(data)` instead of `dgl.DGLGraph(data)`.
  return warnings.warn(message, category=category, stacklevel=1)
/opt/conda/lib/python3.9/site-packages/dgl/ DGLWarning: Keyword arguments ['readonly', 'multigraph', 'sort_csr'] are deprecated in v0.5, and can be safely removed in all cases.
  return warnings.warn(message, category=category, stacklevel=1)
|Train|: 2468118
Traceback (most recent call last):
  File "/opt/conda/bin/dglke_train", line 33, in <module>
    sys.exit(load_entry_point('dglke==0.1.0.dev0', 'console_scripts', 'dglke_train')())
  File "/opt/conda/lib/python3.9/site-packages/dglke-0.1.0.dev0-py3.9.egg/dglke/", line 145, in main
  File "/opt/conda/lib/python3.9/site-packages/dglke-0.1.0.dev0-py3.9.egg/dglke/dataloader/", line 410, in create_sampler
  File "/opt/conda/lib/python3.9/site-packages/dgl/contrib/sampling/", line 683, in __init__
    self._sampler = _CAPI_CreateUniformEdgeSampler(
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [21:13:02] /opt/dgl/include/dgl/packed_func_ext.h:117: Check failed: ObjectTypeChecker<TObjectRef>::Check(sptr.get()): Expected type graph.Graph but get graph.HeteroGraph
Stack trace:
  [bt] (0) /opt/conda/lib/python3.9/site-packages/dgl/ [0x7fb7b9afa03f]
  [bt] (1) /opt/conda/lib/python3.9/site-packages/dgl/ dgl::runtime::DGLArgValue::AsObjectRef<dgl::GraphRef>() const+0x264) [0x7fb7b9c7e4a4]
  [bt] (2) /opt/conda/lib/python3.9/site-packages/dgl/ [0x7fb7ba293690]
  [bt] (3) /opt/conda/lib/python3.9/site-packages/dgl/ [0x7fb7ba294264]
  [bt] (4) /opt/conda/lib/python3.9/site-packages/dgl/ [0x7fb7ba1edd48]
  [bt] (5) /opt/conda/lib/python3.9/site-packages/dgl/_ffi/_cy3/ [0x7fb7b96d9d1e]
  [bt] (6) /opt/conda/lib/python3.9/site-packages/dgl/_ffi/_cy3/ [0x7fb7b96da24b]
  [bt] (7) /opt/conda/bin/python3(_PyObject_MakeTpCall+0x37f) [0x559226c3769f]
  [bt] (8) /opt/conda/bin/python3(_PyEval_EvalFrameDefault+0x4c6) [0x559226cbd8b6]

Which I'm having trouble deciphering... Please advise.

classicsong commented 2 years ago

Which dgl version are you using? Can you try dglke-0.1.2 with dgl 0.4.3post2 for both training and prediction?

aridf commented 2 years ago

@classicsong Here's the full output of conda list for the conda environment in which I first received the error. I am running dglke-0.1.2 and dgl-cuda10.1 0.4.3post2, and this combination produced the error reported in #209. So my first attempt described above used these two versions, if I'm not mistaken.

aridf commented 2 years ago

Possibly related: What should head.list, tail.list, and rel.list look like? Currently these are text files with one entry per line.
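For reference, a small sketch of how such a file would be read under the layout described here, that is, plain text with one entry per line. Whether this is the layout dglke_predict expects is not confirmed anywhere in this thread, and `parse_entries` and `read_list` are hypothetical helpers for illustration only.

```python
from pathlib import Path


def parse_entries(text):
    """Split file contents into entries, one per line.

    Surrounding whitespace is stripped and blank lines are dropped, so a
    trailing newline at the end of the file does not produce an empty entry.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_list(path):
    """Read a list file such as head.list into a Python list of strings."""
    return parse_entries(Path(path).read_text())


# Example on in-memory text shaped like the files described above:
print(parse_entries("entity_a\nentity_b\n\nentity_c\n"))
```

Running `read_list` on each of the three files is a quick way to spot stray blank lines or whitespace before passing them to `--data_files`.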

aridf commented 2 years ago

@classicsong What version of dgl should I have installed if I'm using the version built from source?

classicsong commented 2 years ago

dgl 0.4.3post2

aridf commented 2 years ago

How can I install this? Per the dgl-ke docs I tried pip3 install dgl==0.4.3 and received the following:

ERROR: Could not find a version that satisfies the requirement dgl==0.4.3
ERROR: No matching distribution found for dgl==0.4.3

classicsong commented 2 years ago

pip install dgl==0.4.3post2 or pip install dgl-cu102==0.4.3post2 (cuxxx depends on your CUDA version).

aridf commented 2 years ago

After upgrading pip to version 21.2.4 I now get the following, more informative error when trying to install dgl:

ERROR: Could not find a version that satisfies the requirement dgl==0.4.3 (from versions: 0.1.0, 0.1.2, 0.1.3, 0.6.0, 0.6.0.post1, 0.6.1, 0.7a210406, 0.7a210407, 0.7a210408, 0.7a210409, 0.7a210410, 0.7a210412, 0.7a210413, 0.7a210414, 0.7a210415, 0.7a210416, 0.7a210420, 0.7a210421, 0.7a210422, 0.7a210423, 0.7a210424, 0.7a210425, 0.7a210426, 0.7a210427, 0.7a210429, 0.7a210501, 0.7a210503, 0.7a210506, 0.7a210507, 0.7a210508, 0.7a210511, 0.7a210512, 0.7a210513, 0.7a210514, 0.7a210515, 0.7a210517, 0.7a210518, 0.7a210519, 0.7a210520, 0.7a210525, 0.7a210527)
ERROR: No matching distribution found for dgl==0.4.3

It looks like 0.4.3 is not available in pip?
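A quick way to confirm this from the message itself is to pull the version list out of the error line and look for any 0.4.x release. The sketch below uses a shortened copy of the list above (the first six entries only), and `available_versions` is a hypothetical helper, not part of pip.

```python
import re


def available_versions(pip_error_line):
    """Extract the '(from versions: ...)' list from a pip error line."""
    match = re.search(r"\(from versions: (.*?)\)", pip_error_line)
    if match is None:
        return []
    return [version.strip() for version in match.group(1).split(",")]


# Shortened copy of the error above; the real list is much longer.
error = (
    "ERROR: Could not find a version that satisfies the requirement "
    "dgl==0.4.3 (from versions: 0.1.0, 0.1.2, 0.1.3, 0.6.0, 0.6.0.post1, 0.6.1)"
)

versions = available_versions(error)
has_0_4 = any(version.startswith("0.4.") for version in versions)
print(versions)
print("any 0.4.x release offered:", has_0_4)
```

The list pip prints contains only the releases that have a distribution compatible with the running interpreter and platform, so a missing version here does not by itself mean the release was never published.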

classicsong commented 2 years ago

I can install it through sudo pip3 install dgl==0.4.3post2 locally.

aridf commented 2 years ago

We resolved this by installing dgl==0.4.3.post2 and installing dgl-ke from source, with the Python version pinned to 3.8. The previous errors arose because my new environment was using Python 3.9.
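For anyone hitting the same thing, a minimal sketch of a check for the combination that worked here (Python 3.8 with dgl 0.4.3.post2). The function names are hypothetical. The distribution name passed to `installed_version` is an assumption: a CUDA build may be installed under a different name such as dgl-cu102, and the exact version string reported by the package metadata may differ from the one shown.

```python
import sys
from importlib import metadata  # available in the standard library from Python 3.8


def environment_problems(python_version, dgl_version):
    """Compare an environment against the combination reported to work above."""
    problems = []
    if tuple(python_version[:2]) != (3, 8):
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"Python {found} found, 3.8 expected")
    if dgl_version != "0.4.3.post2":
        problems.append(f"dgl {dgl_version} found, 0.4.3.post2 expected")
    return problems


def installed_version(distribution="dgl"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for problem in environment_problems(sys.version_info, installed_version()):
        print(problem)
```

Running this inside the conda environment before training or predicting gives an early warning when the interpreter or the dgl build has drifted from the working setup.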