awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

The dataset name in conf is wrong when users use their own dataset. #85

Closed zheng-da closed 4 years ago

zheng-da commented 4 years ago

Please refer to this for more details: https://github.com/awslabs/dgl-ke/issues/84#issuecomment-622510576

AlexMRuch commented 4 years ago

It may be nicer to add a datetime timestamp (e.g., model_dataset_20200507_hr_mn_sec, where 20200507 is year 2020, month 05, day 07) at the end of the output path instead of an incrementing integer. Right now, for hyperparameter tuning, I have about 20 logs that look about the same, and I can't easily remember which one I ran on Monday (== my best model).
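The suggested naming scheme could be sketched like this (a minimal illustration with a hypothetical helper name, not part of dgl-ke):

```python
from datetime import datetime

def timestamped_ckpt_dir(model_name, dataset):
    """Build a checkpoint directory name with a datetime suffix,
    e.g. ComplEx_SXSW2018_20200507_11_15_42, instead of an
    incrementing integer."""
    stamp = datetime.now().strftime("%Y%m%d_%H_%M_%S")
    return "{}_{}_{}".format(model_name, dataset, stamp)
```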

classicsong commented 4 years ago

This could be a good point. I think you can use --save_path as a workaround for now.
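For example, the workaround could look like the following sketch, reusing the flags from the training command later in this thread and stamping the save directory with the shell's `date`:

```shell
# Write checkpoints to a date-stamped directory via --save_path
# instead of the auto-numbered ckpts/ subfolder.
DGLBACKEND=pytorch dglke_train \
    --model_name ComplEx \
    --data_path results_SXSW_2018 \
    --data_files entities.tsv relations.tsv all_ctups_10.tsv --format udd_hrt \
    --save_path ckpts/ComplEx_SXSW2018_$(date +%Y%m%d_%H%M%S)
```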

AlexMRuch commented 4 years ago

Thanks! Glad to hear that it was a helpful suggestion.


AlexMRuch commented 4 years ago

FYI, I also realized that this saves the entity and relation embeddings as FB15k_ComplEx_entity.npy and FB15k_ComplEx_relation.npy instead of with the name of my dataset ("SXSW2018"). I manually updated it for now, but just wanted to give a heads up that it's not just the ckpts folder name that has FB15k.

Also, it is a little odd that the folder name is MODEL_DATASET_ITERATION while the entity/relation embeddings are named DATASET_MODEL_TYPE.npy. For consistency, shouldn't they both be either MODEL_DATASET or DATASET_MODEL (preferably the latter: DATASET_MODEL_*)?

AlexMRuch commented 4 years ago

Manually changing the folder and entity/relation *.npy names still generates errors with dglke_eval:


```
amruch@wit:~/graphika/kg$ DGLBACKEND=pytorch dglke_eval \
> --data_path results_SXSW_2018 \
> --data_files entities.tsv relations.tsv all_ctups_10.tsv valid.tsv test.tsv --format udd_hrt \
> --model_name ComplEx \
> --hidden_dim 512 --gamma 128 \
> --mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 \
> --batch_size_eval 1024 --neg_sample_size_eval 10000 --eval_percent 20 \
> --model_path /home/amruch/graphika/kg/ckpts/SXSW2018_ComplEx_20200507/
Using backend: pytorch
Reading train triples....
Finished. Read 113336766 train triples.
Reading valid triples....
Finished. Read 5383497 valid triples.
Reading test triples....
Finished. Read 5666839 test triples.
Logs are being recorded at: /home/amruch/graphika/kg/ckpts/SXSW2018_ComplEx_20200507/eval.log
/usr/local/lib/python3.6/dist-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
|valid|: 5383497|test|: 5666839
Traceback (most recent call last):
  File "/usr/local/bin/dglke_eval", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/dglke/eval.py", line 196, in main
    model = load_model_from_checkpoint(logger, args, n_entities, n_relations, ckpt_path)
  File "/usr/local/lib/python3.6/dist-packages/dglke/train_pytorch.py", line 109, in load_model_from_checkpoint
    model.load_emb(ckpt_path, args.dataset)
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/general_models.py", line 178, in load_emb
    self.entity_emb.load(path, dataset+'_'+self.model_name+'_entity')
  File "/usr/local/lib/python3.6/dist-packages/dglke/models/pytorch/tensor_models.py", line 318, in load
    self.emb = th.Tensor(np.load(file_name))
  File "/home/amruch/.local/lib/python3.6/site-packages/numpy/lib/npyio.py", line 384, in load
    fid = open(file, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/home/amruch/graphika/kg/ckpts/SXSW2018_ComplEx_20200507/FB15k_ComplEx_entity.npy'
```
^^^ It still expects things to be `FB15k`
classicsong commented 4 years ago

What is under /home/amruch/graphika/kg/ckpts/SXSW2018_ComplEx_20200507/? It needs the dataset name as a prefix in the name of the saved embedding. https://github.com/awslabs/dgl-ke/blob/27c9b98c271fbb7445db8679a541543031c8866d/python/dglke/models/general_models.py#L168-L180
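Based on the traceback and the linked `load_emb` code, the loader appears to build the file names as `<dataset>_<model_name>_entity.npy` / `_relation.npy`, with `--dataset` defaulting to FB15k. A small sketch of that naming rule (hypothetical helper, not a dgl-ke function):

```python
import os

def expected_emb_files(ckpt_path, dataset, model_name):
    """Mirror the lookup in load_emb(): the entity/relation
    embeddings must be named <dataset>_<model_name>_entity.npy and
    <dataset>_<model_name>_relation.npy inside the checkpoint dir."""
    return [os.path.join(ckpt_path,
                         "{}_{}_{}.npy".format(dataset, model_name, part))
            for part in ("entity", "relation")]
```

So with the default dataset name, the loader looks for `FB15k_ComplEx_entity.npy` even inside a renamed checkpoint folder.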

AlexMRuch commented 4 years ago

The model defaulted to saving my entity and relation embeddings as FB15k_ComplEx_*.npy; however, I renamed those files to SXSW_ComplEx_*.npy. I also renamed the ckpt folder from mentioning FB15k to mentioning SXSW2018. I did not pass anything to --dataset, as the instructions don't seem to state that you need to when you use user-defined knowledge graph data: https://aws-dglke.readthedocs.io/en/latest/train_user_data.html. In my training command, I never included the --dataset option, and my model trained fine (i.e., it loaded the correct training, validation, and testing data, not the FB15k set):

```
DGLBACKEND=pytorch dglke_train \
--data_path results_SXSW_2018 \
--data_files entities.tsv relations.tsv all_ctups_10.tsv --format udd_hrt \
--model_name ComplEx \
--max_step 300000 --batch_size 1024 --neg_sample_size 1024 --neg_deg_sample --log_interval 1000 \
--hidden_dim 512 --gamma 128 --lr 0.085 -adv --regularization_coef 1.00E-9 \
--mix_cpu_gpu --num_proc 6 --num_thread 5 --gpu 0 1 --rel_part --async_update --force_sync_interval 1000
```

AlexMRuch commented 4 years ago

I presumed that --dataset was only used if one wished to use a built-in knowledge graph:

Users can specify one of the [pre-defined] datasets with --dataset option in their tasks.

https://aws-dglke.readthedocs.io/en/latest/train_built_in.html

classicsong commented 4 years ago

> I presumed that --dataset was only used if one wished to use a built-in knowledge graph:
>
> > Users can specify one of the [pre-defined] datasets with --dataset option in their tasks.
>
> https://aws-dglke.readthedocs.io/en/latest/train_built_in.html

You can also use it to name your own dataset. The embedding file name will also change.
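Putting that together for this thread's setup, a re-run of the failing evaluation could look like the sketch below (assuming dglke_eval accepts --dataset the same way the traceback's args.dataset suggests):

```shell
# Pass --dataset so evaluation looks for SXSW2018_ComplEx_*.npy
# instead of the default FB15k_ComplEx_*.npy prefix.
DGLBACKEND=pytorch dglke_eval \
    --dataset SXSW2018 \
    --data_path results_SXSW_2018 \
    --data_files entities.tsv relations.tsv all_ctups_10.tsv valid.tsv test.tsv --format udd_hrt \
    --model_name ComplEx \
    --hidden_dim 512 --gamma 128 \
    --model_path /home/amruch/graphika/kg/ckpts/SXSW2018_ComplEx_20200507/
```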

AlexMRuch commented 4 years ago

Ah, gotcha. I'll try that on my next run. If that option both names your own dataset and selects a built-in one, shouldn't it be a required parameter? That's probably the simplest solution to this bug.


classicsong commented 4 years ago

For UDD and Raw_UDD, users should provide a dataset name. Fixed in PR #105.