TobiWeller / RDF2Vec-pytorch

PyTorch Implementation of RDF2Vec
Creative Commons Attribution Share Alike 4.0 International

Possible issue with samples for training #3

Open fjben opened 2 years ago

fjben commented 2 years ago

Hello @TobiWeller,

First of all, thank you for sharing this implementation! I'm observing some unexpected behaviour, possibly a bug, and would appreciate it if you could take a look. Thank you!

Problem description

I ran the code in main without problems, but it seems that, in the background, train() repeatedly uses the same walk from the beginning to the end of the training phase. More concretely, if I have 48475 extracted walks, then in one epoch/iteration the training loop runs 48475 times as expected, but always uses the first walk of the first entity in the walks list of lists.
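A symptom like this can be quantified with a small coverage check. The sketch below is not part of the repository: `center_coverage` and the toy walks are hypothetical, and it assumes the center words seen during training can be collected into a plain list (for example with a temporary log statement).

```python
from typing import Hashable, Iterable, Sequence


def center_coverage(walks: Sequence[Sequence[Hashable]],
                    seen_centers: Iterable[Hashable]) -> float:
    """Fraction of distinct tokens in `walks` that were observed as a center word."""
    vocab = {token for walk in walks for token in walk}
    if not vocab:
        return 0.0
    seen = set(seen_centers) & vocab
    return len(seen) / len(vocab)


# Toy data standing in for the extracted walks (hypothetical values).
walks = [["a", "p", "b"], ["c", "p", "d"], ["e", "q", "f"]]

# Centers that a loop stuck on the first walk would produce.
stuck = ["a", "p", "b"] * 3
# Centers that a loop visiting every walk would produce.
healthy = [token for walk in walks for token in walk]

print(center_coverage(walks, stuck))
print(center_coverage(walks, healthy))
```

A coverage well below 1.0 after a full epoch would be consistent with the training loop seeing only a small part of the walks.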

I observed this behaviour when inspecting sample_batched at line 161 of Trainer.py. Every sample is some variation of the first walk, as mentioned above. On further checking, it seems that in data_reader.py the nested for loops of Word2VecDataset use only the first line of data.walks, and only the first words of that line.
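For comparison, here is a minimal sketch of the iteration pattern one would expect from a skip-gram dataset: every walk and every position in each walk contributes (center, context) pairs. This is not the code in data_reader.py. The function name `skipgram_pairs`, the `window` parameter, and the toy walks are assumptions made for illustration, and negative sampling is left out.

```python
from typing import Iterator, Sequence, Tuple


def skipgram_pairs(walks: Sequence[Sequence[str]],
                   window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs from all walks within a symmetric window."""
    for walk in walks:                      # every walk, not only walks[0]
        for i, center in enumerate(walk):   # every position in that walk
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:                  # skip the center word itself
                    yield center, walk[j]


# Toy walks (hypothetical): entity, predicate, entity.
walks = [["e1", "p1", "e2"], ["e3", "p2", "e4"]]
pairs = list(skipgram_pairs(walks, window=1))

# Centers should come from both walks, not just the first one.
centers = {center for center, _ in pairs}
print(len(pairs), sorted(centers))
```

If the samples observed in the training loop only ever contain tokens from the first walk, the dataset's loops are likely not advancing through the walks the way the outer `for walk in walks` does here.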

Steps to reproduce with minimal code snippet

I haven't changed anything in the original code except batch_size and iterations, plus some print/log debugging statements not shown here.

```python
walks_obj = Word2VecWalks('./data/mutag/train.tsv', './data/mutag/test.tsv', 'label_mutagenic')

walks = walks_obj.get_walks(
    './data/mutag/mutag.owl',
    {'http://dl-learner.org/carcinogenesis#isMutagenic'},
    [
        ['http://dl-learner.org/carcinogenesis#hasBond', 'http://dl-learner.org/carcinogenesis#inBond'],
        ['http://dl-learner.org/carcinogenesis#hasAtom', 'http://dl-learner.org/carcinogenesis#charge'],
    ],
)

w2v = Word2VecTrainer_Skipgram(walks=walks, batch_size=1, iterations=1, min_count=0)

w2v.train()
```

Environment

- Operating system: Windows 10
- Python version: 3.10.2
- Torch version: 1.11.0

P.S. If it would help, I can send you the debugging output that led to this.

fjben commented 2 years ago

Hello @TobiWeller,

The issue really does seem to be in the Word2VecDataset class. If you can confirm the problem, I have a possible solution that seems to be working for me. Let me know if that would be of use to you.