Performance of train_test_split_no_unseen

kalinin-sanja commented 4 years ago

Thank you for such an amazing tool, documentation, and tutorials! This project helps shed light on knowledge graph embeddings and the evaluation protocol. However, AmpliGraph is a little inconvenient for real-world tasks. In particular, I've run into a performance issue with the train_test_split_no_unseen function. My graph contains 2M nodes, 13 relationships, and 125M triples, and the function did not finish after a week of computation. Is it possible to improve the algorithm? Could it be parallelized? After reviewing the code, I found that it uses a Python dictionary, which has high overhead and could be one reason for the low performance. Moreover, there are a lot of calls to np.unique and np.append.

Also, it seems that the way the dataset is split into train/valid/test is incorrect. The documentation says that we should first split the data into X/test and then split X into train/valid. But then there is no guarantee that all entities and relations in the test samples will be present in train, because some of them could be moved to the valid set.
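One way to check the second concern on a concrete split is to compare the entities and relations of each held-out set against the final train set. The helper below is a minimal sketch and not part of AmpliGraph: the name `unseen_in` and the toy triples are hypothetical, and it assumes each set is an (n, 3) array of (head, relation, tail). Whether anything actually turns up unseen depends on how the second split is performed.

```python
import numpy as np

def unseen_in(train, other):
    """Return the entities and relations of `other` that never occur in `train`.

    Both arguments are (n, 3) arrays of (head, relation, tail).
    `unseen_in` is a hypothetical helper, not an AmpliGraph function.
    """
    train = np.asarray(train)
    other = np.asarray(other)
    # Entities are everything that appears as a head or as a tail.
    train_ents = np.union1d(train[:, 0], train[:, 2])
    other_ents = np.union1d(other[:, 0], other[:, 2])
    unseen_ents = np.setdiff1d(other_ents, train_ents)
    unseen_rels = np.setdiff1d(other[:, 1], train[:, 1])
    return unseen_ents, unseen_rels

# Toy split in which the only other triple containing "c" ended up in valid.
train = [["a", "r", "b"]]
valid = [["b", "r", "c"]]
test = [["c", "r", "a"]]

unseen_ents, unseen_rels = unseen_in(train, test)
print(unseen_ents, unseen_rels)
```

Running the same check on (train, valid) and (train, test) after both splitting steps shows whether the guarantee still holds for the final train set.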

Best regards.

kalinin-sanja commented 4 years ago

Also, it seems to be a mistake to decrease the head/relation/tail counts of a triple under consideration before one can be sure that it was actually added to the test set.
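To make the points about counting concrete, here is a minimal sketch of a split that keeps occurrence counts in flat NumPy arrays instead of a dictionary and decrements them only after a triple has been accepted into the test set. It is not AmpliGraph's implementation: the function name `split_no_unseen_sketch` is hypothetical, and it assumes the triples have already been encoded as non-negative integer ids in an (n, 3) array.

```python
import numpy as np

def split_no_unseen_sketch(X, test_size, seed=0):
    """Greedy train/test split that keeps every test entity and relation in train.

    X is an (n, 3) integer array of (head, relation, tail) ids.
    Hypothetical sketch, not AmpliGraph's train_test_split_no_unseen.
    """
    X = np.asarray(X)
    rng = np.random.default_rng(seed)
    n_ent = int(max(X[:, 0].max(), X[:, 2].max())) + 1
    n_rel = int(X[:, 1].max()) + 1

    # Occurrence counts over the current train set, stored in flat arrays.
    ent_count = (np.bincount(X[:, 0], minlength=n_ent)
                 + np.bincount(X[:, 2], minlength=n_ent))
    rel_count = np.bincount(X[:, 1], minlength=n_rel)

    in_test = np.zeros(len(X), dtype=bool)
    picked = 0
    for i in rng.permutation(len(X)):
        if picked == test_size:
            break
        h, r, t = X[i]
        if h == t:
            # A self-loop uses two occurrences of the same entity.
            ok = ent_count[h] >= 3
        else:
            ok = ent_count[h] >= 2 and ent_count[t] >= 2
        ok = ok and rel_count[r] >= 2
        if not ok:
            continue  # rejected: counts are left untouched
        # Accepted: only now are the counts updated.
        in_test[i] = True
        ent_count[h] -= 1
        ent_count[t] -= 1
        rel_count[r] -= 1
        picked += 1

    # If fewer than test_size triples qualify, the test set is smaller.
    return X[~in_test], X[in_test]

# Small deterministic example: 200 triples, 10 entities, 3 relations.
X = np.array([(i % 10, i % 3, (7 * i + 1) % 10) for i in range(200)])
train, test = split_no_unseen_sketch(X, test_size=20)
```

This still uses a single-threaded Python loop, so it is only meant to illustrate the bookkeeping. Each candidate costs a constant number of array lookups, with no np.unique or np.append calls inside the loop.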

sumitpai commented 3 years ago

Closing this. Please follow #242

Accenture / AmpliGraph

Performance of train_test_split_no_unseen #220