Thank you for such an amazing tool, documentation, and tutorials! This project helps to shed light on knowledge graph embeddings and the evaluation protocol.
However, AmpliGraph is a little inconvenient for real-world tasks. In particular, I've run into a performance issue with the `train_test_split_no_unseen` function. My graph contains 2M nodes, 13 relation types, and 125M triples, and the function did not finish after a week of computation.
Is it possible to improve the algorithm? Could it be parallelized? After reviewing the code, I found that it relies on a Python dictionary, which has high overhead and could be a reason for the poor performance. Moreover, there are many calls to `np.unique` and `np.append`.
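For instance, the per-triple dictionary bookkeeping could perhaps be replaced by a single vectorized pass. A minimal sketch in plain NumPy (not AmpliGraph code, just an illustration of the idea) of counting entity and relation frequencies in one shot:

```python
import numpy as np

# Toy triple array: (subject, predicate, object)
X = np.array([
    ["a", "r1", "b"],
    ["b", "r1", "c"],
    ["a", "r2", "c"],
    ["c", "r1", "a"],
])

# One vectorized pass over all subjects and objects, instead of a
# Python dict updated triple by triple in a loop.
entities, ent_counts = np.unique(
    np.concatenate([X[:, 0], X[:, 2]]), return_counts=True
)
relations, rel_counts = np.unique(X[:, 1], return_counts=True)

print(dict(zip(entities, ent_counts)))   # {'a': 3, 'b': 2, 'c': 3}
print(dict(zip(relations, rel_counts)))  # {'r1': 3, 'r2': 1}
```

This kind of batch counting scales to hundreds of millions of triples, whereas repeated `np.append` calls are O(n) each because they reallocate the array every time.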
Also, the documented way of splitting the dataset into train/valid/test seems incorrect. The documentation says to first split the data into X/test, and then split X into train/valid. But then there is no guarantee that all test entities appear in train, because some of their remaining triples could end up in the valid set.
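The leak can be demonstrated with a naive two-step split (plain NumPy here, hypothetical toy data, not the library's own splitter, but the ordering problem is the same): the first split guarantees test entities occur in the remainder, yet the second split may move all of their remaining triples into valid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain-shaped toy graph of 10 triples.
X = np.array([[f"e{i}", "r", f"e{i + 1}"] for i in range(10)])

# Step 1 (per the docs): hold out a test set from the full graph X.
idx = rng.permutation(len(X))
test, rest = X[idx[:2]], X[idx[2:]]

# Step 2: split the remainder into valid and train.
idx2 = rng.permutation(len(rest))
valid, train = rest[idx2[:2]], rest[idx2[2:]]

# Test entities absent from train: this happens whenever step 2
# moved all of an entity's remaining triples into the valid set.
train_ents = set(train[:, 0]) | set(train[:, 2])
test_ents = set(test[:, 0]) | set(test[:, 2])
unseen = test_ents - train_ents
print(unseen)  # may be non-empty: entities unseen at training time
```

A safer order would be to carve valid out first and then split the remainder into train/test (or to check test entities against the final train set and re-sample), so that the "no unseen" guarantee is made relative to the actual training split.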
Best regards.