Accenture / AmpliGraph

Python library for Representation Learning on Knowledge Graphs https://docs.ampligraph.org
Apache License 2.0

Train and Test Data split error #274

Closed anonimoustt closed 10 months ago

anonimoustt commented 10 months ago

Hi,

I was trying to split the following protein data:

[
['0', ' associated', ' WTGMESEEENK']
['1', ' associated', ' ILSDPSDDTKG']
['2', ' associated', ' VRLIPSWTTVI']
['3', ' associated', ' PQTSPSPKRAT']
['4', ' associated', ' PPLVGTYNTLL']
['5', ' associated', ' QRLIQSHPESA']
['6', ' associated', ' EKLALYVYEYL']
['7', ' associated', ' GSRSRTPSLPT']
['8', ' associated', ' IDLPMSPRTLD']
['9', ' associated', ' LRIICSHEHYV']
['10', ' associated', ' IKEDVYLSHDH']
['11', ' associated', ' NRNGDTCVTLL']
['12', ' associated', ' SAEMKSAALEE']
['13', ' associated', ' LSDSLSGSSLY']
['14', ' associated', ' LPRASSLNENV']
['15', ' associated', ' NSGDFYDLYGG']
['16', ' associated', ' SVNPEYFSAAD']
['17', ' associated', ' NPGLETHRKRK']
['18', ' associated', ' EVFDFSQRQKD']
['19', ' associated', ' FKRQLSLRINE']
['20', ' associated', ' ASPSNSCQDST']
['21', ' associated', ' EDRFLTPGRAQ']
['22', ' associated', ' LSRVDSTTCLF']
]

Using

import numpy as np
from ampligraph.evaluation import train_test_split_no_unseen

X_train, X_test = train_test_split_no_unseen(np.array(data33), test_size=10)  # , allow_duplication=True)

This raises an error, and setting allow_duplication=True does not fix it. Without allow_duplication, I get:

Exception: Cannot create a test split of the desired size. Some entities will not occur in both training and test set. Set allow_duplication=True,remove filter on test predicates or set test_size to a smaller value.

Once I set allow_duplication=True, I get the following error instead: ValueError: 'a' cannot be empty unless no samples are taken

Would you please help to resolve this?

albernar commented 10 months ago

Hello! This is not a bug but intended behaviour. When the dataset is too small, or test_size is too large relative to the dataset, it may be impossible to split the triples so that every entity in the test set also appears in the training set. Most KGE methods are transductive, which means that the entities in the test set have to be a subset of the entities in the training set. When train_test_split_no_unseen raises that error, it is precisely because, given those triples and that test_size, no train/test split satisfies this constraint.
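To make the constraint concrete, here is a minimal sketch in plain Python. It does not use AmpliGraph and is not the library's implementation. The helper name `movable_triples` is hypothetical. It only checks a necessary condition: a triple can go to the test set only if its subject and its object each occur in at least one other triple. The first three rows are copied from the question; the `dense` list is made-up data for contrast.

```python
from collections import Counter

# First three rows from the question: every subject and object occurs once.
triples = [
    ("0", " associated", " WTGMESEEENK"),
    ("1", " associated", " ILSDPSDDTKG"),
    ("2", " associated", " VRLIPSWTTVI"),
]

# Made-up contrast data in which every entity occurs in two triples.
dense = [
    ("a", "associated", "x"),
    ("a", "associated", "y"),
    ("b", "associated", "x"),
    ("b", "associated", "y"),
]


def movable_triples(rows):
    """Return the triples whose subject and object both occur in another triple.

    This is only a necessary condition for moving a triple to the test set;
    it ignores relations and the effect of moving several triples at once.
    """
    counts = Counter()
    for s, _, o in rows:
        # Count each entity once per triple, so a self-loop is not counted twice.
        for entity in {s, o}:
            counts[entity] += 1
    return [(s, p, o) for s, p, o in rows if counts[s] > 1 and counts[o] > 1]


print(len(movable_triples(triples)))  # no candidates for the test set
print(len(movable_triples(dense)))    # every triple is a candidate
```

With data shaped like the question's, the candidate list is empty, so no test_size greater than zero can be satisfied.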

albernar commented 10 months ago

As for the allow_duplication=True case: that argument allows triples to be duplicated at random within the test set. But since all your triples have different subjects, not even one triple can be placed in the test set, so there is nothing to duplicate, and that is where your error comes from.
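The wording of that ValueError matches what NumPy raises when asked to sample from an empty array. The sketch below reproduces it with `np.random.choice` alone. Whether train_test_split_no_unseen calls this exact function internally is an assumption, and the `candidates` array is a stand-in for "triples eligible for the test set".

```python
import numpy as np

# Stand-in for the (empty) pool of triples eligible for the test set.
candidates = np.array([])

message = None
try:
    # Ask for 10 samples from an empty pool, as test_size=10 would.
    np.random.choice(candidates, size=10)
except ValueError as err:
    message = str(err)

print(message)
```

Sampling a non-zero number of items from an empty pool cannot succeed, so a smaller test_size would not help here. The data itself needs entities that occur in more than one triple.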

I hope this helps. If not, feel free to reopen.