Accenture / AmpliGraph

Python library for Representation Learning on Knowledge Graphs https://docs.ampligraph.org
Apache License 2.0

Out of CPU memory on large datasets for evaluate_performance - in filter_unseen_entities #163

Closed · sumitpai closed this issue 4 years ago

sumitpai commented 4 years ago

Description

A user posted on Slack that they were getting an out-of-memory error from evaluate_performance. On further probing, it looks like it runs out of CPU memory when we try to filter_unseen_entities. We should find a way to perform this check in a memory-efficient manner, as in its current form it is highly inefficient, especially on large datasets. (Figure out whether all of this can be done directly in the database, right from mapping through filtering.)
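
A memory-efficient version of the check could, for instance, stream over the triples in chunks instead of materialising full-size masks at once. This is an illustrative sketch only, not the library's implementation: the function name filter_unseen_entities_lowmem and the chunk_size parameter are made up, and it assumes the model's ent_to_idx mapping as the seen-entity vocabulary:

import numpy as np

def filter_unseen_entities_lowmem(X, ent_to_idx, chunk_size=100000):
    # Drop triples whose subject or object is unseen, one chunk at a time,
    # so peak memory is bounded by chunk_size rather than by len(X).
    seen = set(ent_to_idx.keys())
    kept = []
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        mask = np.array([s in seen and o in seen for s, _, o in chunk], dtype=bool)
        kept.append(chunk[mask])
    return np.concatenate(kept) if kept else X[:0]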

Also, since the user had used the train/test split function provided by AmpliGraph that guarantees no unseen entities, the check that goes out of memory is redundant in their case. It should therefore be optional, as sketched below.
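
For reference, the split utility in question guarantees that the test set contains no entities or relations unseen in training, which is why the unseen-entity check adds nothing in that case. A minimal sketch with toy triples (the data is made up; train_test_split_no_unseen is the AmpliGraph helper the user relied on):

import numpy as np
from ampligraph.evaluation import train_test_split_no_unseen

# Toy knowledge graph as (subject, predicate, object) triples
X = np.array([['a', 'likes', 'b'],
              ['b', 'likes', 'c'],
              ['a', 'knows', 'c'],
              ['c', 'knows', 'a'],
              ['b', 'knows', 'a']])

# Every entity/relation in X_test also appears in X_train by construction,
# so filter_unseen_entities would have nothing to remove here.
X_train, X_test = train_test_split_no_unseen(X, test_size=2)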

Actual Behavior

Expected Behavior

Steps to Reproduce

sumitpai commented 4 years ago

Workaround

As a workaround/quick fix, until it is fixed on the master branch, users can use the following snippet if they encounter the issue with filter_unseen_entities (assuming all the entities in the test/filter sets are present in the training set):

import numpy as np

from ampligraph.datasets import AmpligraphDatasetAdapter, NumpyDatasetAdapter

def evaluate_performance(X, model, filter_triples=None, verbose=False, strict=True, entities_subset=None,
                         corrupt_side='s+o', use_default_protocol=True):

    dataset_handle = None
    # try-except block is mainly to handle clean up in case of exception or manual stop in jupyter notebook
    try:

        if isinstance(X, np.ndarray):
            # Wrap raw test triples in an adapter, reusing the model's mappings
            X_test = X

            dataset_handle = NumpyDatasetAdapter()
            dataset_handle.use_mappings(model.rel_to_idx, model.ent_to_idx)
            dataset_handle.set_data(X_test, "test")
        elif isinstance(X, AmpligraphDatasetAdapter):
            dataset_handle = X
        else:
            msg = "X must be either a numpy array or an AmpligraphDatasetAdapter."
            raise ValueError(msg)

        if filter_triples is not None:
            if isinstance(filter_triples, np.ndarray):
                # Deliberately skip filter_unseen_entities(filter_triples, model, verbose=verbose, strict=strict):
                # that is the step that runs out of CPU memory, and it is redundant when
                # all entities in the filter/test sets are known to appear in training.
                dataset_handle.set_filter(filter_triples)
                model.set_filter_for_eval()
            elif isinstance(X, AmpligraphDatasetAdapter):
                if not isinstance(filter_triples, bool):
                    raise Exception('Expected a boolean type')
                if filter_triples is True:
                    model.set_filter_for_eval()
            else:
                raise Exception('Invalid datatype for filter. Expected a numpy array or preset data in the adapter.')

        eval_dict = {'default_protocol': False}

        # The default protocol corrupts both the subject and the object side
        if use_default_protocol:
            corrupt_side = 's+o'
            eval_dict['default_protocol'] = True

        if entities_subset is not None:
            # Map entity URIs in the subset to their internal indices
            idx_entities = np.asarray([idx for uri, idx in model.ent_to_idx.items() if uri in entities_subset])
            eval_dict['corruption_entities'] = idx_entities

        eval_dict['corrupt_side'] = corrupt_side

        model.configure_evaluation_protocol(eval_dict)

        ranks = model.get_ranks(dataset_handle)

        model.end_evaluation()

        return np.array(ranks)

    except BaseException as e:
        model.end_evaluation()
        if dataset_handle is not None:
            dataset_handle.cleanup()
        raise e
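
For completeness, a hedged usage sketch of the snippet above; the model choice and hyperparameters are illustrative, and X_train/X_test are assumed to come from a no-unseen split such as the one shown earlier:

from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import mrr_score, hits_at_n_score

model = ComplEx(batches_count=10, epochs=20, k=50, verbose=True)
model.fit(X_train)

# Pass the training triples as the filter; since filter_unseen_entities is
# skipped, this no longer exhausts CPU memory on large datasets.
ranks = evaluate_performance(X_test, model=model, filter_triples=X_train,
                             use_default_protocol=True, verbose=True)
print(mrr_score(ranks), hits_at_n_score(ranks, n=10))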
sumitpai commented 4 years ago

The workaround is implemented: filtering of unseen entities is skipped when the user is sure that no unseen entities exist in the filter and test sets.