facebookresearch / poincare-embeddings

PyTorch implementation of the NIPS-17 paper "Poincaré Embeddings for Learning Hierarchical Representations"

reconstruction.py does not handle CSV data input #36

Closed 0xSameer closed 5 years ago

0xSameer commented 5 years ago

The transitive_closure.py script creates CSV edge lists for both the WordNet noun and mammal subsets. These files are then used to train the hyperbolic embeddings in embed.py.
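For reference, a minimal sketch of inspecting one of those files, assuming the id1/id2/weight columns that transitive_closure.py writes (column names are an assumption here, not confirmed above):

import pandas as pd

# Each row of the closure file is one (child, parent) edge of the
# transitive closure -- an edge list, not an adjacency matrix.
edges = pd.read_csv('mammal_closure.csv')
print(edges.columns.tolist())  # expected: ['id1', 'id2', 'weight']
print(len(edges), 'edges')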

If we run the reconstruction.py script on a pretrained model, it fails because the script currently only accepts HDF5 input: https://github.com/facebookresearch/poincare-embeddings/blob/61406b1bb180234cd34d9972d8853de2fb1a14f8/reconstruction.py#L41 https://github.com/facebookresearch/poincare-embeddings/blob/61406b1bb180234cd34d9972d8853de2fb1a14f8/reconstruction.py#L42

format = 'hdf5' if dset.endswith('.h5') else 'csv'
dset = load_adjacency_matrix(dset, 'hdf5')

The format 'hdf5' is hard-coded on line 42, so the format variable computed on line 41 is never used.
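A first step would be to pass the detected format through instead of the literal; a minimal sketch of that one-line change:

format = 'hdf5' if dset.endswith('.h5') else 'csv'
dset = load_adjacency_matrix(dset, format)  # use the detected format

On its own this is not sufficient, though, for the reason below.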

The deeper problem is that the function being called, load_adjacency_matrix, expects the file to contain an adjacency matrix, whereas noun_closure.csv and mammal_closure.csv are both edge lists. To fix this, we borrowed code from embed.py:

if dset.endswith('.h5'):
    # existing code (mostly) ...
elif dset.endswith('.csv'):
    # Build a node -> neighbor-set mapping from the edge list,
    # using the same load_edge_list loader that embed.py uses.
    adj_temp = {}
    idx, _, _ = load_edge_list(dset, sym)
    for row in idx:
        x = row[0].item()
        y = row[1].item()
        if x in adj_temp:
            adj_temp[x].add(y)
        else:
            adj_temp[x] = {y}
    # Optionally restrict evaluation to a random sample of nodes (args.sample)
    sample_size = args.sample or len(adj_temp)
    sample = np.random.choice(list(adj_temp.keys()), size=sample_size, replace=False)
    adj = {i: adj_temp[i] for i in sample}
else:
    raise ValueError(f'Unsupported file format: {dset}')

With this change, we get the same mean rank and MAP as calculated and printed in the training loop.
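For anyone who wants to drop this in without touching the surrounding control flow, here is a self-contained sketch of the same logic as a helper function. It is only an illustration: the name load_adjacency_from_csv is hypothetical, it assumes load_edge_list is importable from hype.graph (as embed.py imports it), and it assumes the first element of the returned tuple is an iterable of (x, y) index pairs, as in the snippet above.

from collections import defaultdict

import numpy as np

from hype.graph import load_edge_list  # assumed import path, as in embed.py


def load_adjacency_from_csv(path, sym=False, sample=None):
    # Hypothetical helper wrapping the elif branch above:
    # build a node -> set-of-neighbors dict from a closure edge list.
    idx, _, _ = load_edge_list(path, sym)
    adj = defaultdict(set)
    for row in idx:
        adj[row[0].item()].add(row[1].item())
    # Optionally restrict evaluation to a random subset of source nodes
    if sample is not None and sample < len(adj):
        keep = np.random.choice(list(adj.keys()), size=sample, replace=False)
        return {i: adj[i] for i in keep}
    return dict(adj)

A call like adj = load_adjacency_from_csv(dset, sym, args.sample) would then stand in for the whole elif branch.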

lematt1991 commented 5 years ago

Yeah, I've been planning a fix for this, just haven't gotten around to implementing it. Will get this fixed ASAP.

lematt1991 commented 5 years ago

This should be fixed by #46.