Closed: xingjian-zhang closed this 1 year ago
This is an automatic reminder to paste the local test results of wiki as a comment in this PR, in case you haven't done so. The aforementioned datasets are too large to be tested with the GitHub Actions workflow here. The local test result for each dataset can be obtained by running `make pytest DATASET=<dataset name>`. For more details, please refer to the dataset submission guide.
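For example, for the wiki dataset flagged above, the invocation would be:

```bash
make pytest DATASET=wiki
```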
This is expected, as we are modifying all datasets by removing their urls.json files.
Pytests failed: log. The failures consist of two parts; one is `KeyError: 'predict_tail'`, raised for all `KGEntityPrediction` tasks. I think the test failures are triggered by code from before this PR.
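For context, a `KeyError` of this shape typically means the task loader indexes its config dict directly for a field the task file does not define. A purely illustrative sketch (the `task` dict below is hypothetical, not the actual GLI task schema):

```python
# Hypothetical illustration only, not GLI's actual loader code: a
# direct dict lookup turns a missing task field into a bare KeyError.
task = {"type": "KGEntityPrediction", "predict_head": [0, 1, 2]}
try:
    targets = task["predict_tail"]  # raises KeyError: 'predict_tail'
except KeyError as err:
    print("missing task field:", err)
```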
I tried to run the following code and successfully got the URL for snap_patents.npz:

```python
from gli.utils import _get_url_from_server

print(_get_url_from_server('snap_patents.npz'))
```

The result is 'https://www.dropbox.com/s/yplq00csa3vyogp/snap_patents.npz?dl=0'. Maybe the HTTPS server is unstable?
```
In [1]: from gli.utils import _get_url_from_server

In [2]: from gli import get_gli_graph

In [3]: for i in range(5):
   ...:     print(_get_url_from_server('snap_patents.npz'))
   ...:
https://www.dropbox.com/s/yplq00csa3vyogp/snap_patents.npz?dl=0
https://www.dropbox.com/s/yplq00csa3vyogp/snap_patents.npz?dl=0
https://www.dropbox.com/s/yplq00csa3vyogp/snap_patents.npz?dl=0
https://www.dropbox.com/s/yplq00csa3vyogp/snap_patents.npz?dl=0
https://www.dropbox.com/s/yplq00csa3vyogp/snap_patents.npz?dl=0

In [4]: get_gli_graph('snap-patents')
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In [4], line 1
----> 1 get_gli_graph('snap-patents')

File ~/Projects/Private/gli/gli/dataloading.py:139, in get_gli_graph(dataset, device, verbose)
    137 if not os.path.exists(metadata_path):
    138     raise FileNotFoundError(f"{metadata_path} not found.")
--> 139 download_data(dataset, verbose=verbose)
    141 return read_gli_graph(metadata_path, device=device, verbose=verbose)

File ~/Projects/Private/gli/gli/utils.py:367, in download_data(dataset, verbose)
    365     data_file_url_dict[data_file] = url_dict[data_file]
    366 else:
--> 367     raise FileNotFoundError(f"cannot find url for {data_file}.")
    369 for data_file_name, url in data_file_url_dict.items():
    370     data_file_path = os.path.join(data_dir, data_file_name)

FileNotFoundError: cannot find url for snap-patents.npz.
```
I can fetch the URL directly by calling `_get_url_from_server`, but the fetch fails when it is called inside `get_gli_graph()`. This is unexpected. Let me take a closer look at this issue.
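One way to narrow this down is to probe both spellings of the file name directly against the URL server. A quick hedged sketch; what `_get_url_from_server` does for a missing file (raise vs. return a falsy value) is an assumption here, so both outcomes are printed:

```python
# Hedged probe: check both spellings against the URL server. The
# session above succeeds with the underscored name but get_gli_graph
# fails on the hyphenated one, so compare them side by side.
from gli.utils import _get_url_from_server

for name in ("snap_patents.npz", "snap-patents.npz"):
    try:
        print(name, "->", _get_url_from_server(name))
    except Exception as err:
        print(name, "-> failed:", type(err).__name__, err)
```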
Fixed the `predict_tail` error via #468.

Found the bug: `snap_patents.npz` exists in remote storage, but `metadata.json` uses `snap-patents.npz`, which does not exist there. `twitch-gamers` and `arxiv-year` share the same issue. I have temporarily fixed them by modifying the corresponding `metadata.json` files manually. This vulnerability will be resolved in the future once we enforce a function-based interface for contributing datasets.
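For reference, a minimal sketch of the kind of manual fix described above, rewriting the stale file name in place; the per-dataset path `datasets/snap-patents/metadata.json` is an assumption about the repo layout:

```python
# Hedged sketch of the manual metadata fix: point the stale hyphenated
# file name at the underscored file that actually exists remotely.
from pathlib import Path

meta = Path("datasets/snap-patents/metadata.json")  # assumed path
meta.write_text(meta.read_text().replace("snap-patents.npz", "snap_patents.npz"))
```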
Description
Related Issue
This PR attempts to fix #462, #425, and #398.
Motivation and Context
How Has This Been Tested?
This change does not involve source code changes.