Can't load dataset file for semisupervised TU

Ripper346 commented 3 years ago

Hi, I have the problem described in https://github.com/Shen-Lab/GraphCL/issues/4#issuecomment-742254458 and #1 when trying to launch the semisupervised TU pre-training. I run python main.py --dataset MUTAG --aug1 random2 --aug2 random2 --lr 0.001 --suffix 0 --exp test and I get this error:

[INFO] running single test..
-----
Total 1 experiments in this run:
1/1 - MUTAG - deg+odeg100+ak3+reall - ResGFN
Here we go..
-----
1/1 - MUTAG - deg+odeg100+ak3+reall - ResGFN
None None
Traceback (most recent call last):
  File "main.py", line 338, in <module>
    run_exp_single_test()
  File "main.py", line 316, in run_exp_single_test
    run_exp_lib([('MUTAG', 'deg+odeg100+ak3+reall', 'ResGFN')])
  File "main.py", line 165, in run_exp_lib
    dataset = get_dataset(
  File "C:\Users\alessandro\Developments\GraphCL\semisupervised_TU\pre-training\datasets.py", line 57, in get_dataset
    dataset = TUDatasetExt(
  File "C:\Users\alessandro\Developments\GraphCL\semisupervised_TU\pre-training\tu_dataset.py", line 49, in __init__
    super(TUDatasetExt, self).__init__(root, name, transform, pre_transform,
  File "C:\_______\envs\torch\lib\site-packages\torch_geometric\datasets\tu_dataset.py", line 66, in __init__
    self.data, self.slices = torch.load(self.processed_paths[0])
  File "C:\_______\envs\torch\lib\site-packages\torch\serialization.py", line 579, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "C:\_______\envs\torch\lib\site-packages\torch\serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "C:\_______\envs\torch\lib\site-packages\torch\serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'data\\MUTAG\\MUTAG\\processed\\data_deg+odeg100+ak3+reall.pt'

I have installed everything in a Python env and read the other two issues, but I can't understand what I have to do to make it work (if there is anything I can do).

yyou1996 commented 3 years ago

Hi @Ripper346,

Thanks for your interest, and a big apology for the frustration. These are the solutions I can come up with:

Would you mind sharing your env information so that I can double-check it? This experiment is built upon an old repo, https://github.com/chentingpc/gfn#requirements, with slightly outdated packages, so I understand you may have installed the required ones, but I ask in case there is an oversight.
I notice this in the error information: FileNotFoundError: [Errno 2] No such file or directory: 'data\\MUTAG\\MUTAG\\processed\\data_deg+odeg100+ak3+reall.pt'. It looks weird to me that the program concatenates the path as 'data\\MUTAG\\MUTAG\\processed\\data_deg+odeg100+ak3+reall.pt' rather than 'data\MUTAG\MUTAG\processed\data_deg+odeg100+ak3+reall.pt'. Is there any way for you to debug this?

Ripper346 commented 3 years ago

I have Python 3.8.8 and I use it fine with other projects that use torch and pytorch-geometric, but here are the requirements of my env (a bit long):

alembic==1.5.8
ase==3.21.1
astroid==2.5.1
async-generator==1.10
attrs==20.3.0
autopep8==1.5.5
backcall==0.2.0
bleach==3.3.0
certifi==2020.12.5
chardet==3.0.4
cliff==3.7.0
cmaes==0.8.2
cmd2==1.5.0
colorama==0.4.4
colorlog==4.8.0
control==0.8.4
cvxopt==1.2.6
cycler==0.10.0
Cython==0.29.22
decorator==4.4.2
defusedxml==0.7.0
dgl-cu110==0.6.0
entrypoints==0.3
future==0.18.2
googledrivedownloader==0.4
grakel==0.1.8
graphkit-learn==0.2.0.post1
greenlet==1.0.0
h5py==3.2.0
idna==2.10
ipdb==0.13.5
ipykernel==5.5.0
ipython==7.21.0
ipython-genutils==0.2.0
isodate==0.6.0
isort==5.7.0
jedi==0.18.0
Jinja2==2.11.3
joblib==1.0.1
jsonschema==3.2.0
jupyter-client==6.1.11
jupyter-core==4.7.1
jupyterlab-pygments==0.1.2
kiwisolver==1.3.1
lazy-object-proxy==1.5.2
llvmlite==0.35.0
Mako==1.1.4
mariadb==1.0.6
MarkupSafe==1.1.1
matplotlib==3.3.4
mccabe==0.6.1
mistune==0.8.4
Mosek==9.2.38
mysql-connector-python==8.0.23
nbclient==0.5.3
nbconvert==6.0.7
nbformat==5.1.2
nest-asyncio==1.5.1
networkx==2.5
nose==1.3.7
numba==0.52.0
numpy==1.20.1
optuna==2.7.0
packaging==20.9
pandas==1.2.3
pandocfilters==1.4.3
parso==0.8.1
pbr==5.5.1
pickleshare==0.7.5
Pillow==8.1.1
prettytable==2.1.0
prompt-toolkit==3.0.16
protobuf==3.15.4
pycodestyle==2.6.0
Pygments==2.8.0
pylint==2.7.2
pyparsing==2.4.7
pyperclip==1.8.2
pyreadline3==3.3
pyrsistent==0.17.3
python-dateutil==2.8.1
python-editor==1.0.4
python-louvain==0.15
pytz==2021.1
pywin32==300
PyYAML==5.4.1
pyzmq==22.0.3
rdflib==5.0.0
requests==2.25.1
rope==0.18.0
scikit-learn==0.24.1
scipy==1.6.1
seaborn==0.11.1
six==1.15.0
SQLAlchemy==1.4.7
stevedore==3.3.0
tabulate==0.8.9
testpath==0.4.4
threadpoolctl==2.1.0
toml==0.10.2
torch==1.8.0+cu111
torch-cluster==1.5.9
torch-geometric==1.6.3
torch-scatter==2.0.6
torch-sparse==0.6.9
torch-spline-conv==1.2.1
torchaudio==0.8.0
torchvision==0.9.0+cu111
tornado==6.1
tqdm==4.58.0
traitlets==5.0.5
typing-extensions==3.7.4.3
urllib3==1.26.3
wcwidth==0.2.5
webencodings==0.5.1
wrapt==1.12.1

I am on Windows; it is normal that the path is shown with two backslashes (\\), since each backslash is escaped in the error message.
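
For illustration, a minimal Python sketch of why the traceback shows doubled backslashes: the FileNotFoundError message prints the file name in its repr() form, which escapes every backslash. The path is copied from the traceback above; the variable names are only illustrative.

```python
# The path from the traceback, written as a raw string so that each
# backslash is a single character.
path = r"data\MUTAG\MUTAG\processed\data_deg+odeg100+ak3+reall.pt"

# repr() escapes every backslash; this is the form that appears in the
# FileNotFoundError message.
shown_in_error = repr(path)

# print() shows the real characters (single backslashes), then the
# escaped form (doubled backslashes inside quotes).
print(path)
print(shown_in_error)

# Compare the number of backslash characters in each form.
print(path.count("\\"), shown_in_error.count("\\"))
```

So the doubled backslashes are only a display artifact; the path that is actually opened contains single separators.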

yyou1996 commented 3 years ago

Thank you. I see your env, and it is too new for the semi_TU repo (torch_geometric>=1.6.0 rather than the required 1.4.0); please refer to https://github.com/Shen-Lab/GraphCL/tree/master/semisupervised_TU#option-1 for the correct environment.

Another option is to try replacing the __init__ function in tu_dataset with:

    url = 'https://ls11-www.cs.tu-dortmund.de/people/morris/' \
          'graphkerneldatasets'

    def __init__(self,
                 root,
                 name,
                 transform=None,
                 pre_transform=None,
                 pre_filter=None,
                 use_node_attr=False,
                 processed_filename='data.pt', aug_ratio=None):
        self.name = name
        self.processed_filename = processed_filename

        self.aug = "none"
        self.aug_ratio = None

        super(TUDatasetExt, self).__init__(root, transform, pre_transform,
                                        pre_filter)
        self.data, self.slices = torch.load(self.processed_paths[0])
        if self.data.x is not None and not use_node_attr:
            self.data.x = self.data.x[:, self.num_node_attributes:]

    @property
    def num_node_labels(self):
        if self.data.x is None:
            return 0
        for i in range(self.data.x.size(1)):
            if self.data.x[:, i:].sum().item() == self.data.x.size(0):
                return self.data.x.size(1) - i
        return 0

    @property
    def num_node_attributes(self):
        if self.data.x is None:
            return 0
        return self.data.x.size(1) - self.num_node_labels

    @property
    def raw_file_names(self):
        names = ['A', 'graph_indicator']
        return ['{}_{}.txt'.format(self.name, name) for name in names]

    @property
    def processed_file_names(self):
        return self.processed_filename

    @property
    def num_node_features(self):
        r"""Returns the number of features per node in the dataset."""
        return self[0][0].num_node_features

which might solve the download issue. A new version of this experiment, adapted to torch_geometric>=1.6.0, will also be released in the following weeks.

Ripper346 commented 3 years ago

OK, thanks. I will try on Monday and keep you informed in this issue. I think that behavior is strange; maybe I could also look at the differences between torch geometric 1.4 and 1.6.

Ripper346 commented 3 years ago

Hi again. Your code didn't solve the issue; I worked around a few other things, and the class now looks like this:

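from itertools import repeat

import numpy as np
import torch
from torch_geometric.datasets import TUDataset

# drop_nodes, weighted_drop_nodes, permute_edges, subgraph and mask_nodes are
# the augmentation helpers, assumed to be defined alongside this class.


class TUDatasetExt(TUDataset):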
    def __init__(self,
                 root,
                 name,
                 transform=None,
                 pre_transform=None,
                 pre_filter=None,
                 use_node_attr=False,
                 processed_filename='data.pt',
                 aug="none", aug_ratio=None):
        self.name = name
        self.processed_filename = processed_filename

        self.aug = aug
        self.aug_ratio = None

        super(TUDatasetExt, self).__init__(root, self.name, transform, pre_transform,
                                           pre_filter, use_node_attr)
        self.data, self.slices = torch.load(self.processed_paths[0])
        if self.data.x is not None and not use_node_attr:
            self.data.x = self.data.x[:, self.num_node_attributes:]

    @property
    def processed_file_names(self):
        return self.processed_filename

    @property
    def num_node_features(self):
        r"""Returns the number of features per node in the dataset."""
        return self[0][0].num_node_features

    def download(self):
        super().download()

    def get(self, idx):
        data = self.data.__class__()

        if hasattr(self.data, '__num_nodes__'):
            data.num_nodes = self.data.__num_nodes__[idx]

        for key in self.data.keys:
            item, slices = self.data[key], self.slices[key]
            if torch.is_tensor(item):
                s = list(repeat(slice(None), item.dim()))
                s[self.data.__cat_dim__(key,
                                        item)] = slice(slices[idx],
                                                       slices[idx + 1])
            else:
                s = slice(slices[idx], slices[idx + 1])
            data[key] = item[s]

        if self.aug == 'dropN':
            data = drop_nodes(data, self.aug_ratio)
        elif self.aug == 'wdropN':
            data = weighted_drop_nodes(data, self.aug_ratio, self.npower)
        elif self.aug == 'permE':
            data = permute_edges(data, self.aug_ratio)
        elif self.aug == 'subgraph':
            data = subgraph(data, self.aug_ratio)
        elif self.aug == 'maskN':
            data = mask_nodes(data, self.aug_ratio)
        elif self.aug == 'none':
            data = data
        elif self.aug == 'random4':
            ri = np.random.randint(4)
            if ri == 0:
                data = drop_nodes(data, self.aug_ratio)
            elif ri == 1:
                data = subgraph(data, self.aug_ratio)
            elif ri == 2:
                data = permute_edges(data, self.aug_ratio)
            elif ri == 3:
                data = mask_nodes(data, self.aug_ratio)
            else:
                print('sample augmentation error')
                assert False

        elif self.aug == 'random3':
            ri = np.random.randint(3)
            if ri == 0:
                data = drop_nodes(data, self.aug_ratio)
            elif ri == 1:
                data = subgraph(data, self.aug_ratio)
            elif ri == 2:
                data = permute_edges(data, self.aug_ratio)
            else:
                print('sample augmentation error')
                assert False

        elif self.aug == 'random2':
            ri = np.random.randint(2)
            if ri == 0:
                data = drop_nodes(data, self.aug_ratio)
            elif ri == 1:
                data = subgraph(data, self.aug_ratio)
            else:
                print('sample augmentation error')
                assert False

        else:
            print('augmentation error')
            assert False

        return data

It can now download the dataset, but it raises the error from this issue again.

Then I tried to install the conda environment of semisupervised TU first, but it can't resolve some dependencies:

ResolvePackageNotFound:
  - ld_impl_linux-64=2.33.1
  - libffi=3.3
  - readline=8.0
  - libgcc-ng=9.1.0
  - libstdcxx-ng=9.1.0
  - ncurses=6.2
  - libedit=3.1.20191231

I then tried a Docker devcontainer (Python 3.7 on Debian Buster) with these requirements:

decorator==4.4.2
future==0.18.2
isodate==0.6.0
joblib==0.16.0
networkx==2.4
numpy==1.19.0
pandas==1.0.5
pillow==7.2.0
plyfile==0.7.2
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.1
rdflib==5.0.0
scikit-learn==0.23.1
scipy==1.5.0
six==1.15.0
threadpoolctl==2.1.0

and then manually installed:

pip3 install torch==1.4.0 torchvision==0.5.0 -f https://download.pytorch.org/whl/torch_stable.html
pip3 install torch-scatter==1.1.0 -f https://pytorch-geometric.com/whl/torch-1.4.0.html
pip3 install torch-sparse==0.4.4 -f https://pytorch-geometric.com/whl/torch-1.4.0.html
pip3 install torch-cluster==1.4.5 -f https://pytorch-geometric.com/whl/torch-1.4.0.html
pip3 install torch-spline-conv==1.1.0 -f https://pytorch-geometric.com/whl/torch-1.4.0.html
pip3 install torch-geometric==1.1.0

The installation and the run of the original code went fine. I only had to make one adjustment in train_eval.py: starting at line 146, I added two checks that create the logs and models folders if they don't exist.
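
A minimal sketch of such a check, assuming the folders are simply named logs and models relative to the working directory (the exact paths used in train_eval.py are not shown here, and ensure_dirs is a hypothetical helper name):

```python
import os


def ensure_dirs(*paths):
    """Create each directory if it does not exist yet."""
    for path in paths:
        # exist_ok=True makes this a no-op when the folder is already there.
        os.makedirs(path, exist_ok=True)


if __name__ == "__main__":
    # Assumed folder names; adjust to whatever train_eval.py writes to.
    ensure_dirs("logs", "models")
```

Calling it once before the first log or checkpoint is written should be enough, and calling it again is harmless.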

I noticed that the issue first appears with torch-geometric 1.4.2; earlier versions don't have that problem.
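
As a rough guard, one could check the installed version before running the pre-training. This is only a sketch: the 1.4.2 threshold comes from the observation above rather than from a changelog, and version_tuple and is_affected are hypothetical helper names.

```python
from importlib import metadata  # available from Python 3.8


def version_tuple(version):
    """Turn '1.8.0+cu111' into (1, 8, 0); non-numeric parts are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if piece.isdigit():
            parts.append(int(piece))
    return tuple(parts)


def is_affected(version, first_bad="1.4.2"):
    """True if `version` is at or after the first release reported as failing."""
    return version_tuple(version) >= version_tuple(first_bad)


if __name__ == "__main__":
    try:
        installed = metadata.version("torch-geometric")
    except metadata.PackageNotFoundError:
        print("torch-geometric is not installed")
    else:
        status = "may hit" if is_affected(installed) else "should avoid"
        print(f"torch-geometric {installed} {status} the loading error")
```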

Shen-Lab / GraphCL

Can't load dataset file for semisupervised TU #26