Create a custom VSD - Githubissues

kainszs commented 7 months ago

Hello,

How can I create a VisualGraphDataset from a pandas DataFrame containing SMILES strings and a target?

Best, Kai

the16thpythonist commented 7 months ago

Hello Kai,

In the future, I intend to add a function to directly create a new dataset from a list structure / dataframe.

Currently, the easiest method is to export the content of the dataframe into a CSV file and convert the dataset by creating a new sub-experiment based on the generate_molecule_dataset_from_csv.py module.
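The export step could look roughly like the sketch below. The column names `SMILES` and `Solubility`, the output file name and the values are placeholders for illustration only; the column names just need to match what is later set in the sub-experiment parameters.

```python
import os

import pandas as pd

# Hypothetical example dataframe with made-up target values;
# replace this with your own data.
df = pd.DataFrame({
    'SMILES': ['CCO', 'c1ccccc1', 'CC(=O)O'],
    'Solubility': [0.1, 0.2, 0.3],
})

# Write the file without the pandas index so that the CSV only
# contains the named columns.
csv_path = os.path.abspath('custom_dataset.csv')
df.to_csv(csv_path, index=False)

# This absolute path is what would go into CSV_FILE_NAME.
print(csv_path)
```

Resolving the path with `os.path.abspath` here is a deliberate choice, since the conversion expects an absolute path for local files.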

This is a small example of how to create such a sub-experiment module:

import os
import pathlib
import typing as t

from pycomex.functional.experiment import Experiment
from pycomex.utils import folder_path, file_namespace
from visual_graph_datasets.util import EXPERIMENTS_PATH

# == SOURCE PARAMETERS ==
# These parameters determine how to handle the source CSV file of the dataset. There exists the possibility
# to define a file from the local system or to download a file from the VGD remote file share location.
# In this section one also has to determine, for example, the type of the source dataset (regression, 
# classification) and provide the names of the relevant columns in the CSV file.

# :param CSV_FILE_NAME:
#       The name of the CSV file to be used as the source for the dataset conversion.
#       This may be one of the following two things:
#       1. A valid absolute file path on the local system pointing to a CSV file to be used as the source for
#       the VGD conversion
#       2. A valid relative path to a CSV file stashed on the given vgd file share provider which will be
#       downloaded first and then processed.
CSV_FILE_NAME: str = '/absolute/path/to/your/file.csv'
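# :param INDEX_COLUMN_NAME:
#       (Optional) The string name of the CSV column which contains an index for each element. Leave this
#       as None if the CSV file has no such column.
INDEX_COLUMN_NAME: t.Optional[str] = None
# :param SMILES_COLUMN_NAME:
#       This has to be the string name of the CSV column which contains the SMILES string representation of
#       the molecule.
SMILES_COLUMN_NAME: str = 'SMILES'
# :param TARGET_TYPE:
#       This has to be the string name of the type of dataset that the source file represents. The valid
#       options here are "regression" and "classification"
TARGET_TYPE: str = 'regression'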
# :param TARGET_COLUMN_NAMES:
#       This has to be a list of string column names within the source CSV file, where each name defines 
#       one column that contains a target value for each row. In the regression case, this may be multiple 
#       different regression targets for each element and in the classification case there has to be one 
#       column per class.
TARGET_COLUMN_NAMES: t.List[str] = ['Solubility']
# :param SPLIT_COLUMN_NAMES:
#       The keys of this dictionary are integers which represent the indices of various train test splits. The
#       values are the string names of the columns which define those corresponding splits. It is expected that
#       these CSV columns contain a "1" if that corresponding element is considered as part of the training set
#       of that split and "0" if it is part of the test set.
#       This dictionary may be empty and then no information about splits will be added to the dataset at all.
SPLIT_COLUMN_NAMES: t.Dict[int, str] = {
}

# == DATASET PARAMETERS ==
# These parameters control aspects of the visual graph dataset creation process. This for example includes 
# the dimensions of the graph visualization images to be created or the name of the visual graph dataset 
# that should be given to the dataset folder.

# :param DATASET_CHUNK_SIZE:
#       This number will determine the chunking of the dataset. Dataset chunking will split the dataset
#       elements into multiple sub folders within the main VGD folder. Especially for larger datasets
#       this should increase the efficiency of subsequent IO operations.
#       If this is None then no chunking will be applied at all and everything will be placed into the
#       top level folder.
DATASET_CHUNK_SIZE: t.Optional[int] = None
# :param DATASET_NAME:
#       The name given to the visual graph dataset folder which will be created.
DATASET_NAME: str = 'custom_dataset'

# == EXPERIMENT PARAMETERS ==

experiment = Experiment.extend(
    os.path.join(EXPERIMENTS_PATH, 'generate_molecule_dataset_from_csv.py'),
    base_path=folder_path(__file__),
    namespace=file_namespace(__file__),
    glob=globals(),
)

experiment.run_if_main()

Executing this experiment module will create a new results folder. Within this results folder the dataset will be created as a folder with the name defined in the DATASET_NAME parameter.

There are additional parameters available that can be modified for the processing of a custom dataset. These parameters are described in the base experiment module generate_molecule_dataset_from_csv.py.
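For the SPLIT_COLUMN_NAMES parameter described in the module above, the CSV needs a column holding 1 for training elements and 0 for test elements. A minimal sketch of how such a column could be generated is shown here; the helper `make_split_column` and the column name `split_0` are hypothetical and not part of the library.

```python
import random
import typing as t


def make_split_column(num_elements: int,
                      train_fraction: float = 0.8,
                      seed: int = 0) -> t.List[int]:
    """Return a shuffled list with 1 for training and 0 for test elements."""
    rng = random.Random(seed)
    num_train = int(round(num_elements * train_fraction))
    flags = [1] * num_train + [0] * (num_elements - num_train)
    rng.shuffle(flags)
    return flags


split_flags = make_split_column(10, train_fraction=0.8)

# With a dataframe this could be stored before the CSV export, for example:
#   df['split_0'] = split_flags
# The sub-experiment would then reference the (hypothetical) column name:
SPLIT_COLUMN_NAMES: t.Dict[int, str] = {0: 'split_0'}
```

Using a fixed seed keeps the split reproducible if the CSV has to be regenerated.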

One important detail is that the processing of regression and classification datasets differs slightly. You can view example sub-experiments for each case here:

regression: generate_molecule_dataset_from_csv__aqsoldb.py
classification: generate_molecule_dataset_from_csv__mutagenicity.py
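Since the classification case expects one target column per class, a single label column would have to be expanded first. The sketch below assumes that each class column holds 1 for members of that class and 0 otherwise; that encoding is an assumption based on the parameter description, so the mutagenicity example above remains the reference. The helper `one_hot_columns` and the class names are made up for illustration.

```python
import typing as t


def one_hot_columns(labels: t.Sequence[str]) -> t.Dict[str, t.List[int]]:
    """Map each distinct label to a 0/1 column marking its members."""
    classes = sorted(set(labels))
    return {
        cls: [1 if label == cls else 0 for label in labels]
        for cls in classes
    }


# Hypothetical labels, one per molecule.
labels = ['active', 'inactive', 'active']
columns = one_hot_columns(labels)

# Each key would become one CSV column, and the list of keys would be
# used as the target column names in the sub-experiment.
TARGET_COLUMN_NAMES: t.List[str] = list(columns.keys())
```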

kainszs commented 7 months ago

Hello Jonas, thanks for your quick reply. After filling in my information, I encounter the following error.

Do you have any idea why? In particular, this path does not exist on my system: path="C:\\Users\\kspauszu\\.visual_graph_datasets\\config.yaml

Best, Kai

the16thpythonist commented 7 months ago

Hello Kai,

Yes, the script is trying to access the VGD config file which doesn't exist in your case. You could try to create a config file by running the following command in the shell:

python3 -m visual_graph_datasets.cli config

However, if you supplied a local CSV path the config file should not be necessary at all. The output of the experiment indicates that the CSV file was not found on your local system and that's why the script switched to the fallback option of trying to download the CSV file from the remote file share server.

Are you sure that the path to your CSV file is set correctly?

CSV_FILE_NAME: str = r'C:\absolute\path\to\your\file.csv'
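A quick way to check the path before running the experiment is sketched below. It only uses the standard library; the helper name `check_csv_path` is hypothetical and the check is independent of how the experiment itself resolves the file.

```python
import os
import typing as t


def check_csv_path(path: str) -> t.List[str]:
    """Collect reasons why a path may not be found as a local CSV file."""
    problems = []
    if not os.path.isabs(path):
        problems.append('path is not absolute')
    if not os.path.isfile(path):
        problems.append('no file exists at this path')
    return problems


# An empty list means the path is absolute and points to an existing file.
print(check_csv_path('path/to/your/file.csv'))
```

If the list is not empty, the fallback to the remote file share described above is the likely consequence.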

Best, Jonas

aimat-lab / visual_graph_datasets

Create a custom VSD #1