GT4SD / gt4sd-core

GT4SD, an open-source library to accelerate hypothesis generation in the scientific discovery process.
https://gt4sd.github.io/gt4sd-core/
MIT License

Test data 'qm9.h5' and files could not be found #221

Closed ylyzz21 closed 1 year ago

ylyzz21 commented 1 year ago

Is your feature request related to a problem? Please describe.

Hello, thank you so much for your work on this repo. I have run into the following problem at runtime and hope to get an answer.

When I run examples/gflownet/example_qm9.py to learn how to use gflownet, I found that there is no GFN/qm9.h5 file in this repo, so I cannot tell what format the input data should have, and there is no way to run this example.
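For reference, once an HDF5 file is obtained, its layout can be inspected by walking its groups and datasets with h5py. This is only a sketch: the toy file it builds (with hypothetical `smiles`/`qed` columns) is a guess for illustration, not the actual qm9.h5 schema.

```python
import h5py
import numpy as np


def describe_h5(path):
    """Recursively list every dataset in an HDF5 file with its shape and dtype."""
    entries = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            entries.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return entries


if __name__ == "__main__":
    # Build a toy file; the 'smiles'/'qed' layout is a guess, NOT the real qm9.h5 schema.
    with h5py.File("toy_qm9.h5", "w") as f:
        f.create_dataset("smiles", data=np.array([b"C", b"CC", b"CCO"]))
        f.create_dataset("qed", data=np.array([0.35, 0.40, 0.41]))
    for name, shape, dtype in describe_h5("toy_qm9.h5"):
        print(f"{name}: shape={shape}, dtype={dtype}")
```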

Something similar happens when running RT: I cannot find the tokenizer referenced in the sample command, so I think I need an example of this to learn how to create my own tokenizer.

```shell
gt4sd-trainer --training_pipeline_name regression-transformer-trainer \
  --tokenizer_name ~/.gt4sd/algorithms/conditional_generation/RegressionTransformer/RegressionTransformerMolecules/qed \
  --config_name examples/regression_transformer/rt_config.json \
  --do_train \
  --output_dir my_regression_transformer \
  --train_data_path src/gt4sd/training_pipelines/tests/regression_transformer_raw.csv \
  --test_data_path src/gt4sd/training_pipelines/tests/regression_transformer_raw.csv \
  --overwrite_output_dir \
  --eval_steps 200 \
  --augment 5 \
  --eval_accumulation_steps 1
```

Describe the solution you'd like

I would really like to see a data directory under the gt4sd-core folder containing all the data files mentioned in the sample code, especially qm9.h5 and the ~/.gt4sd/algorithms/conditional_generation/RegressionTransformer/RegressionTransformerMolecules/qed tokenizer.

Describe alternatives you've considered

I thought about reverse-engineering the data hierarchy from the h5 file handling described in dataset.py, but this seems too cumbersome. So I still hope the authors can provide the corresponding files directly, saving us many detours.

Additional context

Thanks again.

jannisborn commented 1 year ago

Hi @ylyzz21, thanks for your interest in the repo. In general, GT4SD uses cloud object storage to store artifacts such as pretrained models or tokenizers, similar to HuggingFace. Once you execute a certain model, a download is triggered and the artifacts are synced to your local cache, which defaults to ~/.gt4sd.
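As a sketch of what that cache looks like on disk, the stdlib-only snippet below simply walks the default cache path and lists cached artifact files. It assumes nothing GT4SD-specific beyond the ~/.gt4sd location mentioned above; whatever it prints depends on which models you have already run.

```python
from pathlib import Path


def list_cached_artifacts(cache_root: Path):
    """Return all files under the given cache directory, relative to it."""
    if not cache_root.is_dir():
        return []
    return sorted(
        str(p.relative_to(cache_root))
        for p in cache_root.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # Default local cache; empty until a first model run triggers a download.
    cache = Path.home() / ".gt4sd"
    for artifact in list_cached_artifacts(cache):
        print(artifact)
```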

Regarding your specific problems:

jannisborn commented 1 year ago

Closing the issue, @ylyzz21; let us know if you need further information.