GT4SD / gt4sd-core

GT4SD, an open-source library to accelerate hypothesis generation in the scientific discovery process.
https://gt4sd.github.io/gt4sd-core/
MIT License

Training on custom dataset fails #231

Closed SabariKumar closed 10 months ago

SabariKumar commented 10 months ago

Fine-tuning the Regression Transformer fails with a tensor shape mismatch.

To Reproduce

  1. Install the GPU version of gt4sd-core:
    git clone https://github.com/GT4SD/gt4sd-core.git
    cd gt4sd-core/
    conda env create -f conda_gpu.yml
    conda activate gt4sd
    pip install gt4sd
  2. Cache the QED model:
    from gt4sd.algorithms.registry import ApplicationsRegistry

    # Instantiate the "qed" version of the Regression Transformer so that
    # its files are cached locally (the path under ~/.gt4sd used in step 3).
    algorithm = ApplicationsRegistry.get_application_instance(
        target='CCO',
        sampling_wrapper={'property_goal': {'<qed>': 0.12}},
        algorithm_type='conditional_generation',
        domain='materials',
        algorithm_name='RegressionTransformer',
        algorithm_application='RegressionTransformerMolecules',
        algorithm_version='qed',
    )
  3. Run fine-tuning per https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer:
    gt4sd-trainer --training_pipeline_name regression-transformer-trainer \
        --model_path ~/.gt4sd/algorithms/conditional_generation/RegressionTransformer/RegressionTransformerMolecules/qed \
        --do_train \
        --output_dir /home/sabari/PhotoChem/VerdeDB/regression_transformer \
        --train_data_path /home/sabari/rt_test/train.csv \
        --test_data_path /home/sabari/rt_test/test.csv \
        --overwrite_output_dir \
        --eval_steps 200 \
        --augment 1 \
        --eval_accumulation_steps 1 \
        --num_train_epochs 100

Expected behavior

The training script completes successfully.

Screenshots

Error stack trace:

/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
ERROR:gt4sd.training_pipelines.regression_transformer.implementation:Exception occurred while running RegressionTransformerTrainingPipeline.
Traceback (most recent call last):
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/gt4sd/training_pipelines/regression_transformer/implementation.py", line 181, in train
    trainer.train(model_path=params["output_dir"])
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/terminator/trainer.py", line 1088, in train
    self.property_evaluate()
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/terminator/trainer.py", line 1185, in property_evaluate
    ps, rs, ss = evaluator.property_prediction(
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/terminator/evaluator.py", line 178, in property_prediction
    logits, label_ids, metrics, input_ids = self.prediction_loop(
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/terminator/trainer.py", line 656, in prediction_loop
    loss, logits, labels = self.prediction_step(
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/terminator/trainer.py", line 443, in prediction_step
    outputs = self.feed_model(model, inputs)
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/terminator/trainer.py", line 322, in feed_model
    outputs = model(**model_inputs)
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.gather(outputs, self.output_device)
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in gather
    res = gather_map(outputs)
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 69, in gather_map
    return type(out)((k, gather_map([d[k] for d in outputs]))
  File "<string>", line 8, in __init__
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/transformers/utils/generic.py", line 230, in __post_init__
    for element in iterator:
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 69, in <genexpr>
    return type(out)((k, gather_map([d[k] for d in outputs]))
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 73, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 75, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/sabari/.conda/envs/gt4sd/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 235, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 1 has invalid shape [146, 6, 256], but expected [146, 7, 256]
Predicting <prop0>: 100%|███████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  5.07it/s]
Epoch:   6%|████▉                                                                             | 6/100 [01:27<22:44, 14.52s/it]
Iteration:  22%|█████████████████▎                                                             | 7/32 [00:04<00:17,  1.43it/s]


Additional context

Hello, I'm trying to fine-tune the QED Regression Transformer model on a custom dataset. My train/test CSVs consist of a single "text" column containing SMILES strings and a single property column, "prop0". Training fails with a tensor shape mismatch in the second dimension (i.e., index 1), regardless of the data augmentation value.
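
To make the error message concrete, here is a minimal pure-Python sketch of the shape rule that a concatenating gather along one dimension enforces: every other dimension must match across the inputs. It does not call PyTorch, and `gathered_shape` is a hypothetical helper written only for illustration. The two shapes are the ones from the traceback; whether dimension 1 is actually the batch axis of the offending tensor is an assumption that this thread does not confirm.

```python
from typing import List, Sequence


def gathered_shape(shapes: Sequence[Sequence[int]], dim: int = 0) -> List[int]:
    """Shape obtained by concatenating tensors with these shapes along `dim`.

    Hypothetical helper for illustration; it only mimics the shape check.
    """
    if not shapes:
        raise ValueError("need at least one shape")
    expected = list(shapes[0])
    total = expected[dim]
    for index, shape in enumerate(shapes[1:], start=1):
        shape = list(shape)
        if len(shape) != len(expected):
            raise ValueError(f"Input tensor at index {index} has a different rank")
        # Every dimension except `dim` must agree with the first input.
        wanted = expected[:dim] + [shape[dim]] + expected[dim + 1:]
        if shape != wanted:
            raise ValueError(
                f"Input tensor at index {index} has invalid shape {shape}, "
                f"but expected {wanted}"
            )
        total += shape[dim]
    result = list(expected)
    result[dim] = total
    return result


# Shapes taken from the traceback: the two inputs differ only in dimension 1.
replica_shapes = [[146, 7, 256], [146, 6, 256]]

# Concatenating along dimension 1 is consistent (7 + 6 = 13).
print(gathered_shape(replica_shapes, dim=1))

# Concatenating along dimension 0 is rejected, as in the RuntimeError.
try:
    gathered_shape(replica_shapes, dim=0)
except ValueError as error:
    print(error)
```

The traceback shows that the gather is issued from `torch/nn/parallel/data_parallel.py` with `dim=self.dim`, so the model is running under `DataParallel`. If the mismatch does come from a batch being split unevenly across GPUs, one thing to try (untested here) is exposing a single GPU through the `CUDA_VISIBLE_DEVICES` environment variable so that no gather is needed.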

jannisborn commented 10 months ago

Hi @SabariKumar, thanks for reporting. In the same environment, could you please check whether you can successfully run the training example described here: https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer#finetuning

That example replaces the train/test paths with the test files shipped inside the gt4sd directory. If that training run succeeds, I would suspect that your error is caused by the formatting of your data.

jannisborn commented 10 months ago

Also, the first line of the error is:

site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector. 

Maybe you have a single-token molecule such as C? It's also a bit suspicious that training completes 6 of the 100 epochs before the error occurs.
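
One way to check for such molecules is to scan the "text" column for empty or very short entries. The sketch below uses only the standard library. `find_short_smiles` is a hypothetical helper, character length is used as a rough stand-in for token count (the actual tokenizer is not applied), and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple


def find_short_smiles(
    rows: Iterable[Dict[str, str]], column: str = "text", min_chars: int = 2
) -> List[Tuple[int, str]]:
    """Return (data_row_number, value) for rows whose SMILES is missing or short."""
    flagged = []
    for number, row in enumerate(rows, start=1):
        value = (row.get(column) or "").strip()
        if len(value) < min_chars:
            flagged.append((number, value))
    return flagged


# Tiny in-memory stand-in for train.csv; the values are illustrative only.
sample = io.StringIO("text,prop0\nCCO,0.41\nC,0.36\n,0.10\nc1ccccc1,0.44\n")
flagged = find_short_smiles(csv.DictReader(sample))
print(flagged)  # [(2, 'C'), (3, '')]
```

To run it on the real files, open each CSV with `open(path, newline="")` and pass `csv.DictReader(handle)` to the function. Row numbers count data rows, so the header line is not included.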

jannisborn commented 10 months ago

Closing due to inactivity; feel free to comment if the issue persists.