Closed JasonJiangs closed 1 year ago
Hi,
Thank you for your interest in our model. When I replicated your steps with a clean installation, I didn't face any difficulties. The warning you encountered is not an error, and the script works as intended. The long runtime you experienced is likely due to the large volume of embeddings being generated. Right now, the model runs on CPU (or multiple CPUs, depending on your choice), which is why it is slow. We will be adding GPU support in the coming days. I have enabled an argument in pandarallel so that you can monitor progress through a progress bar. You can clone the repo again to get the updated code.
Should you have any further queries or suggestions, please do not hesitate to let me know. Your feedback is greatly valued.
Best, Atabey
Thank you!
Hi,
I would like to generate embeddings from the pretrained model. I followed the instructions in the README file, but the script reports that some weights were not initialized and then gets stuck at this stage. How do I solve this problem?
Thanks!
MacBook-Pro SELFormer % python3 produce_embeddings.py --selfies_dataset=data/molecule_dataset_selfies.csv --model_file=data/pretrained_models/modelO --embed_file=data/embeddings.csv
Some weights of RobertaModel were not initialized from the model checkpoint at ./data/pretrained_models/modelM and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Starting
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
We strongly recommend passing in an attention_mask since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
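For context on that last warning: when inputs are padded to a common length, the attention mask is what keeps the padding positions from polluting a pooled embedding. A toy, framework-free sketch of masked mean pooling — the token vectors and mask values below are made up for illustration, not real model outputs:

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the positions where attention_mask is 1,
    so padded positions contribute nothing to the pooled vector."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                totals[i] += v
    return [t / count for t in totals]

# Two real tokens followed by one padding token.
vectors = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]

print(masked_mean(vectors, mask))  # -> [2.0, 4.0]; the padding row is ignored
```

Averaging all three rows instead would drag the result toward the padding vector, which is exactly the "incorrect output when padding tokens aren't masked" issue the warning links to.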