Closed JasonJiangs closed 1 year ago
Hi,
Thank you for your interest in our model. When I replicated your steps with a clean installation, I didn't face any difficulties. The warning you encountered is not an error, and the script works as intended. The long runtime you experienced is likely due to the large volume of embeddings being generated. Right now, the model runs on CPU (or multiple CPUs, depending on your choice), which is why it is slow. We will be adding GPU support in the coming days. I have enabled an argument in pandarallel so that you can monitor progress through a progress bar. You can clone the repo again to get the updated code.
Should you have any further queries or suggestions, please do not hesitate to let me know. Your feedback is greatly valued.
Best, Atabey
Thank you!
Hi,
I would like to generate embeddings from the pretrained model. I followed the instructions in the README file, but the script reports that some weights were not initialized and then gets stuck at this stage. How do I solve this problem?
Thanks!
MacBook-Pro SELFormer % python3 produce_embeddings.py --selfies_dataset=data/molecule_dataset_selfies.csv --model_file=data/pretrained_models/modelO --embed_file=data/embeddings.csv
Some weights of RobertaModel were not initialized from the model checkpoint at ./data/pretrained_models/modelM and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Starting
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
We strongly recommend passing in an attention_mask since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
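For context on that last warning: when inputs are padded to a common length, the attention mask is what keeps the padding positions from polluting a pooled embedding. A toy, framework-free sketch of masked mean pooling — the token vectors and mask values below are made up for illustration, not real model outputs:

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the positions where attention_mask is 1,
    so padded positions contribute nothing to the pooled vector."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                totals[i] += v
    return [t / count for t in totals]

# Two real tokens followed by one padding token.
vectors = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]

print(masked_mean(vectors, mask))  # -> [2.0, 4.0]; the padding row is ignored
```

Averaging all three rows instead would drag the result toward the padding vector, which is exactly the "incorrect output when padding tokens aren't masked" issue the warning links to.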