LGH1gh / PromptProtein

UPDATE: All future changes will be pushed to https://github.com/HICAI-ZJU/PromptProtein
MIT License

Some confusion about the training dataset #1

Open ZwormZ opened 1 year ago

ZwormZ commented 1 year ago

Hi, thanks for sharing this great work. During the pre-training stage, as outlined in the paper, three tasks share parameters. The first task, MLM, is unsupervised and only requires the protein primary structure, i.e., the sequence, whereas the other two tasks require structural labels. Could you explain how the datasets are organized and what sampling method is used during training?

LGH1gh commented 1 year ago

Thank you for your interest in our work. During pre-training, we constructed three datasets, one for each task, and in each batch we sampled evenly from all three.
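
For illustration, a minimal sketch of what even per-batch sampling across three differently sized task datasets could look like. This is not the repo's code; the dataset names and sizes below are placeholders:

```python
# Minimal sketch (not the authors' implementation): even per-batch
# sampling across three task datasets of different sizes.
import random

# Hypothetical datasets -- stand-ins for the unlabeled MLM corpus
# and the two structure-labeled task datasets.
datasets = {
    "mlm": list(range(100_000)),
    "task2": list(range(10_000)),
    "task3": list(range(5_000)),
}

def sample_batch(batch_size=48):
    """Draw an equal share of the batch from each task dataset."""
    per_task = batch_size // len(datasets)
    batch = []
    for task, data in datasets.items():
        # Sampling with replacement, so smaller datasets simply
        # repeat more often than the large MLM corpus.
        examples = random.choices(data, k=per_task)
        batch.extend((task, x) for x in examples)
    return batch
```
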

ZwormZ commented 1 year ago

> Thank you for your interest in our work. During pre-training, we constructed three datasets, one for each task, and in each batch we sampled evenly from all three.

Thank you for your reply. I still have some questions. Is each dataset the same size? And if each batch contains data from all three tasks, how is the prompt token's type determined so that the model learns each task's data separately during training? Thanks!

LGH1gh commented 1 year ago

The three datasets are different sizes. Within a batch, each example's prompt token is determined by the pre-training task it was sampled from.
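
For illustration, a minimal sketch of how a task-specific prompt token could be prepended to each example so the shared model knows which objective to apply. The token ids and task names here are assumptions, not the repo's actual vocabulary:

```python
# Minimal sketch (hypothetical special-token ids): the prompt token
# prepended to each example is chosen by the task it was sampled from.
PROMPT_IDS = {"mlm": 0, "task2": 1, "task3": 2}

def add_prompt(task, token_ids):
    """Prepend the task's prompt token so examples from different
    tasks can share one model within the same batch."""
    return [PROMPT_IDS[task]] + list(token_ids)

# An MLM example and a structure-task example in the same batch:
print(add_prompt("mlm", [5, 8, 13]))    # [0, 5, 8, 13]
print(add_prompt("task2", [5, 8, 13]))  # [1, 5, 8, 13]
```
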