ChangwenXu98 / TransPolymer

Implementation of "TransPolymer: a Transformer-based language model for polymer property predictions" in PyTorch
MIT License
53 stars, 19 forks

Regarding Egc Dataset #7

Closed by liberty-1776 1 year ago

liberty-1776 commented 1 year ago

Hi there! The Egc dataset you have provided seems to differ from the other related datasets (e.g. Egb, Ei, Xc). The SMILES in the Egc dataset have square brackets '[]' around the * symbol, whereas in the other datasets the symbol is not enclosed in brackets. One more thing: was the result you reported in the paper for Egc obtained with this same Egc dataset, or with one without square brackets around the symbol? If it was this one, I was wondering why you used square brackets only for the Egc dataset, since the RoBERTa model is trained on the PI1M dataset, which does not contain square brackets around the * symbol in SMILES.

ChangwenXu98 commented 1 year ago

Hi, thanks for raising the issue. I checked the data file Egc.csv and there are no '[]' around '*'. If you mean the tokenization results, which would be similar to your previous issue, I printed out the tokens and do not see '[]' in them either. Let me know if you still have any questions.
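
For anyone who wants to repeat the data-file check described above, here is a minimal sketch. It is not code from the repository: the helper name `count_bracketed_wildcards`, the column names `smiles` and `value`, and the in-memory sample rows are all assumptions made for illustration. It counts how many SMILES strings contain a bracketed wildcard `[*]` and how many contain a bare `*`.

```python
import csv
import io


def count_bracketed_wildcards(smiles_list):
    """Return (n_bracketed, n_bare) for a list of SMILES strings.

    n_bracketed: strings containing at least one '[*]'
    n_bare:      strings containing at least one '*' outside of '[*]'
    """
    n_bracketed = 0
    n_bare = 0
    for s in smiles_list:
        if "[*]" in s:
            n_bracketed += 1
        # Drop bracketed wildcards first so only bare '*' can remain.
        if "*" in s.replace("[*]", ""):
            n_bare += 1
    return n_bracketed, n_bare


# Hypothetical sample standing in for a file such as Egc.csv.
# The header ("smiles", "value") and the rows are assumptions, not real data.
sample_csv = "smiles,value\n*CC*,1.0\n[*]CC[*],2.0\n*c1ccc(*)cc1,3.0\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
smiles = [row["smiles"] for row in rows]
n_bracketed, n_bare = count_bracketed_wildcards(smiles)
print(n_bracketed, n_bare)
```

To run it against a real file, replace `io.StringIO(sample_csv)` with an opened file handle and adjust the column name to match the file's actual header. A bracketed count of zero would agree with the observation that the file has no '[]' around '*'.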

liberty-1776 commented 1 year ago

OK, but when I looked earlier, I am pretty sure the square brackets were there. I might have misread the data! Thanks.