about molecule data

ZJL0111 commented 4 months ago

Hi, thanks for your work; it has helped me a lot with my molecule SFT task. I have a question about the molecule-caption data. In the paper, you say your molecule data come from mol_instructions and ChEBI-20. However, apart from the difference between SMILES and SELFIES, I also found that your molecule dataset has no task description part. For example, in mol_instructions:

{ "instruction": "Create a molecule with the structure as the one described.", "input": "The molecule is a natural product found in Picea abies, Citrus unshiu, and other organisms with data available.", "output": "[C][C@H1][C@@H1][Branch2][#Branch1][Ring2][C@H1][Branch2][=Branch1][#C][C@H1][Branch2][=Branch1][#Branch2][C@@H1][Branch1][Ring2][O][Ring1][=Branch1][O][C][C@@H1][C@H1][Branch2][Branch1][N][C@@H1][Branch2][Branch1][#Branch1][C@H1][Branch2][Branch1][C][C@@H1][Branch1][Ring2][O][Ring1][=Branch1][O][C][=C][Branch2][Ring1][Ring2][O][C][=C][C][=Branch1][=N][=C][C][=Branch1][Branch2][=C][Ring1][=Branch1][C][Ring1][#Branch2][=O][O][O][C][=C][C][=Branch1][=N][=C][Branch1][=Branch2][C][=Branch1][Ring2][=C][Ring1][=Branch1][O][C][O][O][O][O][O][O][O][O]", "metadata": { "task": "description-guided molecule design", "split": "train" } }

while in the SMolInstruct dataset, I get:

{"input": "The molecule is a long-chain fatty acid that is henicosane in which one of the methyl groups has been oxidised to give the corresponding carboxylic acid. It is a straight-chain saturated fatty acid and a long-chain fatty acid. It is a conjugate acid of a henicosanoate.", "output": "CCCCCCCCCCCCCCCCCCCCC(=O)O", "task": "molecule_generation", "split": "test"}

So in your actual training, do you use an 'instruction' part like mol_instructions does, or not? If not, why does the demonstration in the usage section contain:

Give me a molecule that satisfies the conditions outlined in the description: Describe this molecule:

Looking forward to your reply

btyu commented 4 months ago

If I understand correctly, you are asking why there is no instruction part in the data. We do have one, but it is stored in a separate file. If you load the dataset with the Hugging Face datasets library (see https://huggingface.co/datasets/osunlp/SMolInstruct), the instruction will be contained in the input of every sample.

Please feel free to reach out if anything is not clear enough. Thanks.
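
To illustrate the idea of instructions living apart from the raw samples, here is a minimal sketch of how a template could be merged into a sample's input. It is not the repository's loading code: the `TEMPLATES` table, the `{description}` placeholder, and the `build_input` helper are hypothetical names made up for this example. Only the template wording (from the usage text quoted above) and the sample record (from the first comment) come from this thread.

```python
from typing import Dict

# Hypothetical template table keyed by task name. The wording is the prompt
# quoted from the usage section; the "{description}" slot is an assumption.
TEMPLATES: Dict[str, str] = {
    "molecule_generation": (
        "Give me a molecule that satisfies the conditions outlined in the "
        "description: {description}"
    ),
}


def build_input(sample: Dict[str, str]) -> str:
    """Return the sample's raw input wrapped in the template for its task."""
    template = TEMPLATES[sample["task"]]
    return template.format(description=sample["input"])


# Raw record as quoted in the first comment (no instruction text inside it).
raw_sample = {
    "input": (
        "The molecule is a long-chain fatty acid that is henicosane in which "
        "one of the methyl groups has been oxidised to give the corresponding "
        "carboxylic acid. It is a straight-chain saturated fatty acid and a "
        "long-chain fatty acid. It is a conjugate acid of a henicosanoate."
    ),
    "output": "CCCCCCCCCCCCCCCCCCCCC(=O)O",
    "task": "molecule_generation",
    "split": "test",
}

# The model-facing input now carries the instruction plus the description.
print(build_input(raw_sample))
```

Under these assumptions, the raw file stays instruction-free, while anything that goes through the loading step sees the instruction as part of `input`.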

btyu commented 3 months ago

Closing this issue due to no further update. Please feel free to reopen it if needed :)

ZJL0111 commented 2 months ago

Thanks!

OSU-NLP-Group / LLM4Chem

about molecule data #4