allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Preparing training data for a domain with many multi keyword token #240

Open sathiyabalu89 opened 3 years ago

sathiyabalu89 commented 3 years ago

How do I prepare the training data if my domain, such as chemistry, has many multi-word tokens? For example:

1. Original Sentences: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. \n This is another sentence."

Here "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide" is a single token, but it contains several whitespace-separated words. Splitting on whitespace would therefore break it into 3 tokens: ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl', 'tetrazolium', 'bromide'].

How can I avoid this? Can I give the input training data in the following format to avoid this?

Training data (1): A list of tokens for each sentence, so the training text file would contain a list of token lists.

[['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide'], ['This', 'is', 'another', 'sentence.']]

Training data (2): Here I have joined the words of each multi-word token with the '|' symbol: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl|tetrazolium|bromide. \n This is another sentence." I would then tweak the ELMo code to handle the '|' symbol and keep each joined string as a single token.
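For illustration, here is a minimal sketch of how option (2) could be produced from the token lists in option (1). It assumes the training files are plain text with one sentence per line and tokens separated by whitespace, and that the joiner character never occurs in the corpus. The names `JOINER`, `join_multiword`, `sentences_to_lines` and `restore` are hypothetical helpers, not part of bilm-tf.

```python
# Sketch only: none of these helpers come from bilm-tf.
JOINER = "|"  # assumed not to appear anywhere else in the corpus


def join_multiword(token, joiner=JOINER):
    """Replace internal whitespace so the token survives str.split()."""
    return joiner.join(token.split())


def sentences_to_lines(sentences, joiner=JOINER):
    """Turn a list of token lists into whitespace-tokenized lines."""
    return [
        " ".join(join_multiword(tok, joiner) for tok in tokens)
        for tokens in sentences
    ]


def restore(token, joiner=JOINER):
    """Map a joined token back to its original surface form."""
    return token.replace(joiner, " ")


sentences = [
    ["This", "is", "a", "multi", "word", "chemical", "component",
     "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide", "."],
    ["This", "is", "another", "sentence", "."],
]

lines = sentences_to_lines(sentences)
for line in lines:
    print(line)
```

With this preprocessing, splitting each line on whitespace gives back one token per original list entry, so no change to the tokenization code should be needed. Any vocabulary file built from the corpus would have to contain the joined form of each token, and it is worth checking whether very long joined tokens exceed the model's configured maximum number of characters per token.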

Please advise on the best way to prepare the training data.

gohjiayi commented 3 years ago

Although this is late, I have posted an answer to the same question on StackOverflow. I hope it helps future developers.