How do I prepare the training data if I have many multi-word tokens in a domain like chemistry? For example:
Original sentences: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. \n This is another sentence."
Here "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide" should be a single token, but it contains several words separated by white space. Splitting on white space would therefore break it into 3 tokens: ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl', 'tetrazolium', 'bromide'].
How can I avoid this? Can I supply the training data in one of the following formats?
Training data (1): A list of tokens for each sentence, so the training text file contains a list of token lists.
[['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide'], ['This', 'is', 'another', 'sentence.']]
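For format (1), here is a minimal sketch of how I could build the token lists, assuming I already have a list of the known multi-word names. `MULTI_WORD_TERMS` and `tokenize` are hypothetical names of my own, not part of ELMo. Note that punctuation directly after a protected name ends up as its own token.

```python
import re

# Hypothetical list of known multi-word chemical names (assumption for illustration).
MULTI_WORD_TERMS = [
    "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide",
]


def tokenize(sentence, terms=MULTI_WORD_TERMS):
    """Split on white space, but keep each known multi-word term as one token."""
    if not terms:
        return sentence.split()
    # Longest terms first, so a longer name wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    tokens = []
    # With one capturing group, re.split alternates: other text, term, other text, ...
    for i, piece in enumerate(pattern.split(sentence)):
        if i % 2 == 1:
            tokens.append(piece)          # a protected multi-word term
        else:
            tokens.extend(piece.split())  # ordinary white-space tokens
    return tokens


text = ("This is a multi word chemical component "
        "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. \n"
        " This is another sentence.")
token_lists = [tokenize(line) for line in text.split("\n")]
```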
Training data (2): Here I have joined the words of each multi-word token with the '|' symbol.
"This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl|tetrazolium|bromide. \n This is another sentence."
I would then tweak the ELMo code to handle the '|' symbol and keep each such sequence as a single token.
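For format (2), this is a rough sketch of the preprocessing I have in mind, assuming '|' never occurs anywhere else in the corpus. `join_terms` and `restore_token` are hypothetical helpers of my own, and the change inside the ELMo code itself is not shown.

```python
SEP = "|"  # assumed not to appear anywhere else in the corpus


def join_terms(text, terms):
    """Replace the spaces inside each known multi-word term with SEP."""
    # Longest terms first, so a longer name is joined before any shorter one it contains.
    for term in sorted(terms, key=len, reverse=True):
        text = text.replace(term, term.replace(" ", SEP))
    return text


def restore_token(token):
    """Undo the joining for a single token (e.g. when reading tokens back)."""
    return token.replace(SEP, " ")


terms = ["3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide"]
sentence = ("This is a multi word chemical component "
            "3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide.")
joined = join_terms(sentence, terms)
tokens = joined.split()  # plain white-space split now keeps the name together
```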
Please advise on the best way to prepare the training data.