Closed: sathiyabalu89 closed this issue 3 years ago
Try ChemicalTagger:
<Document>
<Sentence>
<VerbPhrase>
<DT>This</DT>
<VBZ>is</VBZ>
</VerbPhrase>
<NounPhrase>
<DT>a</DT>
<JJ>multi</JJ>
<JJ-CHEM>word</JJ-CHEM>
<NN>chemical</NN>
<NN>component</NN>
<MOLECULE>
<OSCARCM>
<OSCAR-CM>3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl</OSCAR-CM>
<OSCAR-CM>tetrazolium</OSCAR-CM>
<OSCAR-CM>bromide</OSCAR-CM>
</OSCARCM>
</MOLECULE>
</NounPhrase>
<STOP>.</STOP>
</Sentence>
<Sentence>
<VerbPhrase>
<DT>This</DT>
<VBZ>is</VBZ>
</VerbPhrase>
<NounPhrase>
<DT>another</DT>
<NN>sentence</NN>
</NounPhrase>
</Sentence>
</Document>
As for Python, I don't know the answer to that...
Thank you for the reference.
How do I identify multi word tokens in the XML output? Will it always come under
The multi-word tokens of a chemical compound will probably be the child elements of <OSCARCM>
(OSCAR chemical compound). See the OSCAR-Tagger section here
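One way to act on this from Python is to post-process the XML with the standard library's `xml.etree.ElementTree`. The following is only a minimal sketch: it assumes the ChemicalTagger output is available as a string with the same nesting as the XML above, and the function name `compound_names` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML above (assumption: real output nests the same way).
XML = """<Document><Sentence><NounPhrase>
<NN>component</NN>
<MOLECULE><OSCARCM>
<OSCAR-CM>3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl</OSCAR-CM>
<OSCAR-CM>tetrazolium</OSCAR-CM>
<OSCAR-CM>bromide</OSCAR-CM>
</OSCARCM></MOLECULE>
</NounPhrase></Sentence></Document>"""


def compound_names(xml_text):
    """Return one string per <OSCARCM>, joining its <OSCAR-CM> children."""
    root = ET.fromstring(xml_text)
    names = []
    for cm in root.iter("OSCARCM"):
        words = [c.text.strip() for c in cm if c.tag == "OSCAR-CM" and c.text]
        if words:
            names.append(" ".join(words))
    return names


print(compound_names(XML))
```

This only finds names that OSCAR has already grouped under `<OSCARCM>`; it does not recognise anything new.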
Thank you
Thanks, everyone, for posting and replying. I am also currently using Python as well as Java (Python has easier support for data display and analysis). I think there could be value in converting the Java XML output to a form that Python can consume, e.g. to show frequencies in matplotlib. If you do this, Sathiabalu, it would be interesting to know.
-- "I always retain copyright in my papers, and nothing in any contract I sign with any publisher will override that fact. You should do the same".
Peter Murray-Rust Reader Emeritus in Molecular Informatics Yusuf Hamied Department of Chemistry University of Cambridge CB2 1EW, UK +44-1223-336432
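As a rough illustration of the idea of consuming the Java XML output in Python, the sketch below counts how often each word-level tag occurs and, optionally, plots the counts. It assumes the XML has the shape shown earlier in this thread; `tag_frequencies` and `plot_frequencies` are hypothetical helper names, and the plot needs matplotlib to be installed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Second sentence of the XML shown earlier in the thread.
XML = """<Document><Sentence>
<VerbPhrase><DT>This</DT><VBZ>is</VBZ></VerbPhrase>
<NounPhrase><DT>another</DT><NN>sentence</NN></NounPhrase>
</Sentence></Document>"""


def tag_frequencies(xml_text):
    """Count leaf elements (the ones that carry a word) by tag name."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter() if len(el) == 0)


def plot_frequencies(freqs):
    """Bar chart of tag counts; matplotlib is imported lazily so that
    counting still works when it is not installed."""
    import matplotlib.pyplot as plt

    tags, counts = zip(*freqs.most_common())
    plt.bar(tags, counts)
    plt.ylabel("count")
    plt.show()


freqs = tag_frequencies(XML)
print(freqs)
# plot_frequencies(freqs)  # uncomment if matplotlib is available
```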
Hi, I am trying various methods to integrate ChemicalTagger with Python. Once I find a working method, I will post it here.
I was using the OSCAR-CM tags in the output XML to retrieve multi-word chemical compounds. However, the logic fails in a few examples. For example: "Background to the invention disclose amides of 3,5-diamino-6-halo-pyrazine-2-carboxylic acid of related structure showing ENac ( Epithelial Sodium Channel ) inhibitor activity ."
Here the multi-word token "3,5-diamino-6-halo-pyrazine-2-carboxylic acid" comes under multiple consecutive OSCAR-CM tags, and I am able to club them into a single token. However, 'Epithelial Sodium Channel' is also a multi-word token, and it comes under some other tag (mixture). Please let us know a way to retrieve such multi-word tokens. Thank you.
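For reference, the "clubbing" of consecutive OSCAR-CM tokens described here could look roughly like the sketch below, which uses `itertools.groupby`. It assumes the XML has already been flattened into `(tag, text)` pairs in document order. The pairs are hypothetical, not actual ChemicalTagger output for this sentence, and every tag other than OSCAR-CM is a placeholder.

```python
from itertools import groupby

# Hypothetical (tag, text) pairs; only the OSCAR-CM tags follow the description above.
TOKENS = [
    ("NNS", "amides"),
    ("IN", "of"),
    ("OSCAR-CM", "3,5-diamino-6-halo-pyrazine-2-carboxylic"),
    ("OSCAR-CM", "acid"),
    ("IN", "of"),
    ("JJ", "related"),
    ("NN", "structure"),
]


def merge_runs(tokens, tag="OSCAR-CM"):
    """Join each run of adjacent tokens carrying `tag` into a single token."""
    merged = []
    for is_chem, run in groupby(tokens, key=lambda pair: pair[0] == tag):
        texts = [text for _, text in run]
        if is_chem:
            merged.append(" ".join(texts))  # one token per run
        else:
            merged.extend(texts)  # other tokens pass through unchanged
    return merged


print(merge_runs(TOKENS))
```

One limitation of merging by adjacency: two different compounds that happen to sit next to each other would be fused into one token.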
In reply to sathiyabalu89's comment of Sat, Dec 26, 2020 at 4:30 PM:
OSCAR (and Opsin) have been trained on mainstream chemical content, where "Sodium" is either a metal or an ion in a multiword token ("Sodium chloride").
"Sodium channel" is not a chemical compound.
Note that there are many ambiguities in parsing English:
"Time flies like an arrow"
("Time" == NN), ("flies" == VB)
has completely different semantics from
"Fruit flies like a banana"
("Fruit flies" == NNP)
Unless there is a semantic vocabulary with "fruit fly", no parser will get this right.
Similarly we need a vocabulary with
[
P.
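The vocabulary idea can be sketched as a greedy longest-match over a token list. This is an illustration only: the `VOCAB` entries and the function name `merge_known_terms` are assumptions, and it is not how OSCAR or ChemicalTagger work internally.

```python
# Hypothetical vocabulary of multi-word terms, stored as lowercase token tuples.
VOCAB = {
    ("epithelial", "sodium", "channel"),
    ("fruit", "flies"),
}
MAX_LEN = max(len(term) for term in VOCAB)


def merge_known_terms(tokens):
    """Greedy longest-match: merge runs of tokens that appear in VOCAB."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in VOCAB:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No vocabulary entry starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out


text = "ENac ( Epithelial Sodium Channel ) inhibitor activity ."
print(merge_known_terms(text.split()))
```

With this vocabulary, "Fruit flies like a banana" yields "Fruit flies" as one token, while "Time flies like an arrow" is left alone. The result is only as good as the vocabulary behind it.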
Understood and thank you for the prompt response.
First of all, thank you for making this open source. Is there any sample code that demonstrates sentence boundary detection and word tokenization?
For example, input paragraph: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. This is another sentence."
1. Sentence boundary detection output : ['This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide.', 'This is another sentence.']
2. Word tokenization output : [['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide'], ['This', 'is', 'another', 'sentence.']]
Basically, I am trying to do sentence boundary detection and tokenization for chemical documents in Python. How can I integrate this into a Python platform?
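For anyone landing on this question: ChemicalTagger itself runs in Java, but its XML output can be turned into sentence and token lists on the Python side. The sketch below assumes XML shaped like the ChemicalTagger output posted in this thread and treats each `<OSCARCM>` element as a single token. The helper names `tokens_of` and `sentences_and_tokens` are hypothetical.

```python
import xml.etree.ElementTree as ET

XML = """<Document>
<Sentence>
<VerbPhrase><DT>This</DT><VBZ>is</VBZ></VerbPhrase>
<NounPhrase><DT>a</DT><JJ>multi</JJ><JJ-CHEM>word</JJ-CHEM>
<NN>chemical</NN><NN>component</NN>
<MOLECULE><OSCARCM>
<OSCAR-CM>3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl</OSCAR-CM>
<OSCAR-CM>tetrazolium</OSCAR-CM>
<OSCAR-CM>bromide</OSCAR-CM>
</OSCARCM></MOLECULE></NounPhrase>
<STOP>.</STOP>
</Sentence>
<Sentence>
<VerbPhrase><DT>This</DT><VBZ>is</VBZ></VerbPhrase>
<NounPhrase><DT>another</DT><NN>sentence</NN></NounPhrase>
</Sentence>
</Document>"""


def tokens_of(element):
    """Yield word tokens under `element`, treating each <OSCARCM> as one token."""
    for child in element:
        if child.tag == "OSCARCM":
            leaves = (e for e in child.iter() if len(e) == 0 and e.text)
            yield " ".join(e.text.strip() for e in leaves)
        elif len(child) == 0:
            # Leaf element: carries a single word or punctuation mark.
            if child.text and child.text.strip():
                yield child.text.strip()
        else:
            # Phrase-level element: descend into it.
            yield from tokens_of(child)


def sentences_and_tokens(xml_text):
    """Return (sentence strings, token lists), one entry per <Sentence>."""
    root = ET.fromstring(xml_text)
    tokenized = [list(tokens_of(s)) for s in root.iter("Sentence")]
    sentences = [" ".join(tokens) for tokens in tokenized]
    return sentences, tokenized


sentences, tokenized = sentences_and_tokens(XML)
print(tokenized)
```

The sentence strings are rebuilt by joining tokens with spaces, so punctuation ends up space-separated ("bromide ." rather than "bromide."), and the full stop is its own token. The original spacing is not recoverable from this XML alone.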