BlueObelisk / oscar4

OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles.
Artistic License 2.0

Sample code for sentence boundary detection and word tokenization. #7

Closed: sathiyabalu89 closed this issue 3 years ago

sathiyabalu89 commented 3 years ago

First of all, thank you for making this open source. Is there any sample code that demonstrates sentence boundary detection and word tokenization?

For example, input paragraph: "This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide. This is another sentence."

1. Sentence boundary detection output: ['This is a multi word chemical component 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide.', 'This is another sentence.']
2. Word tokenization output: [['This', 'is', 'a', 'multi', 'word', 'chemical', 'component', '3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide'], ['This', 'is', 'another', 'sentence.']]

Basically, I am trying to do sentence boundary detection and tokenization for chemical documents using Python. How can I integrate this with Python?

mjw99 commented 3 years ago

Try ChemicalTagger. Its output for your example input:

<Document>
  <Sentence>
    <VerbPhrase>
      <DT>This</DT>
      <VBZ>is</VBZ>
    </VerbPhrase>
    <NounPhrase>
      <DT>a</DT>
      <JJ>multi</JJ>
      <JJ-CHEM>word</JJ-CHEM>
      <NN>chemical</NN>
      <NN>component</NN>
      <MOLECULE>
        <OSCARCM>
          <OSCAR-CM>3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl</OSCAR-CM>
          <OSCAR-CM>tetrazolium</OSCAR-CM>
          <OSCAR-CM>bromide</OSCAR-CM>
        </OSCARCM>
      </MOLECULE>
    </NounPhrase>
    <STOP>.</STOP>
  </Sentence>
  <Sentence>
    <VerbPhrase>
      <DT>This</DT>
      <VBZ>is</VBZ>
    </VerbPhrase>
    <NounPhrase>
      <DT>another</DT>
      <NN>sentence</NN>
    </NounPhrase>
  </Sentence>
</Document>

As for Python, I don't know the answer to that...
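On the Python side, one possible approach is to let ChemicalTagger produce the XML and then read that XML with Python's standard library. The sketch below assumes the XML shown above has been saved or captured as a string (how it gets from Java to Python is not covered here), and the function name `sentences_as_tokens` is made up for illustration. It treats every leaf element as one token and every `<Sentence>` element as one sentence.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the ChemicalTagger output shown above.
SAMPLE_XML = """<Document>
<Sentence>
<VerbPhrase><DT>This</DT><VBZ>is</VBZ></VerbPhrase>
<NounPhrase>
<DT>a</DT><JJ>multi</JJ><JJ-CHEM>word</JJ-CHEM><NN>chemical</NN><NN>component</NN>
<MOLECULE><OSCARCM>
<OSCAR-CM>3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl</OSCAR-CM>
<OSCAR-CM>tetrazolium</OSCAR-CM>
<OSCAR-CM>bromide</OSCAR-CM>
</OSCARCM></MOLECULE>
</NounPhrase>
<STOP>.</STOP>
</Sentence>
<Sentence>
<VerbPhrase><DT>This</DT><VBZ>is</VBZ></VerbPhrase>
<NounPhrase><DT>another</DT><NN>sentence</NN></NounPhrase>
</Sentence>
</Document>"""


def sentences_as_tokens(xml_text):
    """Return one list of (tag, text) pairs per <Sentence> element.

    Hypothetical helper: assumes every token is a leaf element whose
    text is the token itself, as in the sample output above.
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("Sentence"):
        tokens = [
            (el.tag, el.text.strip())
            for el in sentence.iter()          # document order, depth first
            if len(el) == 0 and el.text and el.text.strip()  # leaves only
        ]
        sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    for tokens in sentences_as_tokens(SAMPLE_XML):
        print([text for _, text in tokens])
```

This gives sentence boundaries and word tokens, but each part of the chemical name is still a separate token; grouping those parts is a separate step.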

sathiyabalu89 commented 3 years ago

Thank you for the reference.

sathiyabalu89 commented 3 years ago

How do I identify multi word tokens in the XML output? Will it always come under tag?

mjw99 commented 3 years ago

The multi-word tokens of a chemical component will probably be the child elements of <OSCARCM> (OSCAR chemical compound). See the OSCAR-Tagger section here
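If that holds for your documents, a rough way to rebuild the full name in Python is to join the text of the `<OSCAR-CM>` children of each `<OSCARCM>` element. This is only a sketch: `chemical_names` is a hypothetical helper, and it assumes that the only children that matter are `<OSCAR-CM>` elements, which is true of the sample output in this thread but has not been checked against other inputs.

```python
import xml.etree.ElementTree as ET


def chemical_names(xml_text):
    """Join the <OSCAR-CM> children of each <OSCARCM> into one name."""
    root = ET.fromstring(xml_text)
    names = []
    for compound in root.iter("OSCARCM"):
        parts = [
            child.text.strip()
            for child in compound              # direct children only
            if child.tag == "OSCAR-CM" and child.text
        ]
        if parts:
            names.append(" ".join(parts))
    return names


# Fragment taken from the sample output earlier in this thread.
fragment = (
    "<Sentence><MOLECULE><OSCARCM>"
    "<OSCAR-CM>3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl</OSCAR-CM>"
    "<OSCAR-CM>tetrazolium</OSCAR-CM>"
    "<OSCAR-CM>bromide</OSCAR-CM>"
    "</OSCARCM></MOLECULE></Sentence>"
)

if __name__ == "__main__":
    print(chemical_names(fragment))
    # ['3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide']
```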

sathiyabalu89 commented 3 years ago

Thank you

petermr commented 3 years ago

Thanks, everyone, for posting and replying. I am also currently using Python as well as Java (Python has easier support for data display and analysis). I think there could be value in converting the Java XML output to a form that Python can consume, e.g. to show frequencies in matplotlib. If you do this, Sathiabalu, it would be interesting to know.

-- "I always retain copyright in my papers, and nothing in any contract I sign with any publisher will override that fact. You should do the same".

Peter Murray-Rust Reader Emeritus in Molecular Informatics Yusuf Hamied Department of Chemistry University of Cambridge CB2 1EW, UK +44-1223-336432
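As a starting point for that kind of conversion, the sketch below counts how often each tag occurs in the XML and optionally plots the counts. The function names are invented for illustration, the counting uses only the standard library, and matplotlib is imported inside the plotting function so the counting part works without it. It assumes, as in the sample output above, that tokens are leaf elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_frequencies(xml_text):
    """Count how many tokens (leaf elements with text) carry each tag."""
    root = ET.fromstring(xml_text)
    return Counter(
        el.tag
        for el in root.iter()
        if len(el) == 0 and el.text and el.text.strip()
    )


def plot_frequencies(counts):
    """Draw a bar chart of the counts; needs matplotlib to be installed."""
    import matplotlib.pyplot as plt  # third-party, imported lazily

    pairs = counts.most_common()
    tags = [tag for tag, _ in pairs]
    values = [value for _, value in pairs]
    plt.bar(tags, values)
    plt.xlabel("Tag")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()


if __name__ == "__main__":
    xml_text = (
        "<Document><Sentence><NounPhrase>"
        "<DT>another</DT><NN>sentence</NN><NN>word</NN>"
        "</NounPhrase></Sentence></Document>"
    )
    counts = tag_frequencies(xml_text)
    print(counts)
    # plot_frequencies(counts)  # uncomment if matplotlib is available
```

The same `Counter` could be keyed on token text or on joined chemical names instead of tags, depending on which frequencies are of interest.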

sathiyabalu89 commented 3 years ago

Hi, I am trying various methods to integrate ChemicalTagger with Python. Once I find a working method, I will post it here.

sathiyabalu89 commented 3 years ago

I was using the OSCAR-CM tags in the output XML to retrieve multi-word chemical compounds. However, the logic fails in a few examples. For example: "Background to the invention disclose amides of 3,5-diamino-6-halo-pyrazine-2-carboxylic acid of related structure showing ENac ( Epithelial Sodium Channel ) inhibitor activity ."

Here the multi-word token "3,5-diamino-6-halo-pyrazine-2-carboxylic acid" comes under multiple consecutive OSCAR-CM tags, and I am able to combine them into a single token. However, 'Epithelial Sodium Channel' is also a multi-word token, and it comes under a different tag (mixture). Please let us know of a way to retrieve such multi-word tokens. Thank you.
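For reference, the merging of consecutive OSCAR-CM tags can be sketched roughly as follows. This is an illustration rather than the exact code in use: `merge_runs` is a hypothetical helper that works on a flat list of (tag, text) pairs already extracted from the XML, and the non-chemical tags in the example (`NNS`, `IN`) are placeholders, not necessarily the tags ChemicalTagger emits for this sentence.

```python
from itertools import groupby


def merge_runs(tagged_tokens, merge_tag="OSCAR-CM"):
    """Collapse each run of consecutive merge_tag tokens into one token."""
    merged = []
    # groupby yields one group per run of adjacent pairs sharing a tag.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[0]):
        texts = [text for _, text in group]
        if tag == merge_tag:
            merged.append((tag, " ".join(texts)))
        else:
            merged.extend((tag, text) for text in texts)
    return merged


if __name__ == "__main__":
    tokens = [
        ("NNS", "amides"),                     # placeholder tag
        ("IN", "of"),                          # placeholder tag
        ("OSCAR-CM", "3,5-diamino-6-halo-pyrazine-2-carboxylic"),
        ("OSCAR-CM", "acid"),
        ("IN", "of"),                          # placeholder tag
    ]
    for tag, text in merge_runs(tokens):
        print(tag, text)
```

One limitation of this approach: two different compounds that happen to be adjacent in the text would be merged into one token, so grouping by the parent element may be safer where a parent is available.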

petermr commented 3 years ago

Replying to sathiyabalu89's comment of Sat, Dec 26, 2020, 4:30 PM:

OSCAR (and Opsin) have been trained on mainstream chemical content, where "Sodium" is either a metal or an ion in a multiword token ("Sodium chloride").

"Sodium channel" is not a chemical compound. " channel" was probably not frequent in the training corpus and so isn't recognised. My analysis is that "sodium" is a nounal adjective. If you are doing a lot of biological chemistry then you probably need to retrain OSCAR.

Note that there are many ambiguities in parsing English: "Time flies like an arrow" ("Time" == NN), ("flies" == VB) has completely different semantics from "Fruit flies like a banana" ("Fruit flies" == NNP). Unless there is a semantic vocabulary with "fruit fly", no parser will get this right. Similarly we need a vocabulary with [ or or ] "channel" which recognises these. NLP is never 100%!

P.
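Short of retraining, the vocabulary idea can be approximated as a post-processing step in Python: keep a list of known multi-word terms and merge any token sequence that matches one of them. The sketch below is a plain greedy longest-match lookup, not anything OSCAR or ChemicalTagger provides; `merge_known_terms` and the example vocabulary are hypothetical.

```python
def merge_known_terms(tokens, vocabulary):
    """Merge token sequences that match a known multi-word term.

    Matching is case-insensitive and greedy: at each position the
    longest matching term wins.
    """
    vocab = {tuple(term.lower().split()) for term in vocabulary}
    max_len = max((len(term) for term in vocab), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            window = tuple(tok.lower() for tok in tokens[i:i + n])
            if window in vocab:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word term starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    vocabulary = ["epithelial sodium channel", "sodium channel", "fruit flies"]
    tokens = ["ENac", "(", "Epithelial", "Sodium", "Channel", ")",
              "inhibitor", "activity"]
    print(merge_known_terms(tokens, vocabulary))
    # ['ENac', '(', 'Epithelial Sodium Channel', ')', 'inhibitor', 'activity']
```

A lookup like this only finds terms that are already in the list, so it complements the tagger rather than replacing it, and it inherits the same ambiguity problem: "fruit flies" will be merged even where "flies" is a verb.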

sathiyabalu89 commented 3 years ago

Understood and thank you for the prompt response.