explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.63k stars 4.36k forks

Is there any other way to train NER for domain-specific data (medical trials)? #2008

Closed. UtkarshKhare closed this issue 6 years ago

UtkarshKhare commented 6 years ago

@honnibal @ines

Hi, I am currently working with clinical trials data and want to extract entities from the text. Training works well when I annotate my data in the pattern shown below:

TRAIN_DATA = [
    ('NCT00000281 2 Completed Yale University Behavioral Please contact site for information.',
     {'entities': [(14, 23, 'Status'), (24, 39, 'source'), (40, 50, 'I_Type')]})
]

Test data:

TEST_DATA = [
    'NCT00000477 2 Completed National Heart, Lung, and Blood Institute (NHLBI) Drug BACKGROUND: Circulating levels of cholesterol, specifically cholesterol associated with the low-density lipoprotein (LDL) fraction, have been established by observational epidemiologic studies and by metabolic, pathologic, genetic studies in humans and selected animal models, and by randomized clinical trials as a major etiologic factor in coronary heart disease. The ratio between the percent reduction in coronary heart disease incidence and the percent reduction in cholesterol levels associated with treatment in randomized trials, approximately 2:1, is almost exactly that predicted by numerous observational epidemiologic studies of this relationship. However, the clinical trials demonstrating that lowering LDL-cholesterol levels reduces subsequent incidence of coronary heart disease events have been confined by and large to middle-aged men with hypercholesterolemia as in the Lipid Research Clinics Coronary Primary Prevention Trial (LRC-CPPT) or to men with established coronary heart disease as in the Coronary Drug Project (CDP). Experimental confirmation that cholesterol-lowering treatment is worthwhile after as well as before age 60 is lacking. Thus, although the guidelines issued in October 1987 by the National Cholesterol Education Programs (NCEP) Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults did not discriminate explicitly by age, the absence of direct evidence of efficacy led them to allow room for physician judgment in applying their recommendations to older patients. This uncertainty in the application of the NCEP guidelines to older men and women is a matter of considerable consequence to the public health. Epidemi',
]

Output from the test data (entity text: label):

Completed: Status
National Heart, Lung, and Blood Institute (NHLBI): source
Drug: I_Type
heart disease: Condition
percent reduction: Condition

For training, I have to pass the whole sentence and count the start and end offsets of every entity I want spaCy's NER to recognize, for example (14, 23, 'Status') for 'Completed' in 'NCT00000281 2 Completed Yale...'. That is not practical when there is a large amount of training data. Apart from that, the existing solution works well: it identifies every entity in the test data whose label appeared in the training data.
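One way to avoid counting characters by hand is to let Python compute the offsets from the phrases themselves. The sketch below is only an illustration: `find_offsets` is a hypothetical helper, not part of spaCy, and it assumes the phrases are listed in the order in which they appear in the text.

```python
def find_offsets(text, labelled_phrases):
    """Return (start, end, label) tuples, one per (phrase, label) pair.

    Hypothetical helper: phrases must be given in order of appearance.
    """
    entities = []
    cursor = 0  # search left to right so a repeated phrase maps to its next occurrence
    for phrase, label in labelled_phrases:
        start = text.find(phrase, cursor)
        if start == -1:
            raise ValueError(f"{phrase!r} not found at or after position {cursor}")
        end = start + len(phrase)
        entities.append((start, end, label))
        cursor = end
    return entities


text = ("NCT00000281 2 Completed Yale University Behavioral "
        "Please contact site for information.")
phrases = [("Completed", "Status"),
           ("Yale University", "source"),
           ("Behavioral", "I_Type")]

# Same structure as the hand-written TRAIN_DATA in the question
TRAIN_DATA = [(text, {"entities": find_offsets(text, phrases)})]
print(TRAIN_DATA[0][1]["entities"])
# [(14, 23, 'Status'), (24, 39, 'source'), (40, 50, 'I_Type')]
```

This still needs someone to say which phrases occur in which sentence; it only removes the manual counting.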

What I want instead is for spaCy's NER to learn entities from lists of labelled phrases, such as specific drug names, disease types, etc. For example:

source = [
    'National Center for Research Resources (NCRR) ',
    'Masonic Cancer Center, University of Minnesota ',
    'Stony Brook University ',
]

condition = [
    'Congenital Adrenal Hyperplasia',
    'Lead Poisoning ',
    'Cancer ',
    'Rheumatic Diseases ',
    'Heart Defects, Congenital ',
]

intervention_type = [
    'Drug ',
    'Procedure ',
    'Biological',
]

Is it possible to train spaCy's NER on lists like these, starting from the 'en_core_web_sm' model, so that it gives the same output on the test data as shown above (which I currently get by passing offsets within sentences)? I want the model to be trained from the labelled lists and to produce the same output when tested.

honnibal commented 6 years ago

If you want to provide a list of phrases, and have those phrases recognised in text on a rule-based basis, have a look at the Matcher component: https://spacy.io/usage/linguistic-features#section-rule-based-matching
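To make the rule-based idea concrete, here is a plain-Python stand-in that looks up listed phrases in a text and returns labelled character spans. It does not use spaCy's Matcher API (which works on tokens rather than raw strings); `tag_phrases` and the sample sentence are made up for illustration, and the phrase lists are a subset of those in the question.

```python
import re


def tag_phrases(text, phrase_lists):
    """Find listed phrases in text; return non-overlapping (start, end, label) spans."""
    candidates = []
    for label, phrases in phrase_lists.items():
        for phrase in phrases:
            phrase = phrase.strip()  # the lists in the question carry trailing spaces
            if not phrase:
                continue
            # The lookarounds stop "Drug" from matching inside a longer word
            pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
            for match in re.finditer(pattern, text):
                candidates.append((match.start(), match.end(), label))
    # Prefer longer matches, then earlier ones, and drop anything that overlaps
    candidates.sort(key=lambda span: (-(span[1] - span[0]), span[0]))
    chosen = []
    for start, end, label in candidates:
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, label))
    return sorted(chosen)


phrase_lists = {
    "source": ["Stony Brook University ",
               "Masonic Cancer Center, University of Minnesota "],
    "Condition": ["Cancer ", "Lead Poisoning "],
    "I_Type": ["Drug ", "Procedure ", "Biological"],
}

sample = "Completed Stony Brook University Drug Lead Poisoning"
print(tag_phrases(sample, phrase_lists))
# [(10, 32, 'source'), (33, 37, 'I_Type'), (38, 52, 'Condition')]
```

The longest-match-first rule is a design choice of this sketch: it keeps 'Cancer' from being tagged separately inside 'Masonic Cancer Center, University of Minnesota'. Matching here is exact and case-sensitive, so unseen variants of a phrase are not found.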

If you want to attach labels to text without a specific start or end offset, have a look at the TextCategorizer component: https://spacy.io/usage/training#section-textcat

The named entity recognizer needs start and end offsets to train, because that's what it predicts. Named entity recognition is defined as the task of labelling phrases in context.
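Since the offsets are what the model learns to predict, it can help to sanity-check them before training. The sketch below uses a hypothetical helper, `check_entities`, which slices each span out of the text and rejects spans that fall outside the text, overlap, or include surrounding whitespace. These are checks that are commonly worth running; which of them spaCy itself enforces depends on the version.

```python
def check_entities(text, entities):
    """Return (surface, label) for each span, raising ValueError on a bad span."""
    surfaces = []
    previous_end = 0
    for start, end, label in sorted(entities):
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) falls outside the text")
        if start < previous_end:
            raise ValueError(f"span ({start}, {end}) overlaps an earlier span")
        surface = text[start:end]  # the phrase the offsets actually point at
        if surface != surface.strip():
            raise ValueError(f"span ({start}, {end}) includes surrounding whitespace")
        surfaces.append((surface, label))
        previous_end = end
    return surfaces


example_text = ("NCT00000281 2 Completed Yale University Behavioral "
                "Please contact site for information.")
example_entities = [(14, 23, "Status"), (24, 39, "source"), (40, 50, "I_Type")]
print(check_entities(example_text, example_entities))
# [('Completed', 'Status'), ('Yale University', 'source'), ('Behavioral', 'I_Type')]
```

Printing the surface strings makes off-by-one mistakes easy to spot, for example a span that yields 'ompleted ' instead of 'Completed'.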

UtkarshKhare commented 6 years ago

Thank you for the suggestion. I will definitely try this with my solution.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.