JohnSnowLabs / spark-nlp-workshop

Public runnable examples of using John Snow Labs' NLP for Apache Spark.
Apache License 2.0

How to process a large document which has longer text length for NER? #1031

Closed AayushSameerShah closed 1 year ago

AayushSameerShah commented 1 year ago

Discussed in https://github.com/JohnSnowLabs/spark-nlp-workshop/discussions/1028

Originally posted by **AayushSameerShah** June 7, 2023

## 📝 Brief
I am trying to use the NER for healthcare wanting to extract key "disorders" or "diseases" from different articles from the web for my use-case.

## 🧠 The model
I have used the "huggingface" model and followed the procedure like given here [JSL Tutorial](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/HuggingFace%20in%20Spark%20NLP%20-%20DistilBertForTokenClassification.ipynb#scrollTo=g_d6TUPRc2O9) to convert the HF model in TF and use in SparkNLP. And now I have the following code:

## 👩🏻‍💻 Code
```python
# loading the saved model
tokenClassifier_loaded = DistilBertForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
    .setInputCols(["document",'token'])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512) # Have tried to use this max as possible

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

tokenizer = Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence,
    tokenizer,
    tokenClassifier_loaded,
    converter
])
```

**Then I have the text**:
```
article = \
"""
ample Type / Medical Specialty: Hematology - Oncology
Sample Name: Discharge Summary - Mesothelioma - 1
Description: ...
"""
```

> The article has **30K+ characters** and with **3K+ words** *(if split by space)*.

This is where it gets crazy. When I run the following:
```python
data = spark.createDataFrame([[article]]).toDF("text")
result = pipeline.fit(data).transform(data)

row_list = [{'annotatorType': row.annotatorType,
             'begin': row.begin,
             'end': row.end,
             'result': row.result,
             'metadata': row.metadata}
            for row in result.select('ner_span').take(1)[0][0]]
len(row_list)
```
> Returns only `43` entries for entity detection.

## 🙋🏻‍♂️ The question:
I can understand that whole article can't be passed at once, but there has to be some smart way. Since I am *new* in here, I am not sure **whether to split the article in 512 chunks** and pass them one by one or something else.

Will anyone please help me here?
Thank you,
Aayush 🤗

Damla-Gurbaz commented 1 year ago

Hello @AayushSameerShah ,

I'm here to assist you. There seems to be a small error in the pipeline. You are working at the document level by passing "document" as input to your model, but you should work at the sentence level by passing your "sentence" column instead. Similarly, the NerConverter should take "sentence" as input instead of "document". To make this clearer, I have prepared a pipeline using one of our licensed NER models, together with its results.
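
As a rough illustration of why the input level matters: with `setMaxSentenceLength(512)`, each unit handed to the model is presumably capped at 512 tokens, so an article of roughly 3K words treated as one unit would lose most of its text, while the same article split into sentences fits comfortably. The sketch below is plain Python and does not call Spark NLP; `naive_sentences` and `truncate` are hypothetical helpers, the regex split is only a crude stand-in for a sentence detector, whitespace tokens stand in for the model's sub-word tokens, and the article is synthetic.

```python
import re

MAX_LEN = 512  # mirrors setMaxSentenceLength(512) from the thread


def naive_sentences(text):
    """Crude stand-in for a sentence detector: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def truncate(tokens, max_len=MAX_LEN):
    """Keep only the first max_len tokens, as a length-capped model would."""
    return tokens[:max_len]


# Synthetic article: 300 short sentences with 10 whitespace tokens each.
article = " ".join(
    f"Patient {i} reports chest pain after five minutes of exercise."
    for i in range(300)
)

# Document level: a single unit of 3000 tokens, capped at 512.
doc_tokens = article.split()
kept_doc = len(truncate(doc_tokens))

# Sentence level: every sentence is far below the cap, so nothing is dropped.
sentences = naive_sentences(article)
kept_sent = sum(len(truncate(s.split())) for s in sentences)

print(len(doc_tokens), kept_doc, kept_sent)
```

Under these assumptions only 512 of the 3000 tokens survive at the document level, whereas all of them survive at the sentence level, which is consistent with the low entity count reported in the question.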

🔎 SAMPLE PIPELINE :

```python
# Imports assumed for this snippet; the licensed annotators come from the sparknlp_jsl package,
# and `spark` is an already started Spark NLP for Healthcare session.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier, NerConverterInternal

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split the document into sentences so the model works at the sentence level
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# The classifier takes "sentence" (not "document") together with "token"
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl", "en", "clinical/models")\
    .setInputCols(["token", "sentence"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

# The converter also takes "sentence" (not "document")
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter])

sample_text = """
Mr. ABC is a 60-year-old gentleman who had a markedly abnormal stress test earlier today in my office with severe chest pain after 5 minutes of exercise on the standard Bruce with horizontal ST depressions and moderate apical ischemia on stress imaging only. He required 3 sublingual nitroglycerin in total (please see also admission history and physical for full details).

The patient underwent cardiac catheterization with myself today which showed mild-to-moderate left main distal disease of 30%, moderate proximal LAD with a severe mid-LAD lesion of 99%, and a mid-left circumflex lesion of 80% with normal LV function and some mild luminal irregularities in the right coronary artery with some moderate stenosis seen in the mid to distal right PDA.

I discussed these results with the patient, and he had been relating to me that he was having rest anginal symptoms, as well as nocturnal anginal symptoms, and especially given the severity of the mid left anterior descending lesion, with a markedly abnormal stress test, I felt he was best suited for transfer for PCI. I discussed the case with Dr. X at Medical Center who has kindly accepted the patient in transfer.

CONDITION ON TRANSFER: Stable but guarded. The patient is pain-free at this time.

MEDICATIONS ON TRANSFER:
1. Aspirin 325 mg once a day.
2. Metoprolol 50 mg once a day, but we have had to hold it because of relative bradycardia which he apparently has a history of.
3. Nexium 40 mg once a day.
4. Zocor 40 mg once a day, and there is a fasting lipid profile pending at the time of this dictation. I see that his LDL was 136 on May 3, 2002.
5. Plavix 600 mg p.o. x1 which I am giving him tonight.
"""

df = spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(df).transform(df)
```

🔎RESULTS : 👇🏻

+--------------------------+-----+----+---------------------------+----------+
|chunk                     |begin|end |ner_label                  |confidence|
+--------------------------+-----+----+---------------------------+----------+
|60-year-old               |14   |24  |Age                        |0.94577295|
|gentleman                 |26   |34  |Gender                     |0.9995837 |
|markedly abnormal         |46   |62  |Test_Result                |0.9741745 |
|stress test               |64   |74  |Test                       |0.9996149 |
|today                     |84   |88  |RelativeDate               |0.99974406|
|severe                    |108  |113 |Modifier                   |0.999809  |
|chest pain                |115  |124 |Symptom                    |0.9965067 |
|after 5 minutes           |126  |140 |RelativeTime               |0.9583209 |
|horizontal ST depressions |181  |205 |EKG_Findings               |0.9952421 |
|moderate                  |211  |218 |Modifier                   |0.9997716 |
|apical ischemia           |220  |234 |Heart_Disease              |0.99728715|
|stress imaging            |239  |252 |Test                       |0.99985504|
|He                        |260  |261 |Gender                     |0.99992895|
|3                         |272  |272 |Dosage                     |0.99897873|
|sublingual                |274  |283 |Route                      |0.9997238 |
|nitroglycerin             |285  |297 |Drug_Ingredient            |0.9997858 |
|admission                 |325  |333 |Admission_Discharge        |0.9990115 |
|cardiac catheterization   |398  |420 |Procedure                  |0.9975387 |
|today                     |434  |438 |RelativeDate               |0.9995082 |
|mild-to-moderate          |453  |468 |Modifier                   |0.9996182 |
|left                      |470  |473 |Direction                  |0.96920437|
|main                      |475  |478 |Internal_organ_or_component|0.18351443|
|distal                    |480  |485 |Direction                  |0.9167719 |
|disease                   |487  |493 |Disease_Syndrome_Disorder  |0.51986647|
|moderate                  |503  |510 |Modifier                   |0.99935615|
|proximal                  |512  |519 |Direction                  |0.9985382 |
|LAD                       |521  |523 |Internal_organ_or_component|0.9988713 |
|severe                    |532  |537 |Modifier                   |0.99965113|
|mid-LAD                   |539  |545 |Direction                  |0.9281402 |
|lesion                    |547  |552 |Symptom                    |0.9457182 |
|mid-left                  |568  |575 |Direction                  |0.6427324 |
|circumflex lesion         |577  |593 |Symptom                    |0.9425712 |
|normal                    |607  |612 |Test_Result                |0.7443403 |
|LV                        |614  |615 |Internal_organ_or_component|0.7672348 |
|function                  |617  |624 |Test                       |0.592596  |
|mild                      |635  |638 |Modifier                   |0.999859  |
|luminal irregularities    |640  |661 |Symptom                    |0.9989077 |
|right                     |670  |674 |Direction                  |0.8462074 |
|coronary artery           |676  |690 |Internal_organ_or_component|0.98542917|
|moderate                  |702  |709 |Modifier                   |0.99986166|
|stenosis                  |711  |718 |Disease_Syndrome_Disorder  |0.66917396|
|mid                       |732  |734 |Direction                  |0.9996561 |
|distal                    |739  |744 |Direction                  |0.9997109 |
|right                     |746  |750 |Direction                  |0.9980184 |
|PDA                       |752  |754 |Internal_organ_or_component|0.9953032 |
|he                        |806  |807 |Gender                     |0.99994254|
|he                        |838  |839 |Gender                     |0.99993706|
|rest                      |852  |855 |Modifier                   |0.48786902|
|anginal symptoms          |857  |872 |Symptom                    |0.9804443 |
|nocturnal                 |886  |894 |Modifier                   |0.8196311 |
|anginal symptoms          |896  |911 |Symptom                    |0.9273863 |
|mid                       |955  |957 |Direction                  |0.98741364|
|left                      |959  |962 |Direction                  |0.86374986|
|anterior descending lesion|964  |989 |Symptom                    |0.87919575|
|markedly abnormal         |999  |1015|Test_Result                |0.9932798 |
|stress test               |1017 |1027|Test                       |0.99903107|
|he                        |1037 |1038|Gender                     |0.9999273 |
|PCI                       |1073 |1075|Procedure                  |0.99704456|
|Medical Center            |1113 |1126|Clinical_Dept              |0.9993375 |
|CONDITION ON TRANSFER:    |1178 |1199|Section_Header             |0.89329183|
|pain-free                 |1236 |1244|Symptom                    |0.9997884 |
|MEDICATIONS ON TRANSFER:  |1261 |1284|Section_Header             |0.9833642 |
|Aspirin                   |1289 |1295|Drug_Ingredient            |0.88113344|
|325 mg                    |1297 |1302|Strength                   |0.9997001 |
|once a day                |1304 |1313|Frequency                  |0.9996945 |
|Metoprolol                |1319 |1328|Drug_BrandName             |0.6567626 |
|50 mg                     |1330 |1334|Strength                   |0.99977183|
|once a day                |1336 |1345|Frequency                  |0.9996777 |
|relative                  |1386 |1393|Modifier                   |0.9833679 |
|bradycardia               |1395 |1405|VS_Finding                 |0.94523096|
|he                        |1413 |1414|Gender                     |0.99992585|
|Nexium                    |1448 |1453|Drug_BrandName             |0.998147  |
|40 mg                     |1455 |1459|Strength                   |0.9997608 |
|once a day                |1461 |1470|Frequency                  |0.99971056|
|Zocor                     |1476 |1480|Drug_BrandName             |0.9998432 |
|40 mg                     |1482 |1486|Strength                   |0.9996531 |
|once a day                |1488 |1497|Frequency                  |0.9995396 |
|fasting lipid profile     |1515 |1535|Test                       |0.99982995|
|his                       |1587 |1589|Gender                     |0.99994373|
|LDL was 136               |1591 |1601|LDL                        |0.8410518 |
|May                       |1606 |1608|Date                       |0.99915946|
|2002                      |1613 |1616|Date                       |0.9991523 |
|Plavix                    |1622 |1627|Drug_BrandName             |0.9998543 |
|600 mg                    |1629 |1634|Strength                   |0.99934137|
|p.o                       |1636 |1638|Frequency                  |0.9889002 |
|.                         |1639 |1639|Frequency                  |0.97397804|
|x1                        |1641 |1642|Frequency                  |0.9974308 |
|him                       |1662 |1664|Gender                     |0.99980974|
|tonight                   |1666 |1672|RelativeDate               |0.95218736|
+--------------------------+-----+----+---------------------------+----------+

➮ As you can see, our model detected a total of 89 entities in our sample text.
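
Since the original question was about extracting disorders and diseases, the detected chunks can be filtered by label once they are collected into Python dicts like the `row_list` in the question. The sketch below makes several assumptions: the three rows are copied by hand from the table above rather than produced by running the model, the label is assumed to sit under the `entity` key of each chunk's metadata, and `DISEASE_LABELS` and `chunk_text` are hypothetical names chosen for illustration. It also shows that `end` is an inclusive offset.

```python
# Rows copied by hand from the results table above (not generated by the model here).
row_list = [
    {"result": "60-year-old", "begin": 14, "end": 24,
     "metadata": {"entity": "Age"}},
    {"result": "apical ischemia", "begin": 220, "end": 234,
     "metadata": {"entity": "Heart_Disease"}},
    {"result": "stenosis", "begin": 711, "end": 718,
     "metadata": {"entity": "Disease_Syndrome_Disorder"}},
]

# Hypothetical selection of labels that count as a disease or disorder.
DISEASE_LABELS = {"Heart_Disease", "Disease_Syndrome_Disorder"}

diseases = [
    row["result"]
    for row in row_list
    if row["metadata"]["entity"] in DISEASE_LABELS
]
print(diseases)  # ['apical ischemia', 'stenosis']


def chunk_text(text, row):
    """Recover a chunk from the source text; `end` is inclusive, hence the + 1."""
    return text[row["begin"]:row["end"] + 1]


# Beginning of sample_text, including its leading newline.
text_start = "\nMr. ABC is a 60-year-old gentleman who had a markedly abnormal stress test"
print(chunk_text(text_start, row_list[0]))  # 60-year-old
```

Which labels to keep depends on the model's label set, so the set above would need adjusting for a different NER model.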

I hope these explanations and the pipeline have been helpful to you. Have a great day!☘️

AayushSameerShah commented 1 year ago

Thank you very much for your amazing response @Damla-Gurbaz 🤗 I also have one more query, if you could have a look: My question #1052

I would really appreciate your response. It is clean and so comprehensive 💯 Thank you 🙏🏻