ckbjimmy / clneg

Clinical Text Summarization with Syntax-Based Negation and Semantic Concept Identification
MIT License

xml parsing issue in tmp.xml file #1

Open itsmemala opened 4 years ago

itsmemala commented 4 years ago

I get the following error when running main.py. How can I resolve this, please?

"Traceback (most recent call last):
  File "C:\Users\SuresMal\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 45, in
    df = ctakes_concept_extraction(data_dir, ctakes_folder, hard_section_list)
  File "C:\Users\SuresMal\Documents\GitHub\clneg\src\concept_extraction.py", line 60, in ctakes_concept_extraction
    d = [e for e in extract_cuis(data_dir + 'tmp.xml')]
  File "C:\Users\SuresMal\Documents\GitHub\clneg\src\concept_extraction.py", line 26, in extract_cuis
    cui_spans = get_cui_spans(xml_filename)
  File "C:\Users\SuresMal\Documents\GitHub\clneg\src\concept_extraction.py", line 10, in get_cui_spans
    tree = etree.parse(xml_filename)
  File "src/lxml/etree.pyx", line 3519, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "../data/tmp.xml", line 5
    [SECTION-History of Present Illness-START]
    ^
XMLSyntaxError: Start tag expected, '<' not found, line 5, column 1"

Note:

  1. I'm running on a Windows machine.
  2. I am running the code from main.py line by line in a Jupyter notebook (in the same directory as main.py).

itsmemala commented 4 years ago

I've converted the tmp.xml file to .txt and attached it here for your reference (tmp.txt). It appears it's not in XML format.
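
For anyone hitting the same error, one way to confirm that the file is not well-formed XML before it reaches etree.parse is a small standalone check like the one below. This is only a sketch: it uses the standard-library xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, the helper names check_xml_text and first_non_xml_line are hypothetical, and the sample text is made up apart from the section marker, which is copied from the traceback. The "line does not start with '<'" test is a rough heuristic for locating the offending line, not a real XML validity rule.

```python
import xml.etree.ElementTree as ET


def check_xml_text(text):
    """Return (True, None) if text parses as XML, else (False, error message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


def first_non_xml_line(text):
    """Return (line_number, content) of the first non-blank line that does not
    start with '<', or None if every non-blank line starts with '<'.

    Heuristic only: valid XML may contain plain-text lines inside elements.
    """
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped and not stripped.startswith("<"):
            return number, stripped
    return None


# Made-up sample: four blank lines, then the marker reported in the traceback.
sample = "\n\n\n\n[SECTION-History of Present Illness-START]\nsome note text\n"

ok, message = check_xml_text(sample)
print("well-formed:", ok)
print("parser message:", message)
print("first suspicious line:", first_non_xml_line(sample))
```

To run it against the real file, read the contents first (for example with open(path, encoding="utf-8").read(), where path points at tmp.xml) and pass the resulting string to both helpers. If the first suspicious line is a section marker like the one above, the file being parsed is plain text rather than XML output.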