Closed: laqua-stack closed this 1 year ago
I think something like (untested)
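import re
from collections import defaultdict

result = defaultdict(str)
s: str  # full text of one report file
# Named groups are written (?P<name>...). The following keyword is only
# peeked at with a lookahead, so it is still available for the next match.
matches = re.finditer(r'(?P<keyword>[A-Z]+):(?P<item>.*?)(?=[A-Z]+:)', s, flags=re.DOTALL)
for match in matches:
    result[match.group('keyword')] = match.group('item')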
Maybe we can handle the last entry by this?
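# Python's re rejects a group name that is defined twice, so both possible
# endings (next keyword or end of text) go into a single lookahead.
matches = re.finditer(r'(?P<keyword>[A-Z]+):(?P<item>.*?)(?=[A-Z]+:|\Z)', s, flags=re.DOTALL)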
@AmarHek Finally this did the trick ;-)
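import re
import typing
from collections import defaultdict
from pathlib import Path


def parse_txt_reports(
    fpath_report: Path,
) -> typing.Dict:
    report_content = defaultdict(str)
    with fpath_report.open('r') as f:
        s: str = f.read()
    # Each match is an ALLCAPS keyword, a colon, and the text up to the
    # next ALLCAPS keyword or the end of the file.
    matches = re.finditer(
        r'.*?(?P<keyword>[A-Z]+):(?P<item>.*?(?=[A-Z]+:|$))',
        s,
        flags=re.DOTALL
    )
    for match in matches:
        report_content[match.group('keyword').lower()] = match.group('item')
    return report_content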
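To see what the pattern in parse_txt_reports returns, here is a small sketch that applies the same regex to an in-memory string instead of a file. The helper name parse_report_text and the sample text are made up for illustration; the sample is not taken from MIMIC.

```python
import re
from collections import defaultdict

# Same pattern as in parse_txt_reports: an ALLCAPS keyword, a colon, then
# everything up to the next ALLCAPS keyword or the end of the text.
SECTION_RE = re.compile(
    r'.*?(?P<keyword>[A-Z]+):(?P<item>.*?(?=[A-Z]+:|$))',
    flags=re.DOTALL,
)


def parse_report_text(s: str) -> dict:
    """Hypothetical helper: split one report string into keyword -> text."""
    sections = defaultdict(str)
    for match in SECTION_RE.finditer(s):
        sections[match.group('keyword').lower()] = match.group('item')
    return dict(sections)


# Invented sample text, only meant to mimic the KEYWORD: layout.
sample = "FINDINGS: Lungs are clear.\nNo effusion.\nIMPRESSION: Normal chest."
parsed = parse_report_text(sample)
print(parsed)
```

Note that the section text keeps its leading space and embedded linebreaks, and that any run of capital letters directly followed by a colon is treated as a keyword, so such a token inside the free text would start a new section.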
Feature Request
Is your feature request related to a problem?
Currently, reports from MIMIC are stored in separate text files. Each file is structured like this:
Sometimes keywords are also missing.
Describe the solution you'd like:
Parse the reports into a dict and store each one as a row in our reports postgres table. We could either cycle through the files and str.split() them, or we could use re and rely on the ALLCAPS: keywords to group the file contents.
Describe alternatives you've considered.
Alternatively, they could be stored as a jsonb object in the PG table.
Additional Context
Handling unnecessary linebreaks may come in handy.
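One possible way to handle the linebreaks, sketched under the assumption that line breaks inside a section carry no meaning: collapse every run of whitespace into a single space before storing. The helper names clean_section and report_to_json are hypothetical, and the JSON string is only meant as something that could be passed to a jsonb column; no database call is shown.

```python
import json
import re


def clean_section(text: str) -> str:
    """Collapse every run of whitespace (including linebreaks) into one space."""
    return re.sub(r'\s+', ' ', text).strip()


def report_to_json(sections: dict) -> str:
    """Clean every section and serialise the result as a JSON string."""
    cleaned = {key: clean_section(value) for key, value in sections.items()}
    return json.dumps(cleaned, sort_keys=True)


# Invented example of a parsed report (keyword -> raw section text).
raw = {
    "findings": " Lungs are clear.\nNo effusion.\n",
    "impression": " Normal\nchest.",
}
payload = report_to_json(raw)
print(payload)
```

This also removes paragraph breaks inside a section, so it only fits if that structure does not need to be preserved.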