Baessler-Lab / swag-tag

A Streamlit-based webapp for annotation of medical images and reports.

feat (io/sql): Parse reports and store to postgres. #4

Closed by laqua-stack 1 year ago

laqua-stack commented 1 year ago

Feature Request

Is your feature request related to a problem?

Currently, reports from MIMIC are stored in separate text files. Each file is structured like this: image_720

Sometimes keywords are also missing.

Describe the solution you'd like:

Parse each report into a dict and store it as a row in our reports Postgres table. We could either iterate over the files and str.split() them, or use re and the ALLCAPS: keywords to group each file's content.
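For comparison, the str.split() route could look roughly like the sketch below. It is untested against real MIMIC files; the function name split_report and the idea that a heading is any line whose text before the first colon is all caps are assumptions made for illustration.

```python
from typing import Dict, Optional


def split_report(text: str) -> Dict[str, str]:
    """Group report lines under the most recent 'KEYWORD:' heading (hypothetical helper)."""
    sections: Dict[str, str] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        parts = line.split(":", 1)
        # Assumption: a heading is a line whose text before the first ':' is all caps.
        if len(parts) == 2 and parts[0].strip().isupper():
            current = parts[0].strip().lower()
            sections[current] = parts[1].strip()
        elif current is not None and line.strip():
            # Continuation line: append it to the section that is currently open.
            sections[current] = (sections[current] + " " + line.strip()).strip()
    return sections
```

Lines that appear before the first heading are dropped here; whether that is acceptable depends on how the files actually start.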

Describe alternatives you've considered.

Alternatively, each report could be stored as a jsonb object in the PG table.
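A minimal sketch of the jsonb variant: serialize the parsed dict with json.dumps and let Postgres cast it. The table name reports and the columns report_id and content are assumptions, not the project's actual schema, and the %s placeholders assume a psycopg-style driver.

```python
import json
from typing import Any, Dict, Tuple


def build_jsonb_insert(report_id: str, report: Dict[str, str]) -> Tuple[str, Tuple[Any, ...]]:
    """Return a parameterized INSERT statement and its parameters (hypothetical helper)."""
    # Hypothetical table and column names; adjust to the real schema.
    sql = "INSERT INTO reports (report_id, content) VALUES (%s, %s::jsonb)"
    # The dict travels as a JSON string; Postgres casts it to jsonb on insert.
    return sql, (report_id, json.dumps(report))
```

With a psycopg cursor this would be used as cur.execute(*build_jsonb_insert(report_id, report)). Keeping everything in one jsonb column avoids schema changes when a report is missing a keyword or has an extra one.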

Additional Context

Handling unnecessary line breaks may come in handy.
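One possible way to deal with the line breaks, as a sketch: join hard-wrapped lines within a paragraph while keeping blank-line paragraph breaks. The function name normalize_linebreaks is made up for illustration, and whether paragraph breaks should survive at all is an open choice.

```python
import re


def normalize_linebreaks(text: str) -> str:
    """Join hard-wrapped lines but keep blank-line paragraph breaks (hypothetical helper)."""
    # A paragraph boundary is a newline, optional whitespace, and another newline.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # str.split() without arguments collapses every whitespace run, including newlines.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

This could be applied to each parsed item before it is written to the table.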

laqua-stack commented 1 year ago

I think something like this could work (untested):

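import re
from collections import defaultdict

result = defaultdict(str)
s: str = ''  # placeholder: the full text of one report file
# finditer yields match objects; findall would only return tuples of strings.
# Note: the last entry has no following keyword, so this pattern skips it.
matches = re.finditer(r'(?P<keyword>[A-Z]+):(?P<item>.*?)(?=[A-Z]+:)', s, flags=re.DOTALL)
for match in matches:
    result[match.group('keyword')] = match.group('item')

laqua-stack commented 1 year ago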

Maybe we can handle the last entry like this?

matches = re.finditer(r'(?P<keyword>[A-Z]+):(?P<item>.*?)(?=[A-Z]+:|\Z)', s, flags=re.DOTALL)

laqua-stack commented 1 year ago

@AmarHek Finally this did the trick ;-)

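import re
import typing
from collections import defaultdict
from pathlib import Path


def parse_txt_reports(
        fpath_report: Path,
) -> typing.Dict:
    # Keywords missing from a report default to an empty string on lookup.
    report_content = defaultdict(str)
    with fpath_report.open('r') as f:
        s = f.read()
    # Each match is one "KEYWORD:" heading plus the text up to the next
    # heading (or the end of the file); DOTALL lets '.' span line breaks.
    matches = re.finditer(
        r'.*?(?P<keyword>[A-Z]+):(?P<item>.*?(?=[A-Z]+:|$))',
        s,
        flags=re.DOTALL
    )
    for match in matches:
        # Keys are stored in lower case.
        report_content[match.group('keyword').lower()] = match.group('item')

    return report_content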
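To sanity-check the pattern above without touching the file system, the same regex can be run on an in-memory string. This is only a sketch: parse_report_text is a hypothetical variant that also strips surrounding whitespace from each item, and the sample report is made up, not taken from MIMIC.

```python
import re
from typing import Dict

# Same pattern as in parse_txt_reports above.
REPORT_PATTERN = re.compile(
    r'.*?(?P<keyword>[A-Z]+):(?P<item>.*?(?=[A-Z]+:|$))',
    flags=re.DOTALL,
)


def parse_report_text(s: str) -> Dict[str, str]:
    """Parse report text into {keyword: item}, stripping whitespace (hypothetical helper)."""
    return {
        m.group('keyword').lower(): m.group('item').strip()
        for m in REPORT_PATTERN.finditer(s)
    }


# Made-up sample text, for illustration only.
sample = "FINDINGS: Lungs are clear.\nIMPRESSION: Normal.\n"
parsed = parse_report_text(sample)
print(parsed)  # {'findings': 'Lungs are clear.', 'impression': 'Normal.'}
```

Because the keyword group is [A-Z]+, a multi-word heading would only contribute its last word as the key, so real reports may need a wider character class.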