baeseongsu / ehrxqa

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images
MIT License

about datetime parsing issue #11

Open baeseongsu opened 3 weeks ago

baeseongsu commented 3 weeks ago

There is a potential error in the code below when the month is 12:

https://github.com/baeseongsu/ehrxqa/blob/724bff13a9a2e430ecd54f32e3ef3789bb7fcdb3/experiment/NeuralSQL/executor/sqlglot/executor/env.py#L238-L242

In particular, the following line raises a ValueError when month is 12, because month + 1 evaluates to 13, which is not a valid month:

day = min(arg.day, (datetime.datetime(year, month + 1, 1) - datetime.timedelta(days=1)).day) 

The corrected version should be:

if month == 12:
    last_day = (datetime.datetime(year + 1, 1, 1) - datetime.timedelta(days=1)).day
else:
    last_day = (datetime.datetime(year, month + 1, 1) - datetime.timedelta(days=1)).day
day = min(arg.day, last_day)
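
The same clamping can also be written without special-casing December by asking `calendar.monthrange` for the length of the target month. The sketch below is only an illustration and is not the code in `env.py`; the helper name `add_months` and its signature are assumptions made for this example.

```python
import calendar
import datetime


def add_months(arg: datetime.datetime, months: int) -> datetime.datetime:
    """Shift `arg` by `months`, clamping the day to the target month's length.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Use a zero-based month index so year rollover is plain integer division.
    total = arg.year * 12 + (arg.month - 1) + months
    year, month = divmod(total, 12)
    month += 1
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return arg.replace(year=year, month=month, day=min(arg.day, last_day))


# The original expression fails for December because month 13 does not exist.
try:
    datetime.datetime(2023, 12 + 1, 1)
    december_failed = False
except ValueError:
    december_failed = True

print(december_failed)                                  # True
print(add_months(datetime.datetime(2023, 12, 31), 2))   # 2024-02-29 00:00:00
```

Because the year and month are derived from a single running month count, negative offsets and multi-year offsets go through the same path as the December case.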
baeseongsu commented 3 weeks ago

For reference, this is the previous (customized) datetime function for the sqlglot package, which was used in the arXiv (v2) paper:


import datetime


def datetime_fn(arg, fmt=None):
    """
    Custom implementation of datetime function.
    """
    if isinstance(arg, str):
        arg = datetime.datetime.fromisoformat(arg)

    if fmt is None:
        return arg

    if fmt == "start of month":
        return datetime.datetime(arg.year, arg.month, 1)
    elif fmt == "start of year":
        return datetime.datetime(arg.year, 1, 1)
    elif fmt == "start of day":
        return datetime.datetime(arg.year, arg.month, arg.day)
    else:
        number, unit = fmt.split(" ")
        if unit == "day":
            return arg + datetime.timedelta(days=int(number))
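        elif unit == "month":
            # Approximation: a month is treated as a fixed 30 days.
            return arg + datetime.timedelta(days=int(number) * 30)
        elif unit == "year":
            # Approximation: a year is treated as a fixed 365 days.
            return arg + datetime.timedelta(days=int(number) * 365)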
        else:
            raise NotImplementedError(f"Unsupported unit '{unit}'.")
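
To illustrate why a fixed 30-day month is only an approximation, the sketch below compares it with SQLite's own `'+1 month'` modifier through Python's built-in `sqlite3` module. The sample date is arbitrary and was chosen only for this example.

```python
import datetime
import sqlite3

start = datetime.datetime(2024, 1, 15)

# Approximation used by the previous function: one month == 30 days.
approx = start + datetime.timedelta(days=30)

# SQLite applies calendar-aware month arithmetic to the same input.
conn = sqlite3.connect(":memory:")
(sqlite_result,) = conn.execute(
    "SELECT datetime(?, '+1 month')", (start.isoformat(sep=" "),)
).fetchone()
conn.close()

print(approx.isoformat(sep=" "))  # 2024-02-14 00:00:00
print(sqlite_result)              # 2024-02-15 00:00:00
```

The two results differ by one day here because January has 31 days, so queries that compare against month offsets can select different rows under the two implementations.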
baeseongsu commented 2 weeks ago

Note that in SQLite3, the strftime function has different format specifiers, including %J and %j, which differ only in capitalization. The %J specifier returns the Julian day number as a fractional value. For example, strftime('%J', '2024-08-27') returns 2460549.5, the Julian day number for midnight at the start of that date.

On the other hand, %j returns the day of the year as a three-digit number, ranging from 001 to 366 (taking leap years into account). Because %j restarts at 001 every year, subtracting two %j values gives a wrong result whenever the dates fall in different years. Therefore, to accurately compute the time gap between two dates, you should use the uppercase format specifier %J.
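
The difference between the two specifiers can be checked with a small query run through Python's built-in `sqlite3` module. This is a minimal sketch; the two dates are arbitrary examples that lie two days apart on either side of a year boundary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two dates that are two days apart but fall in different years.
query = """
SELECT
    strftime('%J', '2024-01-02') - strftime('%J', '2023-12-31') AS julian_gap,
    strftime('%j', '2024-01-02') - strftime('%j', '2023-12-31') AS day_of_year_gap
"""
julian_gap, day_of_year_gap = conn.execute(query).fetchone()
conn.close()

print(julian_gap)       # 2.0
print(day_of_year_gap)  # -363
```

Only the %J difference matches the real gap of two days; the %j difference subtracts day 365 of 2023 from day 2 of 2024.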