data-liberation-project / phmsa-hazmat-incident-reports

Data from decades of PHMSA's "5800.1" hazardous material transportation incident reports
https://www.data-liberation-project.org/datasets/phmsa-hazmat-incident-reports/
7 stars 3 forks source link

Raise exception on unqueryable months #1

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

The data in the portal only extends from January 1971 to the current month. To avoid pinging the server with unqueryable dates, scripts/00-fetch.py should throw an exception for months in the future and for months before 1971.

gcappaert commented 1 year ago

The code below at the top of the main() function works. I didn't submit a pull request yet because:

1) It would require an additional import from dateutil, which isn't part of the standard library. If that's an issue, I can write it using timedelta only, but as you know it's tricky about months/years.

2) Would you rather throw exceptions in main() or lower in the call stack? I figure you'll reuse some of this code in the future, so I want to make sure to know where to put an exception with hardcoded dates specific to this dataset.

Always feel free to tear apart my code. I'm definitely still learning.

Cheers,

Gustav

def main() -> None:
    args = parse_args(sys.argv[1:])

    today = datetime.today()
    if args.year < 1971 or args.year + args.month > today.year + today.month:
        raise ValueError("Reports are only availble from January 1971 to present")

    end_date = datetime(year=args.year, month=args.month, day=1) + relativedelta(months=args.num_months)
    if end_date > today:
        raise ValueError("The number of months specified includes a period in the future for which there is no data. Only data up to the present can be fetched")
jsvine commented 1 year ago

Thanks for sharing this draft and for flagging your concerns/questions, @gcappaert. Really appreciate it.

Re. 1: Although I'm not categorically opposed to adding dateutil as a dependency, I think we can probably skip it. In fact, per your second question, I think moving the check higher up in the stack would solve that.

Maybe something like this:

The downside vs. your approach is that users might not receive that error until partway through their scrape (if, e.g., they're trying to do the 24 months back from January 1972) — but I'm not too concerned about that because the scraper is designed to fail gracefully mid-scrape. The upside is one less dependency, and a little bit more modularity.

What do you think? Certainly open to other ideas, and entirely possible there's a flaw in my suggestion.

gcappaert commented 1 year ago

I think that's just fine! Your approach is more modular and it has the advantage of fetching data up 'til it runs into a problem date. I submitted the pull.

jsvine commented 1 year ago

Completed in #7!