dh-tech / undate-python

A Python library for working with fuzzy, partial, or otherwise uncertain dates
Apache License 2.0
8 stars 1 forks source link

human readable date formatter #79

Open rlskoeser opened 5 months ago

rlskoeser commented 5 months ago

ideally should support conversion in both directions; implementation should support localization for so it works in multiple language; this either overlaps with or can leverage #78

ColeDCrawford commented 1 month ago

Cole and Rebecca tried out few-shot prompting for going from human-readable dates to EDTF. We provided the EDTF spec and a list of example string-EDTF pairs.

Cole tested GPT-4o. It did fairly well and responded well to corrections when it missed a kind of formatting - it parsed similar examples correctly after the correction. Rebecca tested Claude Sonnet 3.5; it did similarly well with formatting, but struggled with calendar conversions. We're going to stick with parsing human-readable dates to EDTF for now.

The core prompt:

I am providing the spec for the Library of Congress EDTF (Extended Date Time Format). I am also providing a list of paired plaintext strings with their EDTF representations as few shot examples. I would like you to respond to future queries of plaintext with the assumption that they are dates, and to parse them into the EDTF format. Return results with the JSON structure:

{
"input": <input_text>,
"edtf": <parsed_edtf>
}

EDTF_Date Conversion Experiment.md

ColeDCrawford commented 1 month ago

We are going to try to standardize a prompt with the needed EDTF spec and examples as context to include with each query. We create some test data and a script that we can use with multiple LLMs (maybe using litellm or another similar library) so we can benchmark against different large and small foundational models. If the results are promising, we can extend this into functionality that can be used in the core undate library.

rlskoeser commented 1 month ago

I was using the version of Claude integrated with the Zed editor, which takes advantage of Anthropic's context caching. Here's the prompt I started with:

@fetch https://www.loc.gov/standards/datetime/ Based on the specification for the Extended Date Time Format (EDTF) and examples of human readable strings and EDTF representation, please parse other plain text strings into EDTF format into this json structure:

{'input': <input_text>, 'parsed': <parsed_edtf>, 'notes': <comments>}

Here are some examples: ...

I was curious how or if it would handle calendar conversion, so I tried a few examples from the Princeton Geniza Project; it did ok on some of them but others had different results (or different precision) than what we have in PGP.

Here's the full transcript of my experiments, minus the contents of the LOC EDTF spec which the @fetch directive inserts at the top. claude-date-parsing-experiments.txt