arrow-py / arrow

🏹 Better dates & times for Python
https://arrow.readthedocs.io
Apache License 2.0

Return matched string after successful parsing #897

Open mredaelli opened 3 years ago

mredaelli commented 3 years ago

Feature Request

Desiderata: have a way to retrieve the part of the string that arrow matched to the format in a successful parse.

I see at least two use cases.

Get the "rest"

It is not an uncommon need, at least in the domain of web scraping, to extract a date from a string and to store the remaining information from that string somewhere else.

Getting the date with arrow is awesomely easy, but once I have that I don't know of a good way to "remove the date I extracted and get the rest of the string", other than formatting the date with all the formats and replacing it in the string.

Even that is not bulletproof, though, because of the second use case:

Multiple matches

Suppose I want all the dates in a string that match a certain format. As it is now, I only get the result from the first match.

If I had the information of where the result was matched, I could at least call get again on the substring right after the match.
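The "call get again on the substring right after the match" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `DATE_RE`, `first_date` and `all_dates` are hypothetical names, the regex is hand-written for `YYYY-MM-DD` dates rather than taken from arrow's parser, and `datetime.strptime` stands in for the parsing step so the sketch runs with the standard library alone.

```python
import re
from datetime import datetime

# Assumption: dates look like YYYY-MM-DD. This pattern is hand-written for the
# sketch; it is not the pattern arrow derives from a format string.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def first_date(text, offset=0):
    """Return (date, start, end) for the first date at or after offset, or None."""
    m = DATE_RE.search(text, offset)
    if m is None:
        return None
    # strptime is a stand-in parser; it raises ValueError for impossible dates.
    return datetime.strptime(m.group(), "%Y-%m-%d"), m.start(), m.end()


def all_dates(text):
    """Repeat the single-match search, resuming right after each match."""
    found, offset = [], 0
    while (hit := first_date(text, offset)) is not None:
        found.append(hit)
        offset = hit[2]  # continue from the end of the previous match
    return found
```

With the start and end positions in hand, the caller can both collect every date and slice the surrounding text, which is exactly the information a plain parse result does not expose today.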

jadchaar commented 3 years ago

One possibility could be to introduce an internal method that gets the indices of the matches, and this could be called in these special use cases (but otherwise would largely be internal). We could also introduce a new flag or method that does a multi-match regex and returns a list of arrow objects. This could require a decently sized refactor though.

I do know that the dateparser package is relatively popular for web scraping. Check out the search_dates method: https://dateparser.readthedocs.io/en/latest/#dateparser.search.search_dates.

This seems like a highly specialized feature request, so if dateparser does the job, let us know!

mredaelli commented 3 years ago

Dateparser is what we used before, but ran away from, so I'd much rather have the functionality in arrow :)

Not sure how "nice" it would be, but I'd be more than happy with just an optional parameter of get, say return_matched_string, which if True returns a tuple (date, match object) or simply (date, matched_string), instead of just the date.
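To make the proposed return shape concrete, here is a minimal sketch of the behaviour being asked for. It does not use arrow and is not arrow's API: `get_date` is a hypothetical stand-in for `get`, the `YYYY-MM-DD` regex is an assumption, and `datetime.strptime` replaces arrow's parser so the example stays self-contained.

```python
import re
from datetime import datetime

# Assumption: a fixed YYYY-MM-DD pattern, hand-written for this sketch.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def get_date(text, return_matched_string=False):
    """Parse the first date in text; optionally also return what was matched."""
    m = DATE_RE.search(text)
    if m is None:
        raise ValueError(f"no date found in {text!r}")
    date = datetime.strptime(m.group(), "%Y-%m-%d")
    if return_matched_string:
        return date, m.group()  # the (date, matched_string) variant
    return date  # default: unchanged return type


# Usage: recover the "rest" by removing the matched text once.
title = "Release notes 2021-03-15 for v1.0"
date, matched = get_date(title, return_matched_string=True)
rest = title.replace(matched, "", 1)
```

Returning the match object instead of the string would additionally expose the match position, at the cost of leaking a regex detail into the public return value.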

But also just the dedicated low-level function would be great (assuming it's still going to be relatively stable :) )

Oh, and I can try my hand at a solution along one of these lines, if you want.