adzap / timeliness

Fast date/time parsing for the control freak.
http://github.com/adzap/timeliness
MIT License

Parse prefix #2

Closed electrum closed 13 years ago

electrum commented 13 years ago

I'd like to parse a date from the start of a string, ignoring any invalid characters that follow a valid date. For example, using the format mmm d, yyyy, parse the following:

March 21, 2010 is a Monday

The extra characters " is a Monday" would be ignored.

adzap commented 13 years ago

This is a tricky one. It should be possible to create a custom format that allows other text around the date, but you might want to look at Chronic or Nickel (https://github.com/lzell/nickel).

The limitation for Timeliness is that it uses regexps, and it removes some regex-specific characters from a format before compiling the string into a regexp. That makes it a little tricky to write a format that allows open-ended strings with junk in them.

Timeliness is more about control and speed than freedom to parse any string.

electrum commented 13 years ago

Thanks for the link to Nickel -- I hadn't seen that one. What I ended up doing was combining regexes with strptime:

  '\d{1,2}/\d{1,2}/\d{2}' => '%m/%d/%y',
  '\d{1,2}/\d{1,2}/\d{4}' => '%m/%d/%Y',
  '[a-z]{3,} \d{1,2}, \d{4}' => '%b %d, %Y',

You need the regexes because strptime will incorrectly parse strings that don't match the format:

ruby > Date.strptime('01/02/03', '%m/%d/%Y')
 => #<Date: 0003-01-02 (3444309/2,0,2299161)> 
ruby > Date.strptime('01/02/2003', '%m/%d/%y')
 => #<Date: 2020-01-02 (4917701/2,0,2299161)> 

Fortunately, its laxness causes it to ignore the extra junk at the end.
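A minimal sketch of that regex-plus-strptime combination, assuming only Ruby's standard date library. The patterns and formats are the ones listed above; the names PREFIX_FORMATS and parse_date_prefix are hypothetical. Two details are assumptions added here rather than part of the original snippet: the four-digit-year pattern is tried before the two-digit one (with a lookahead so a two-digit year cannot match the start of a longer one), and only the matched prefix is handed to strptime, so the sketch does not depend on strptime's leniency.

```ruby
require 'date'

# Anchored guard regexes paired with the strptime format each one protects.
# Order matters: the four-digit-year pattern is checked first.
PREFIX_FORMATS = {
  %r{\A\d{1,2}/\d{1,2}/\d{4}}       => '%m/%d/%Y',
  %r{\A\d{1,2}/\d{1,2}/\d{2}(?!\d)} => '%m/%d/%y',
  /\A[a-z]{3,} \d{1,2}, \d{4}/i     => '%b %d, %Y'
}.freeze

# Returns a Date parsed from the start of +string+, or nil if no guard
# regex matches or the matched text is not a valid date.
def parse_date_prefix(string)
  PREFIX_FORMATS.each do |pattern, format|
    match = pattern.match(string)
    next unless match

    begin
      # Parse only the matched prefix; trailing text never reaches strptime.
      return Date.strptime(match[0], format)
    rescue ArgumentError
      next # looked like a date but was not valid (e.g. month 13)
    end
  end
  nil
end
```

With this sketch, parse_date_prefix('March 21, 2010 is a Monday').to_s gives "2010-03-21", and a string with no leading date gives nil.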

adzap commented 13 years ago

I played with Timeliness some more to get this to work and found that this is possible:

Timeliness.parse('March 12, 2011 asdf', :type => :date, :format => 'mmm d, yyyy [a-zA-Z0-9 ]*')

It's very fragile, however. You would need to extend the character class with any other characters that may appear in the trailing text.
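The fragility can be illustrated with a plain-Ruby analogy. This is not Timeliness's compiled regexp; it assumes only that the format has to match the whole string, and the constant names are hypothetical. A whitelist tail rejects the input as soon as the trailing text contains a character outside the class, whereas a catch-all tail does not.

```ruby
# Date portion equivalent to 'mmm d, yyyy' (hypothetical stand-in pattern).
DATE_PART = /[a-z]{3,} \d{1,2}, \d{4}/i

# Tail restricted to the same whitelist used in the format above.
WHITELIST_TAIL = /\A#{DATE_PART} [a-zA-Z0-9 ]*\z/

# Tail that accepts anything after the date.
ANY_TAIL = /\A#{DATE_PART}\b.*\z/m

[
  'March 12, 2011 asdf',      # only whitelisted characters after the date
  'March 12, 2011 (approx.)'  # parentheses and a period are not whitelisted
].each do |input|
  puts "#{input.inspect}: whitelist=#{WHITELIST_TAIL.match?(input)} any=#{ANY_TAIL.match?(input)}"
end
```

Whether a catch-all such as .* survives Timeliness's format preprocessing is not something this sketch establishes.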