AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License
460 stars 43 forks

Furthering Date Parsing #9

Open AndyTheFactory opened 11 months ago

AndyTheFactory commented 11 months ago

Issue by stripathi669 Sun Feb 8 13:25:27 2015 Originally opened as https://github.com/codelucas/newspaper/issues/119


First of all, let me say that this library is amazing.

Getting back to the topic, the code for date parsing uses 3 techniques:

In practice, when the first two don't work, one has to rely on the third. However, your code currently uses this regex:

DATEREGEX = r'([./-]{0,1}(19|20)\d{2})[./-]{0,1}(([0-3]{0,1}[0-9][./-])|(\w{3,5}[./-_]))([0-3]{0,1}[0-9][./-]{0,1})?'

However, this regex doesn't cover all the formats in which a date can be written. If we could replace it with a regex that accounts for all standard date formats, it would be a nice addition to the library.
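To make the gap concrete, here is a minimal sketch that runs the DATEREGEX quoted above against a few inputs. The helper name `find_date_fragment` and the sample strings (including the example.com URL) are made up for illustration and are not part of the library.

```python
import re

# Pattern copied verbatim from the issue text above.
DATEREGEX = r'([./-]{0,1}(19|20)\d{2})[./-]{0,1}(([0-3]{0,1}[0-9][./-])|(\w{3,5}[./-_]))([0-3]{0,1}[0-9][./-]{0,1})?'
DATE_RE = re.compile(DATEREGEX)


def find_date_fragment(text):
    """Hypothetical helper: return the first substring matched by DATEREGEX, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None


# Hypothetical sample inputs, chosen only for illustration.
samples = [
    "https://example.com/2015/02/08/story",  # year-first path: matched
    "02/08/2015",                            # mm/dd/yyyy: not matched
    "08.02.2015",                            # dd.mm.yyyy: not matched
]
for sample in samples:
    print(sample, "->", find_date_fragment(sample))
```

The pattern needs a four-digit year starting with 19 or 20 to come first, followed by a month, so strings that put the year last fall through.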

I propose this for all mm/dd/yyyy, mm-dd-yyyy, and mm.dd.yyyy formats:

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
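As a rough way to see what the proposed pattern accepts, the sketch below compiles it unchanged and checks a handful of strings. The names `PROPOSED_DATE_RE` and `is_month_first_date` and the sample inputs are assumptions for illustration only; nothing here reflects how the library would actually wire it in.

```python
import re

# Proposed pattern copied verbatim from the issue, split over three
# adjacent raw strings (one per top-level alternative) for readability.
PROPOSED_DATE_RE = re.compile(
    r'^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[1,3-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$'
    r'|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$'
    r'|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$'
)


def is_month_first_date(text):
    """Hypothetical helper: True if the whole string matches the proposed pattern."""
    return PROPOSED_DATE_RE.match(text) is not None


# Hypothetical sample inputs, chosen only for illustration.
for sample in ["02/08/2015", "12.31.1999", "02-29-2016",
               "02/29/2015", "02-08/2015", "2015-02-08"]:
    print(sample, "->", is_month_first_date(sample))
```

Two things stand out when trying it. The pattern is anchored with `^` and `$`, so it validates a string that is exactly a date and would need adapting to find a date embedded in a longer URL or text. Also, the character class `[1,3-9]` contains a literal comma, so an input such as `,/30/2015` is accepted; `[13-9]` was probably intended.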

@codelucas: What do you think of it?

AndyTheFactory commented 11 months ago

Comment by codelucas Sun Mar 15 00:45:42 2015


Great idea, this would definitely help. Mind filing a PR with that regex and presenting examples where your regex wins out?

AndyTheFactory commented 11 months ago

Comment by yprez Wed Mar 9 17:22:46 2016


Some work was done in #134