HaveF / feedparser

Automatically exported from code.google.com/p/feedparser

handle RFC822 dates with timezones like GMT+00:00 #304

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
This date handler handles these formats correctly, as far as I can tell:

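import datetime, re

# Trailing numeric offset such as "+00:00" or "-05:30".
_pattern = re.compile(r"([+-])(\d{1,2}):(\d{2})$")

def _myDateHandler(aDateString):
  # Drop the trailing offset so that strptime only sees "... GMT".
  _tuple = datetime.datetime.strptime(aDateString.rstrip("+-1234567890:"), "%a, %d %b %Y %H:%M:%S %Z")
  _sign, _hours, _minutes = _pattern.search(aDateString).groups()
  if _sign == '-':
    _hours = -1 * int(_hours)
    _minutes = -1 * int(_minutes)
  else:
    _hours = int(_hours)
    _minutes = int(_minutes)
  _td = datetime.timedelta(hours=_hours, minutes=_minutes)
  # GMT+01:00 is one hour ahead of GMT, so subtract the offset to get GMT.
  _tuple -= _td
  return _tuple.timetuple()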

Original issue reported on code.google.com by Mic.Ga...@gmail.com on 6 Sep 2011 at 12:37

GoogleCodeExporter commented 9 years ago
This looks a great deal like an RFC822-style date. I like that you're using the 
extra argument to `rstrip()`! You should also be aware that the `strptime()` 
format codes are locale-dependent, so the day and month names (%a and %b) will 
only parse if the interpreter's locale is English-based.

Do you have a link to a specification that better explains this variation, or 
to a live feed that uses this, or to a piece of software that outputs this date 
format?

Original comment by kurtmckee on 6 Sep 2011 at 4:16
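
One way to sidestep the locale dependence mentioned above is to skip `%a` and `%b` entirely and look the month up in a fixed English table. The following is only a minimal sketch, not feedparser's actual parser: the names `_MONTHS`, `_DATE_RE` and `parse_rfc822_naive` are hypothetical, and the function deliberately ignores the timezone part of the string.

```python
import datetime
import re

# English month abbreviations used by RFC 822 dates, independent of the locale.
_MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}

_DATE_RE = re.compile(
    r"(?:[A-Za-z]{3},\s*)?"                    # optional day name, ignored
    r"(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})\s+"  # day, month name, year
    r"(\d{2}):(\d{2}):(\d{2})"                 # hours, minutes, seconds
)


def parse_rfc822_naive(date_string):
    """Return a naive datetime for the date and time fields, or None.

    Anything after the seconds (the timezone) is ignored here.
    """
    match = _DATE_RE.match(date_string.strip())
    if match is None:
        return None
    day, month_name, year, hour, minute, second = match.groups()
    month = _MONTHS.get(month_name.capitalize())
    if month is None:
        return None
    return datetime.datetime(
        int(year), month, int(day), int(hour), int(minute), int(second)
    )
```

Because the month table is hard-coded, the result does not change with the interpreter's locale setting; the timezone still has to be applied separately.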

GoogleCodeExporter commented 9 years ago
Try this URL: 
http://news.google.de/news?pz=1&ned=de&hl=de&q=site:t-online.de&scoring=n&output=rss

Original comment by Mic.Ga...@gmail.com on 7 Sep 2011 at 10:13

GoogleCodeExporter commented 9 years ago
Great, thanks!

I'm going to fix this by incorporating its handling in the RFC822 date parser 
already in feedparser, although that parser needs to be completely ripped out: 
I discovered that Mark copied almost all of its code from the rfc822 module in 
r147. For that reason I'm going to have to replace the entire function, but in 
the process I'll add support for this timezone notation.

Original comment by kurtmckee on 16 Sep 2011 at 3:13

GoogleCodeExporter commented 9 years ago
Here is another feed if you need test cases.

http://gdata.youtube.com/feeds/base/users/BlueXephos/uploads?alt=rss&v=2&orderby=published&client=ytapi-youtube-profile

An example item from curling the feed:

<pubDate>Thu, 17 Nov 2011 15:06:46 +0000</pubDate>
<atom:updated>2011-11-18T17:01:01.000Z</atom:updated>

If you parse the feed with feedparser you'll see that published and 
published_parsed are not present, but updated and updated_parsed are OK.

Original comment by josh.ric...@gmail.com on 18 Nov 2011 at 5:07

GoogleCodeExporter commented 9 years ago
@josh.rickard: That's because in the current code, `pubDate` maps to `updated` 
rather than `published`. I'll have to review the stated purpose of the 
`pubDate` element to see if that's actually correct behavior.

Original comment by kurtmckee on 18 Nov 2011 at 5:37
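
Until the mapping question is settled, calling code can avoid depending on which key `pubDate` ends up under by checking both. This is a sketch from the caller's side only, assuming the entry behaves like a dictionary with the `published_parsed` and `updated_parsed` keys mentioned above; the helper name `entry_timestamp` is hypothetical and a plain `dict` stands in for a real entry.

```python
def entry_timestamp(entry):
    """Return the first parsed date found on a feedparser-style entry, or None.

    Prefers published_parsed and falls back to updated_parsed.
    """
    for key in ("published_parsed", "updated_parsed"):
        value = entry.get(key)
        if value is not None:
            return value
    return None


# A plain dict standing in for an entry that only carries the updated date.
example_entry = {"updated_parsed": (2011, 11, 18, 17, 1, 1, 4, 322, 0)}
timestamp = entry_timestamp(example_entry)
```

The preference order is a design choice for illustration; swap the tuple order if the last-modified date matters more than the creation date.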

GoogleCodeExporter commented 9 years ago
@kurt - that explains what I'm seeing. My memory might be failing me, but I 
thought in the olden days of 4.x that `published` gave you when the item was 
first created and `updated` was when it was last modified, obviously all 
contingent on what is present in the feed.

Original comment by josh.ric...@gmail.com on 20 Nov 2011 at 3:25

GoogleCodeExporter commented 9 years ago
Shoot, I was about to fix this but it looks like the sample feed has been fixed 
both at Google.de and at T-online.de. Do you have a link to a feed that's still 
exhibiting this problem? I don't want to fix this for just GMT+00:00 if the 
feeds no longer emit it, or if there are similar variations that I could 
account for at the same time.

Original comment by kurtmckee on 13 Feb 2012 at 8:47

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r691.

Original comment by kurtmckee on 20 Feb 2012 at 6:35

GoogleCodeExporter commented 9 years ago
I finally found a reference to this timezone format at:

    http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html#timezone

Turns out the additional timezone info isn't limited to +00:00. Sorry it took 
so long to find information about this timezone format!

Original comment by kurtmckee on 20 Feb 2012 at 6:38
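
For illustration, handling arbitrary offsets in this notation (not just +00:00) could look roughly like the sketch below. It is not the code from r691: the names `zone_offset_minutes` and `to_utc` are hypothetical, and the accepted forms (`GMT+H:mm`, `GMT-HH:mm`, plus the plain RFC 822 `+HHmm`) are an assumption based on the SimpleDateFormat description linked above.

```python
import datetime
import re

# "GMT+H:mm" / "GMT-HH:mm", or a bare numeric "+HHmm" offset.
_GMT_OFFSET_RE = re.compile(r"^(?:GMT)?([+-])(\d{1,2}):?(\d{2})$")


def zone_offset_minutes(zone):
    """Return the offset from GMT in minutes, or None if not recognized."""
    zone = zone.strip()
    if zone in ("GMT", "UT", "Z"):
        return 0
    match = _GMT_OFFSET_RE.match(zone)
    if match is None:
        return None
    sign, hours, minutes = match.groups()
    hours, minutes = int(hours), int(minutes)
    if hours > 23 or minutes > 59:
        return None
    total = hours * 60 + minutes
    return -total if sign == "-" else total


def to_utc(naive_local, zone):
    """Shift a naive local datetime to UTC using the zone string, or None."""
    offset = zone_offset_minutes(zone)
    if offset is None:
        return None
    # A positive offset means the local clock is ahead of GMT, so subtract it.
    return naive_local - datetime.timedelta(minutes=offset)
```

Note the direction of the shift: a time stamped `GMT+02:00` is two hours ahead of GMT, so converting it to UTC subtracts the offset rather than adding it.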