feedbin / support


Links without hostname don't work #177

Open henrik opened 11 years ago

henrik commented 11 years ago

Links without a hostname, like in the http://etsy.nyh.name/search?q=pug feed, break with Feedbin (it assumes a host of feedbin.me). I'm pretty sure they work in Google Reader and Reeder (they assume the host of the feed link).

benubois commented 11 years ago

Feedbin rewrites all links to point to the source. Could you send the entry id of the entry that is not being rewritten properly? When I subscribe to http://etsy.nyh.name/search?q=pug I get links like http://etsy.nyh.name/listing/108211456/concrete-pug-dog-statue-or-memorial?ref=sr_list_3&sref=&ga_search_query=pug&ga_order=date_desc&ga_page=1&ga_view_type=list&ga_search_type=all&ga_facet=pug

That link uses the hostname of the feed itself, since the feed does not provide a <link rel="alternate" type="text/html" href=""/> option.

Feedbin uses html-pipeline for all HTML filtering; in this case absolute_href_filter.rb is used to try to make all urls fully qualified.
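
As a rough illustration of what "fully qualified" means here, a relative href can be resolved against a base URL with Ruby's standard URI library. This is a minimal sketch, not the code in absolute_href_filter.rb; the `absolutize` helper name is made up for the example.

```ruby
require 'uri'

# Hypothetical helper: resolve an href against a base URL.
# An href that is already absolute comes back unchanged.
def absolutize(base_url, href)
  URI.join(base_url, href).to_s
end

absolutize('http://etsy.nyh.name/search?q=pug',
           '/listing/108211456/concrete-pug-dog-statue-or-memorial')
# => "http://etsy.nyh.name/listing/108211456/concrete-pug-dog-statue-or-memorial"

absolutize('http://etsy.nyh.name/search?q=pug',
           'http://www.etsy.com/listing/108211456')
# => "http://www.etsy.com/listing/108211456"
```

The result is only as good as the base URL that gets passed in, which is why the choice of site url matters.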

henrik commented 11 years ago

I should have been clearer: it's the links inside the html content. Click the image in any post and it should have the problem described.

benubois commented 11 years ago

Hmm, in the html content I'm still seeing links being rewritten correctly. Feedbin uses Feedzirra for feed parsing and here's the website url that Feedzirra returns for this feed:

irb(main):001:0> feed = Feedzirra::Feed.fetch_and_parse 'http://etsy.nyh.name/search?q=pug';
irb(main):002:0* feed.url
=> "http://etsy.nyh.name/search?q=pug"

What should Feedzirra look at to determine the correct site url for the feed?

[screenshot, 2013-06-03 at 2:43:00 pm]

henrik commented 11 years ago

Hm, this is weird. I see a few different versions in feeds from the same site (I made the scraper/feeder) but with different queries.

In one feed for a "syrup pitcher" search, it doesn't add any host, so it uses Feedbin:

[screenshot, 2013-06-03 at 23:54:48]

In another feed for a "pug" search, it uses the correct etsy.com host:

[screenshot, 2013-06-03 at 23:57:21]

And in another feed for the same "pug" search that I just re-added (apparently you can have the same feed twice), I get the host of the feed site:

[screenshot, 2013-06-03 at 23:57:54]

Pretty weird. Maybe that part of the code/library used has changed recently, and there are some old caches around?

henrik commented 11 years ago

Never mind about the "same feed twice" - they had different params. These are the URLs of each feed:

benubois commented 11 years ago

Thanks for the feed urls. Here's what I'm seeing:

http://etsy.nyh.name/search?q=syrup%20pitcher&view_type=gallery&ship_to=US

For this one I'm getting an exception when trying to URI.join the site url with the entry url:

irb(main):004:0> URI.join 'http://etsy.nyh.name/search?q=syrup%20pitcher&view_type=gallery&ship_to=US', '/listing/105720881/cobalt-blue-glass-pitcher?ref=sr_list_30&amp;sref=&amp;ga_search_query=syrup+pitcher&amp;ga_ship_to=US&amp;ga_order=date_desc&amp;ga_page=1&amp;ga_view_type=list&amp;ga_search_type=all&amp;ga_facet=syrup pitcher'
URI::InvalidURIError: bad URI(is not URI?): /listing/105720881/cobalt-blue-glass-pitcher?ref=sr_list_30&amp;sref=&amp;ga_search_query=syrup+pitcher&amp;ga_ship_to=US&amp;ga_order=date_desc&amp;ga_page=1&amp;ga_view_type=list&amp;ga_search_type=all&amp;ga_facet=syrup pitcher
    from /Users/ben/.rbenv/versions/2.0.0-p195/lib/ruby/2.0.0/uri/generic.rb:1203:in `rescue in merge'
    from /Users/ben/.rbenv/versions/2.0.0-p195/lib/ruby/2.0.0/uri/generic.rb:1200:in `merge'
    from /Users/ben/.rbenv/versions/2.0.0-p195/lib/ruby/2.0.0/uri/common.rb:237:in `each'
    from /Users/ben/.rbenv/versions/2.0.0-p195/lib/ruby/2.0.0/uri/common.rb:237:in `inject'
    from /Users/ben/.rbenv/versions/2.0.0-p195/lib/ruby/2.0.0/uri/common.rb:237:in `join'
    from /Users/ben/.rbenv/versions/2.0.0-p195/lib/ruby/2.0.0/uri/common.rb:785:in `join'
    from (irb):3
    from /Users/ben/Sites/feedbin/vendor/bundle/gems/railties-4.0.0.rc1/lib/rails/commands/console.rb:90:in `start'
    from /Users/ben/Sites/feedbin/vendor/bundle/gems/railties-4.0.0.rc1/lib/rails/commands/console.rb:9:in `start'
    from /Users/ben/Sites/feedbin/vendor/bundle/gems/railties-4.0.0.rc1/lib/rails/commands.rb:66:in `<top (required)>'
    from ./bin/rails:4:in `require'
    from ./bin/rails:4:in `<main>'

This works fine if the search term is URL-encoded (syrup%20pitcher vs syrup pitcher):

irb(main):004:0> URI.join 'http://etsy.nyh.name/search?q=syrup%20pitcher&view_type=gallery&ship_to=US', '/listing/105720881/cobalt-blue-glass-pitcher?ref=sr_list_30&amp;sref=&amp;ga_search_query=syrup+pitcher&amp;ga_ship_to=US&amp;ga_order=date_desc&amp;ga_page=1&amp;ga_view_type=list&amp;ga_search_type=all&amp;ga_facet=syrup%20pitcher'
=> #<URI::HTTP:0x007fd642068270 URL:http://etsy.nyh.name/listing/105720881/cobalt-blue-glass-pitcher?ref=sr_list_30&amp;sref=&amp;ga_search_query=syrup+pitcher&amp;ga_ship_to=US&amp;ga_order=date_desc&amp;ga_page=1&amp;ga_view_type=list&amp;ga_search_type=all&amp;ga_facet=syrup%20pitcher>

Not sure if this is something Feedbin is doing wrong, but it does not make any attempt to encode values before calling URI.join.

After the exception is thrown, Feedbin just defaults to the entry url by itself.
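
One way to tolerate hrefs like this would be to percent-encode stray whitespace before joining, and only fall back to the bare href if the join still fails. A minimal sketch, assuming a hypothetical `safe_join` helper (this is not Feedbin's actual code):

```ruby
require 'uri'

# Hypothetical helper: join base_url and href, tolerating unencoded whitespace.
def safe_join(base_url, href)
  # Trim the ends and percent-encode any whitespace left inside the href.
  cleaned = href.strip.gsub(/\s/, '%20')
  URI.join(base_url, cleaned).to_s
rescue URI::InvalidURIError
  # Still unparseable: fall back to the href on its own.
  href
end

safe_join('http://etsy.nyh.name/search?q=syrup%20pitcher',
          '/listing/105720881/cobalt-blue-glass-pitcher?ga_facet=syrup pitcher')
# => "http://etsy.nyh.name/listing/105720881/cobalt-blue-glass-pitcher?ga_facet=syrup%20pitcher"
```

This only handles whitespace; other characters that are invalid in a URI would still hit the fallback branch.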

http://etsy.nyh.name/search/vintage?q=pug&view_type=gallery&ship_to=US&min=0&max=0

I think this one is correct because it was originally imported from an OPML file or from a type="application/rss+xml" or type="application/atom+xml" tag on a website. In these cases Feedbin prefers the OPML htmlUrl attribute or the URL of the page that contained the type tag, because it's usually a more reliable way of determining the site url.

http://etsy.nyh.name/search?q=pug

This one is correct in the sense that it's adding the host properly albeit with the wrong hostname. Still not sure how to best get the correct hostname from these feeds.

henrik commented 11 years ago

Good digging.

http://etsy.nyh.name/search?q=syrup%20pitcher&view_type=gallery&ship_to=US

So about the "syrup pitcher", you're saying it's because the "link href" for each post contains an unencoded space? I'll look into fixing the feed not to do that, but I would guess that it's also a good idea for Feedbin to not break on those. I'm sure there's a lot of bad data out there. Almost certain Google Reader didn't choke on these.

http://etsy.nyh.name/search?q=pug

As I understand it, a good place to look for the host, if the host is not present in the entry link href, would be the feed link href that has rel=alternate (or no rel at all, since alternate is the default): http://www.atomenabled.org/developers/syndication/#link
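
That lookup could be sketched roughly like this. It assumes the feed-level link elements have already been parsed into hashes with `:rel` and `:href` keys; the `site_url_for` name, the hash shape, and the sample URLs are made up for illustration and are not part of Feedzirra or Feedbin.

```ruby
require 'uri'

# Hypothetical: pick the feed-level link with rel="alternate" (or no rel at
# all, since alternate is the default) and fall back to the feed URL itself.
def site_url_for(feed_url, links)
  alternate = links.find do |link|
    href = link[:href].to_s
    !href.empty? && (link[:rel].nil? || link[:rel] == 'alternate')
  end
  return feed_url unless alternate

  # The alternate href may itself be relative, so resolve it against the feed URL.
  URI.join(feed_url, alternate[:href]).to_s
end

# Example values only.
links = [
  { rel: 'self', href: 'http://etsy.nyh.name/search?q=pug' },
  { rel: 'alternate', href: 'http://www.etsy.com/search?q=pug' }
]
site_url_for('http://etsy.nyh.name/search?q=pug', links)
# => "http://www.etsy.com/search?q=pug"
```

With no usable alternate link, the sketch returns the feed URL, which matches the behaviour described earlier in the thread.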