tomislav commented 5 years ago

Bridge request

Sadly, National Geographic doesn't have an RSS feed anymore. The bridge should get the most recent articles published on National Geographic.

Also, it would be nice to be able to specify a category you're interested in, e.g. "magazine" only.

General information

Host URI for the bridge (i.e. https://github.com): https://www.nationalgeographic.com
Which information would you like to see?

Get a feed of the most recent articles published on National Geographic.

How should the information be displayed/formatted?

Title Lead image Description

Which of the following parameters do you expect?
- [X] Title
- [X] URI (link to the original article)
- [ ] Author
- [X] Timestamp
- [X] Content (the content of the article)
- [ ] Enclosures (pictures, videos, etc...)
- [ ] Categories (categories, tags, etc...)

Options

[ ] Limit number of returned items
- Default limit: 5
[ ] Load full articles
- Cache articles (articles are stored in a local cache on first request): yes
- Cache timeout (max = 24 hours): 24 hours
[X] Balance requests (RSS-Bridge uses cached versions to reduce bandwidth usage)
- Timeout (default = 5 minutes, max = 24 hours): 5 minutes

logmanoriginal commented 5 years ago

I quickly checked the site, which unfortunately returns no content if JavaScript is disabled. That makes it unusable for RSS-Bridge. However, they do provide an API, which can be used by individuals and open source projects: https://newsapi.org/s/national-geographic-api

They require an attribution link for their contents, which is reasonable and actually a desired outcome for RSS-Bridge as well. Generally, their terms sound reasonable to me. I don't have time to go further into it, but it sure looks like a feasible task to make a Bridge using their API (which is not limited to National Geographic it seems).

https://newsapi.org/pricing

I'm actually impressed :open_mouth:

tomislav commented 5 years ago

All the content that is displayed on their front page is embedded in the HTML as a JavaScript array/dictionary. Maybe it could be scraped with a regex?
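For illustration, here is a minimal Python sketch of that idea (the bridge itself would be written in PHP). The variable name `window.__DATA__` and the sample markup are assumptions made up for this example; the real name and structure used on the site are not given in this thread. A regex only locates the assignment, and the JSON parser then reads the value.

```python
import json
import re

# Hypothetical page snippet; the real variable name and data layout on
# nationalgeographic.com are not stated in this thread.
SAMPLE_HTML = """
<html><body>
<script>window.__DATA__ = {"stories": [{"title": "Leopards", "url": "/animals/leopards/"}]};</script>
</body></html>
"""


def extract_embedded_json(html, variable="window.__DATA__"):
    """Return the value assigned to `variable` in an inline script, or None."""
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (here the trailing ";" and "</script>").
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


stories = extract_embedded_json(SAMPLE_HTML)["stories"]
print(stories[0]["title"])  # prints "Leopards"
```

Letting the JSON decoder find the end of the value avoids having to match nested braces with the regex itself.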

logmanoriginal commented 5 years ago

You are right, it does contain the JSON data. Not sure how I missed that before. I went ahead and made a small bridge from the contents I could find, see #1065. Let me know if this is what you wanted. There are other endpoints from which contents can possibly be extracted (like the one I linked in the PR).

tomislav commented 5 years ago

Thanks! Looks good to me.

About other endpoints, I think people would be most interested in getting a feed of articles published in the magazine. https://www.nationalgeographic.com/magazine/

logmanoriginal commented 5 years ago

I changed the bridge to build a feed from articles in the magazine. Please take a look. How about including full articles? Currently the items in the feed have no content, because there is no content on the original page. Technically it's possible to collect each article, but that takes extra time on each request. Let me know what you think about that.

tomislav commented 5 years ago

IMHO, there should still be a "latest stories" bridge. That's where most of the articles and daily news are posted. Built off https://www.nationalgeographic.com/latest-stories/

But it would be nice to have an additional "magazine only" bridge, for people who are interested only in the big stories.

I don't know if this requires two separate bridges?

About the full articles, I poked around with the web inspector and it seems doable. Only the images would have to be extracted from the tags and rewritten so that they work in RSS readers. Not sure how much of a hassle that is, but this is great as is.

logmanoriginal commented 5 years ago

Thanks for the feedback. I'll see if I can find some time this week to get it done.

> I don't know if this requires two separate bridges?

It's doable in a single bridge, using contexts: https://github.com/RSS-Bridge/rss-bridge/wiki/const-PARAMETERS#level-1---context
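As a rough Python sketch of the idea only (RSS-Bridge declares contexts in a PHP `PARAMETERS` constant, which is not reproduced here): one handler serves both feeds and picks the page to read from the selected context. The context keys and the `feed_url` helper are hypothetical names; the two URLs are the ones mentioned in this thread.

```python
# Hypothetical context names mapped to the two pages discussed above.
ENDPOINTS = {
    "latest-stories": "https://www.nationalgeographic.com/latest-stories/",
    "magazine": "https://www.nationalgeographic.com/magazine/",
}


def feed_url(context="latest-stories"):
    """Return the page to collect for the given context."""
    try:
        return ENDPOINTS[context]
    except KeyError:
        raise ValueError(f"unknown context: {context!r}") from None


print(feed_url("magazine"))
```

Everything after the URL selection (fetching, extracting, building items) can then be shared between the two feeds.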

> About the full articles, I poked around with the web inspector and it seems doable. Only the images would have to be extracted from the tags and rewritten so that they work in RSS readers. Not sure how much of a hassle that is, but this is great as is.

I suppose you mean images have relative links, right? (haven't checked yet) This is easily solvable, using defaultLinkTo.
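To show what resolving relative links amounts to, here is a small Python sketch. It is not how `defaultLinkTo` is implemented; the `absolutize_links` helper and the sample image path are made up for this example, and a real implementation would work on a parsed DOM rather than a regex.

```python
import re
from urllib.parse import urljoin

BASE = "https://www.nationalgeographic.com"


def absolutize_links(html, base=BASE):
    """Rewrite quoted src/href attribute values as absolute URLs."""
    def fix(match):
        attr, quote, value = match.group(1), match.group(2), match.group(3)
        # urljoin leaves already-absolute URLs untouched.
        return f"{attr}={quote}{urljoin(base, value)}{quote}"

    return re.sub(r'\b(src|href)=(["\'])(.*?)\2', fix, html)


print(absolutize_links('<img src="/content/photo.jpg">'))
# prints <img src="https://www.nationalgeographic.com/content/photo.jpg">
```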

logmanoriginal commented 5 years ago

I've added most features. You can now select the topic from a drop-down list and choose to include the full article as well (which can take a while and may not work if the timeout is set too low on your server). Images in the article are not included, however.

Also, there is no timestamp included in the raw data, so feed readers will have to rely on titles to tell items apart.

Let me know if this now works for you.

tomislav commented 5 years ago

Thanks. I just tried it out and it works perfectly. I'll let you know in a few days if there are any issues.

I presume they load images with JavaScript? Bummer.

logmanoriginal commented 5 years ago

> I presume they load images with JavaScript?

To be honest I haven't checked yet. Lead images are simply provided in the JSON data. For full articles the current filter only covers text. I'll take another look, maybe images can be extracted the same way.

logmanoriginal commented 5 years ago

That was easier than I thought. Try the latest version, it includes images for full articles.

tomislav commented 5 years ago

Thanks, I'll try it.

One thing that I noticed is that I'm getting duplicated articles in my RSS reader (Feedbin). Are the uids on the articles changing when they update the page? I've tried commenting out the uid assignment line so it relies on URIs, to see if it makes a difference.

tomislav commented 5 years ago

I can confirm I'm no longer getting duplicates after I removed the following line:

$item['uid'] = $story['id'];

Otherwise, it's working great. I appreciate it a lot.
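A small Python sketch of why relying on the URI helps, assuming the ID in the source data really does change between fetches (I have not verified that, it is only what the duplicates suggest). The `stable_item_id` helper, the `source_id` field, and the example URL are hypothetical.

```python
import hashlib


def stable_item_id(item):
    """Derive an identifier from the item's URI only."""
    return hashlib.sha1(item["uri"].encode("utf-8")).hexdigest()


# Same article seen on two fetches, with a source ID that changed in between.
first_fetch = {"uri": "https://www.nationalgeographic.com/magazine/example/", "source_id": "abc-1"}
second_fetch = {"uri": "https://www.nationalgeographic.com/magazine/example/", "source_id": "abc-2"}

# The URI-based ID stays the same, so a reader sees one item, not two.
print(stable_item_id(first_fetch) == stable_item_id(second_fetch))  # prints True
```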

Some "Maybe/Someday" things that I wanted to write down for reference:

- Include image captions below images
- Include "hero" images and carousels (at the top of the page, above the article title)
- Include carousels inside article text

logmanoriginal commented 5 years ago

Great, I'm glad this is working for you!

I removed the uid and included image captions. What do you mean by "hero" images and carousels?

There are carousels mentioned in the JSON data, but from what I can tell they are placed below the content and not above it. Maybe I'm looking at the wrong data. It would be great if you could share a screenshot to illustrate what you mean.

tomislav commented 5 years ago

Hero image (circled red) https://www.nationalgeographic.com/animals/2019/03/leopards-coexist-hindu-community-india/

also see: https://www.nationalgeographic.com/environment/2019/03/sabetta-yamal-largest-gas-field/

Hero carousel https://www.nationalgeographic.com/travel/lists/food-and-drink/worlds-best-food-cities/

Carousel in article https://www.nationalgeographic.com/environment/2019/03/whale-dies-88-pounds-plastic-philippines/

logmanoriginal commented 5 years ago

Thanks for the screenshots. Hero images were already included as enclosures. I just added support for hero carousels at top (added to enclosures) and in the article (added to contents).

Find the latest version at https://github.com/RSS-Bridge/rss-bridge/pull/1065

Does that work for you?

tomislav commented 5 years ago

Thank you. I don’t think any RSS reader displays enclosures, so hero images and carousels should probably go directly into the content (top) with their corresponding captions.
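Roughly what I have in mind, as a Python sketch rather than actual bridge code. The `prepend_hero_images` helper and the `url`/`caption` field names are made up for this example and do not reflect the site's JSON.

```python
from html import escape


def prepend_hero_images(content, images):
    """Put each image, with its caption, at the top of the item content."""
    figures = []
    for image in images:
        figures.append(
            '<figure><img src="{}"><figcaption>{}</figcaption></figure>'.format(
                escape(image["url"], quote=True),
                escape(image.get("caption", "")),
            )
        )
    return "".join(figures) + content


html = prepend_hero_images(
    "<p>Article text</p>",
    [{"url": "https://example.org/hero.jpg", "caption": "A leopard at dusk"}],
)
print(html)
```

That way readers that ignore enclosures still show the lead image and its caption before the article text.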

logmanoriginal commented 5 years ago

This was added, so I'm going to merge this now. Please open a new issue if further changes are necessary.

RSS-Bridge / rss-bridge

Bridge request for National Geographic #1029
