Cloudkibo / KiboPush


Investigate News Bots and messenger features that are useful for Publishers. #6652

Closed saniasiddiqui closed 4 years ago

saniasiddiqui commented 4 years ago

The task includes investigating features that will help news publishers engage with their audience on Messenger. Look at how the bots for the following organizations interact with their audiences:

Here is a document describing the broadcast use cases for news publishers: https://docs.google.com/document/d/1HcZmv1rpNqwhQWUbFzPpcMZRWyULfGZZXvhnvG75nuA/edit#

saniasiddiqui commented 4 years ago

I am working on the following document (it is in progress now): https://docs.google.com/document/d/1Fc6GzRGXDZ1zH6iVXe5ssDQQwJcIVcFpMd0rcP-ShEg/edit?usp=sharing

saniasiddiqui commented 4 years ago

I have added 3 use cases to the News Broadcast Use Cases document by Imran:

  1. News Publisher connects RSS feeds with Kibopush.
  2. User asks subscribers to subscribe for daily updates.
  3. Subscriber subscribes for selected topics.

https://docs.google.com/document/d/1HcZmv1rpNqwhQWUbFzPpcMZRWyULfGZZXvhnvG75nuA/edit#heading=h.hsm0m07jal7x FYI @ImranBinShoukat @sojharo
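The RSS side of these use cases can be sketched with a short example. This is a minimal, hypothetical illustration of pulling items from a publisher's feed so they can be broadcast to subscribers; it uses only Python's standard library, and the sample feed content and field names are assumptions, not KiboPush code:

```python
import xml.etree.ElementTree as ET

# Sample feed content; in the real feature this XML would be fetched
# from the RSS URL that the publisher connects to KiboPush.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Story one</title>
      <link>https://example.com/story-one</link>
    </item>
    <item>
      <title>Story two</title>
      <link>https://example.com/story-two</link>
    </item>
  </channel>
</rss>"""

def parse_feed_items(rss_xml, limit=10):
    """Extract up to `limit` (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if title and link:
            items.append({"title": title, "link": link})
        if len(items) >= limit:
            break
    return items

if __name__ == "__main__":
    for entry in parse_feed_items(SAMPLE_RSS):
        print(entry["title"], "->", entry["link"])
```

Each extracted item could then be turned into one element of a Messenger broadcast (for example, a card in a carousel) for the daily-update and topic-subscription use cases above.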

saniasiddiqui commented 4 years ago

I have included the technical details (database changes and tasks required) in the document.

saniasiddiqui commented 4 years ago

I have checked with several publishers, and almost all of the big names have RSS feeds. https://blog.feedspot.com/world_news_rss_feeds/

Chatfuel has a publisher template by TechCrunch; it is the basis for the official TechCrunch bot. They also use RSS feeds for extracting news stories, and a number of feeds can be added and linked to the bot. (image)

I also tried searching for web crawlers that we could implement in Node.js. Web crawlers basically work with the DOM that a URL returns: they parse the markup and provide functions for traversing and manipulating the resulting data structure. Every website has a different structure, so we would need to parse each website differently. There is no common standard, so it will not be possible to get content from all websites through a single web-crawler logic. FYI @jekram @sojharo @ImranBinShoukat
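To illustrate why one crawler logic cannot cover every site: each publisher needs its own parsing rules for the markup it returns. A small sketch using Python's built-in `HTMLParser`, where the tag and class name to match are per-site assumptions (the class `story-link` below is made up for illustration):

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect link text from <a> tags carrying a site-specific class.

    The class name is a per-site parsing rule: every publisher's markup
    differs, so this value (and often the tag itself) must be configured
    separately for each website."""

    def __init__(self, headline_class):
        super().__init__()
        self.headline_class = headline_class
        self.headlines = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if tag == "a" and self.headline_class in classes.split():
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.headlines.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._capturing = False

# Markup with a made-up class name standing in for one site's structure.
html = '<div><a class="story-link">Top story</a><a href="#">Menu</a></div>'
parser = HeadlineParser("story-link")
parser.feed(html)
print(parser.headlines)  # only the headline anchor is captured
```

A different newspaper would need a different `headline_class` (or entirely different traversal logic), which is exactly the obstacle to a single universal crawler.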

saniasiddiqui commented 4 years ago

https://docs.google.com/document/d/1HcZmv1rpNqwhQWUbFzPpcMZRWyULfGZZXvhnvG75nuA/edit?usp=sharing

I have created the design document. Assigning @sojharo to review.

saniasiddiqui commented 4 years ago

Updated the document after the discussion in the meeting. https://docs.google.com/document/d/1HcZmv1rpNqwhQWUbFzPpcMZRWyULfGZZXvhnvG75nuA/edit#heading=h.2lkralk09wbo

Please review, @sojharo. Thanks

saniasiddiqui commented 4 years ago

I have updated the document with technical details and tasks. Assigning @sojharo to review. Thanks

saniasiddiqui commented 4 years ago

https://docs.google.com/document/d/1HcZmv1rpNqwhQWUbFzPpcMZRWyULfGZZXvhnvG75nuA/edit#

saniasiddiqui commented 4 years ago

I have reviewed the technical details of the document with @ImranBinShoukat and have opened the milestone with all the issues here: https://github.com/Cloudkibo/KiboPush/milestone/104

sojharo commented 4 years ago

I am going to reopen this issue as per the discussion with Sir on Slack. We need to do further design work for those newspapers which do not have RSS. One option that Sir suggested is to have a UI where the admin can give the links to news items by section.

sojharo commented 4 years ago

I have worked on the design document for v2 of publishers today. I have proposed two solutions to the given problem in the document. Please go through it.

https://docs.google.com/document/d/1u5NMI-_h-rAVFSnx4F_9AqlOH7B5g_HfsxvRJiOzaMA/edit#

This is under construction. I will add a few more mockups and then define the technical part of the design once we have decided which solution we want to implement.

sojharo commented 4 years ago

Here is the list of the top 20 newspapers in the USA:

https://docs.google.com/spreadsheets/d/1j4aWOhEg-5NWJwfnCglxXrNPJxQ8eVJhG8ANQApjMmk/edit#gid=0

sojharo commented 4 years ago

I worked on this issue yesterday. As per our discussion, I updated the top-20 news list with UI layouts and noted whether all of them are vertical or not.

https://docs.google.com/spreadsheets/d/1j4aWOhEg-5NWJwfnCglxXrNPJxQ8eVJhG8ANQApjMmk/edit#gid=0

After this, I read up on creative commons for news websites:

https://wiki.creativecommons.org/wiki/journalism

Also, I went through their newspapers and saw what types of news they are sharing. I think we can easily source news from them.

After this, for our own news website, I looked into WordPress news themes and also one CMS that can be used to create news websites. I think WordPress or Drupal are the better options, and we can then put a news theme on them. The one other news CMS that I found looks old:

http://www.prosepoint.org/

Mission News is a free WordPress theme for news websites and is good enough to start with:

https://www.competethemes.com/mission-news/

A step-by-step guide to creating a news website:

https://www.competethemes.com/blog/make-news-website/

After this, I continued my work on the design document, which is under construction. As a next step, I will continue work on the design document for solution 1 (the web scraper).

jekram commented 4 years ago

@sojharo

  1. Is there a repeatable pattern (or patterns) in your analysis? We should automate it if a few repeatable patterns cover 90% of the use cases.

  2. Did you look at the third-party tools?

  3. Should we not give him a simple form to put in 10 links (should we start with this)?

For the newspaper website, give me an estimate for two things:

  1. What effort will it take to create it?
  2. What effort will it take to update it daily?

sojharo commented 4 years ago

As per our discussion on Slack, I am going to complete the design document for both the manual and the automatic way of putting in news. However, we will first start by implementing the manual way with 10 links per section.

  1. There seemed to be repeatable patterns with just different CSS, but I am not sure how web scrapers will work with them. I need to look at web scrapers today.

  2. I saw a few web scrapers yesterday but need to look at them thoroughly and test-run a few.

  3. Agreed.

For creating a news website:

  1. It is hardly two days' effort to set up a WordPress site and put a news theme on it.
  2. 30 to 45 minutes on a daily basis, where the work includes getting 5 to 6 articles from news sources, putting them on our WordPress site as new articles, and citing the sources at the end. I am assuming that we will only go for one news category, such as politics.

sojharo commented 4 years ago

I worked on the design document for both solutions 1 and 2, and it is complete now. I have also completed the mockups plus the database changes required.

After this, I also read up on and investigated a couple of open-source web scrapers. I have discussed my investigation in the document as well.

https://docs.google.com/document/d/1u5NMI-_h-rAVFSnx4F_9AqlOH7B5g_HfsxvRJiOzaMA/edit#

sojharo commented 4 years ago

I have defined and opened all the issues required for the manual news integration work in https://github.com/Cloudkibo/KiboPush/milestone/108

I will use this task to work on testing of web scrapers now.

sojharo commented 4 years ago

I worked on testing the web scraper; this is still under construction. I was having a couple of Python library installation issues. I have solved most of them now, but I am stuck with the following issue, which I am trying to solve.

Screenshot 2020-01-29 at 9 24 47 AM

sojharo commented 4 years ago

I was able to solve the Python library installation problems today, and then I tested the following web scraper:

https://github.com/codelucas/newspaper

I tested it on the following newspapers:

  1. www.cnn.com
  2. www.wsj.com
  3. www.dawn.com

I was able to fetch the articles for these newspapers using this scraper, and was also able to fetch the list of categories for CNN only (CNN is given as an example in the library's documentation).

When I tried to fetch the news by category, it was not fetching them. Maybe this is a limitation of the web scraper, or I did not understand how to fetch category-wise news. However, it does fetch all the articles from the whole website, which means articles from all categories are being fetched.

Also, the article links are ordered alphabetically, so we have to inspect them to know which ones are on the home page. I was able to inspect and see which articles were fetched from the Americas section of the CNN site.

Here is an individual report on these three newspapers for this web scraper.

CNN

Screenshot 2020-01-29 at 5 01 27 PM

It seems to have worked fine on CNN and has brought in the news articles from the whole website, from all sections. As they are ordered alphabetically, I am able to see the news articles from the "americas" section first.

WSJ

Screenshot 2020-01-29 at 4 53 41 PM

For WSJ, the web scraper is also fetching all the articles, but the order is not maintained here. Just like CNN, the order is different from what we have on the home page. We need to go and find which article comes from which page.

Dawn

Screenshot 2020-01-29 at 4 54 46 PM

It is not working correctly on Dawn news and is only fetching 5 articles, which are also not the top ones. One of them does appear on the home page, but the others were difficult to find there.

As a next step, I am going to look into the remaining two web scrapers to see how they work.

https://github.com/je-suis-tm/web-scraping (supports a number of newspapers)
https://github.com/mylescarey2019/NewsScraper (only for the LA Times)

sojharo commented 4 years ago

I investigated the following two remaining web scrapers.

  1. https://github.com/je-suis-tm/web-scraping (supports a number of newspapers)
  2. https://github.com/mylescarey2019/NewsScraper (only for the LA Times)

The first one caused a lot of problems, and the code was completely buggy. I tried several ways to make it work, but most of the lines in all the scraper files were buggy, and I could not execute this first web scraper. It is written in Python.

The second one is written in Node.js. I was able to run it, and I was also able to understand the scraper's code to see how it works. From this understanding of the code, I think we can create more custom web scrapers. This one works on the LA Times newspaper. We can also see its demo here:

https://powerful-earth-53088.herokuapp.com

I think we have now seen these web scrapers and have enough understanding of how scrapers work. Also, we have a few web scrapers which work out of the box for three newspapers. So, if we want to implement solution 1 (web-scraper-based automation), we can at least show a demo to our customers with one of the web scrapers we already have in working condition. When we get a new customer which is a newspaper that doesn't have RSS and can't do manual entry, we can then create a scraper for their website based on these.

jekram commented 4 years ago

Among the three, which web scraper would you propose?

Should we build the logic to give the user the option to do either manual, automated, or semi-automated?

Automated means they do it once and are happy with it. Manual is completely under their control: they provide 10 links. Semi-automated means they log in every day, push a button to get the 10 links filled in, and can then edit them. To keep things simple, we should just do one carousel. Once we have experience, we can enhance it to support multiple topics.

sojharo commented 4 years ago

I would propose the one written in Node.js, which works on the LA Times. We can easily modify it to use for other newspapers as well. I have also got a good understanding of its code and will be able to modify it for multiple newspapers.

Should we build the logic to give the user the option to do either manual, automated, or semi-automated?

For now, we have the following two ways of doing newspaper integration:

  1. RSS feed integration (the one that Sania and Anisha did)
  2. Manual news integration (under construction, designed by me)

Both of these have their own separate screens today so that they don't get mixed up with each other. The newspaper admin can go to either RSS or manual.

The above two options are automated and manual.

Now, for the semi-automated way, we can make use of the web scraper. On a button click, it should run the web scraper, fetch 10 links, and fill them into the UI. The admin should be able to edit the links.

We can merge it into manual: all of the screens of the current manual integration will stay the same, and we will just add a button on top to fetch the links automatically. This will first check whether we have a web scraper available for the admin's newspaper. If one is available, we will use it; otherwise they will be shown a message asking them to contact the KiboPush team to have a web scraper developed for their newspaper.

In this way, even with the web scraper, the admin will have control over what is being fetched.
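The availability check and fallback message described above can be sketched as a simple registry lookup. This is a hypothetical illustration, not the actual KiboPush code: the function names and the stub scraper (which returns canned links instead of fetching the live site) are assumptions made to keep the example self-contained.

```python
def scrape_dawn(limit=10):
    """Stub standing in for a real per-site scraper.

    A real scraper would fetch and parse the newspaper's pages; here
    we return canned links so the dispatch flow can be demonstrated."""
    return [f"https://www.dawn.com/news/{i}" for i in range(1, limit + 1)]

# Hypothetical registry mapping newspaper domains to scraper functions.
SCRAPERS = {
    "dawn.com": scrape_dawn,
    # further domains would be added as scrapers are developed
}

def fetch_links(domain, limit=10):
    """Handle the semi-automated button click for a given newspaper.

    Returns (links, message): if a scraper exists for the domain, up to
    `limit` links are prefilled for the admin to edit; otherwise an
    explanatory message is returned instead."""
    scraper = SCRAPERS.get(domain)
    if scraper is None:
        return [], ("No scraper is available yet. Please contact the "
                    "KiboPush team to have one developed for your newspaper.")
    return scraper(limit), ""

links, message = fetch_links("dawn.com")
print(len(links))  # 10 prefilled links for the admin to review and edit
```

The admin-editable list is the key design point: the scraper only prefills the same form the manual flow already uses, so nothing is published without the admin's review.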

jekram commented 4 years ago

When can we meet to discuss this? We have too many implementations; we need to rationalize them into a few.

sojharo commented 4 years ago

As per our discussion, I have drafted the following email that we can send to publishers to seek their permission. I have tried to keep the email short while covering all the important points. I have also sent it to your email.

Hello,

I hope you are doing well. I am writing to inform you that we are a non-profit online content curation and publishing agency aiming to take news articles on the topic of politics and republish them on our platform, with references linking back to the original source.

We aim to use only content published under a Creative Commons license. As we understand and highly value intellectual property rights, it is our policy to obtain prior permission from source publishers before reposting their content on our website.

If you give us permission to republish articles from your Politics section on our website, we would be much obliged and will always link back to your website for each article we use. At the top of each article, it will be clearly stated that the article was originally published on your website.

We aim to follow the Creative Commons journalism model described here:

https://wiki.creativecommons.org/wiki/journalism

Under the above licenses, if the authors of your newspaper are willing to share their articles with us, we will be happy to connect with them through your platform. This will enable us to keep the spirit of open journalism alive.

Looking forward to hearing from you.

Thank you so much.

Please let me know what we can change in it.

sojharo commented 4 years ago

I worked on this issue today and am able to successfully fetch the articles from the headlines of dawn.com. I have also captured and added the article picture, which was not there in the original code.

I have captured the top 10 news items in the order listed on dawn.com. Please see the screenshots from the Dawn home page and the web scraper page.

Home Page

Screenshot 2020-02-06 at 4 53 36 PM

Web Scraper

Screenshot 2020-02-06 at 4 52 28 PM

With this, I think we can now implement a demo of the semi-automated option for Dawn news. If you want, I can try doing it for one or two other news websites as well.

As a next step, I can work on updating the design document to add the design for the semi-automated option as well. It can simply be an upgrade to our existing manual news integration feature. That feature is also complete now; I just did the staging merge for it, and it will be tested by the team either tomorrow or on Monday.

Please suggest on next steps here.

sojharo commented 4 years ago

As per our discussion, I have worked on the web scraper and created two more scrapers: one for the Wall Street Journal and one for USA Today.

Now we have scrapers implemented for three newspapers. 95% of the code is common between them and only 5% differs. The main difference between these newspapers comes from the HTML structure and CSS classes used on their websites: some use the `<article>` HTML element and some use `<a>` elements. Some also have headlines shown as trending news at the top, which we have to ignore; other newspapers don't have trending news.

This is the only reason the code differs between the three scrapers: scrapers parse the HTML and CSS classes to capture data from the pages.

This is how the WSJ scraper looks. WSJ has some articles for which no pictures are given.

Scraper UI

Screenshot 2020-02-07 at 1 07 30 PM Screenshot 2020-02-07 at 1 08 06 PM

WSJ website

Screenshot 2020-02-07 at 1 07 46 PM

This is how the USA Today scraper looks.

USA Today website

Screenshot 2020-02-07 at 2 04 55 PM

USA Today web scraper

Screenshot 2020-02-07 at 2 05 10 PM

With just a 5% change, we are able to create a web scraper for each of the websites. If all of the newspapers had the same HTML structure for articles, we would not even have to make this 5% change.
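The 95/5 split described above can be captured by keeping one shared scraping core and isolating the per-site 5% in a small config. A hypothetical sketch; the selector values and the pre-parsed element tuples below are illustrative assumptions, not the actual classes these sites use:

```python
# Per-site config: the ~5% that differs between newspapers.
SITE_CONFIGS = {
    "wsj":      {"tag": "article", "headline_class": "wsj-headline",
                 "skip_trending": False},
    "usatoday": {"tag": "a",       "headline_class": "gnt-headline",
                 "skip_trending": True},
}

def extract_headlines(elements, config):
    """Shared core: filter parsed elements using a per-site config.

    `elements` is a list of (tag, css_class, text, is_trending) tuples
    standing in for a parsed DOM; a real scraper would produce these by
    parsing the page's HTML."""
    out = []
    for tag, css_class, text, is_trending in elements:
        if tag != config["tag"] or css_class != config["headline_class"]:
            continue
        if config["skip_trending"] and is_trending:
            continue  # drop trending headlines for sites that show them
        out.append(text)
    return out

wsj_like = [
    ("article", "wsj-headline", "Markets rally", False),
    ("a", "nav-link", "Subscribe", False),
]
usatoday_like = [
    ("a", "gnt-headline", "Trending now", True),
    ("a", "gnt-headline", "Local story", False),
]

print(extract_headlines(wsj_like, SITE_CONFIGS["wsj"]))            # ['Markets rally']
print(extract_headlines(usatoday_like, SITE_CONFIGS["usatoday"]))  # ['Local story']
```

Adding a new newspaper then means adding one config entry (and occasionally a small custom rule), rather than writing a whole new scraper.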

sojharo commented 4 years ago

I have created a plan for news publishing work here:

https://docs.google.com/document/d/16MdWFhi-pXpFoOngBFDDNuOPZHmfqfUuJP66Ki5s7l4/edit#

sojharo commented 4 years ago

I had a team meeting with Sania and Anisha, and we discussed the process of how we should publish on a daily basis. I have updated the document with options, and we can decide on any one of them.

https://docs.google.com/document/d/16MdWFhi-pXpFoOngBFDDNuOPZHmfqfUuJP66Ki5s7l4/edit#heading=h.7rxs3safga1s

Instructions on how to publish articles are still under construction in this document.

sojharo commented 4 years ago

I have completed the tutorial on how to publish articles on our news website. It has step-by-step instructions with screenshots:

https://docs.google.com/document/d/16MdWFhi-pXpFoOngBFDDNuOPZHmfqfUuJP66Ki5s7l4/edit#heading=h.kvdapvso8vvy

We can close this issue now.