Open Aadityaa2606 opened 8 months ago
claim
The fix proposed by @rnavaneeth992 is a good approach, but it does not catch every irrelevant link, so I am reopening the issue for other contributors to make additional improvements to the detection system on top of the existing approach!
Explanation of the Fix:
The previous fix added a try-catch block after querying the URL; it raises an HTTP error if the request returned an unsuccessful status code. This means that if the link does not produce an HTTP error, it is never flagged as irrelevant.
A second check was also added: if the extracted summary content is empty, the URL is treated as invalid.
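The two existing checks could be sketched roughly as follows. This is a minimal sketch, not the actual code from the repo; the function name and signature are hypothetical, assuming the checks boil down to the HTTP status code and the extracted summary text:

```python
def passes_existing_checks(status_code: int, summary: str) -> bool:
    """Hypothetical sketch of the current detection logic.

    Check 1: the HTTP request must have succeeded (this is what the
             try-catch around the raised HTTP error amounts to).
    Check 2: the extracted summary content must be non-empty.
    """
    # Check 1: anything outside the 2xx/3xx range counts as an HTTP error
    if not (200 <= status_code < 400):
        return False
    # Check 2: empty summary content means the URL is treated as invalid
    return len(summary.strip()) > 0
```

As the examples below show, a page like https://github.com/ passes both checks: it returns 200 and has plenty of text to "summarise", which is exactly why these checks alone are not enough.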
Additional improvements that can be made:
Right now the existing approach catches a few links, like www.google.com, and prevents them from being summarised. However, sites like https://www.linkedin.com/feed/, https://github.com/, https://www.udemy.com/, and many more still slip through.
We need a concrete method that separates news articles from ordinary websites, to prevent irrelevant results and make the web application more reliable.
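One possible heuristic (not part of the existing code; all names below are hypothetical) is to inspect the fetched HTML for article signals that most news sites emit, such as an `<article>` element or an Open Graph `og:type="article"` meta tag. A stdlib-only sketch:

```python
from html.parser import HTMLParser

class ArticleSignalParser(HTMLParser):
    """Scans HTML for common signals that a page is a news article:
    an <article> element, or an Open Graph og:type="article" meta tag."""

    def __init__(self):
        super().__init__()
        self.has_article_tag = False
        self.og_type_article = False

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.has_article_tag = True
        if tag == "meta":
            d = dict(attrs)
            if d.get("property") == "og:type" and d.get("content") == "article":
                self.og_type_article = True

def looks_like_article(html: str) -> bool:
    # Returns True if either signal is present in the page source
    parser = ArticleSignalParser()
    parser.feed(html)
    return parser.has_article_tag or parser.og_type_article
```

This would reject landing pages and feeds (e.g. https://www.linkedin.com/feed/) that lack these markers, while most published news articles carry at least one of them. It is only a heuristic, so it could be combined with the existing status-code and empty-summary checks rather than replacing them.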
bug Description: If the user inputs any link (e.g. www.google.com), the summariser thinks it is an article link and summarises it.
To Reproduce Steps to reproduce the behavior:
Expected behavior: Prevent accepting irrelevant links; if the user tries to submit an irrelevant link, show them an error similar to this
Bug Screenshots
Possible approaches