atharva434 / INCF-Impact-visualization-Portal


Finding a new scraping API #28

Open atharva434 opened 3 months ago

atharva434 commented 3 months ago

The Google SERP API was used to get real-time information, but it has its own set of limitations, so finding alternatives is important. Just opening this issue for an open discussion on the same.

surajgajul commented 3 months ago

On it sir.

surajgajul commented 3 months ago

@atharva434 Sir, Zenserp could be an alternative. Compared with the 100 searches/month limit of Google's SERP API, Zenserp offers only 50 searches/month, but it also provides limited competitor analysis for SEO. Apify also offers a free tier with 5 USD of credits per month, and I think it offers web scraping facilities too. Need to look into it a bit more, though. I'll try looking for some other services too.

OmKhare commented 3 months ago

@atharva434 Hey, I looked into the web scraping API issue. Google's own Custom Search API has an option to return JSON output. It takes the same kind of parameters (API key, location, exact match, etc.) and returns JSON. You can have a look at this; I tried giving the same query used in the application.

https://www.googleapis.com/customsearch/v1?key=AIzaSyAL6pN9G0fgfgd0RxrGGWLqoRFXrO_Dd6A&cx=1592e994325084760&q=no+of+people+suffering+from+cancer+in+the+world

Pricing: the Custom Search JSON API provides 100 search queries per day for free, which is a lot better than the SERP API.

Basically, Google Custom Search returns results only for Google Search, in JSON format, with minimal but sufficient scraped information, while the SERP API can scrape results from any search engine and has more features for capturing information.
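For reference, a rough Python sketch of what calling the Custom Search JSON API could look like, using only the standard library. The endpoint and the `key`, `cx`, `q` and `num` parameters are the ones the API documents; the function names, and the choice to keep only `title`, `link` and `snippet` from each item, are assumptions for illustration and not code from the application.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CUSTOM_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key: str, engine_id: str, query: str, num: int = 5) -> str:
    """Build the request URL. `num` is the number of results (the API allows 1 to 10)."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return f"{CUSTOM_SEARCH_ENDPOINT}?{urlencode(params)}"


def extract_results(payload: dict) -> list[dict]:
    """Keep only title, link and snippet from each entry in the response's `items` list."""
    return [
        {
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "snippet": item.get("snippet", ""),
        }
        for item in payload.get("items", [])  # `items` is absent when there are no hits
    ]


def search(api_key: str, engine_id: str, query: str, num: int = 5) -> list[dict]:
    """Run one query against the API (this counts against the daily free quota)."""
    with urlopen(build_search_url(api_key, engine_id, query, num), timeout=10) as resp:
        return extract_results(json.load(resp))
```

Usage would be something like `search(API_KEY, ENGINE_ID, "no of people suffering from cancer in the world")`, with the key and engine ID read from the environment rather than hard-coded.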

surajgajul commented 3 months ago

Seconding this, but won't it limit the results to the custom search engine (like a couple of sites explicitly entered)? Or does the free tier offer searching across the web? But yeah, if not third-party services, using Google's Custom Search API is the safer option. Another way is to scrape directly, so I tried the same thing with BeautifulSoup, but it's not very ethical, and since we are aiming to deploy the website, it's better not to violate any terms of service.

atharva434 commented 3 months ago

Hi, went through all of your suggestions. Google Custom Search sounds promising. Just putting in a little more code for it might reduce the cost significantly. We can't scrape with BeautifulSoup directly; it gives an access denied exception, which is why an API is needed.
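One possible reading of "a little more code" is caching repeated queries so that each distinct query is sent to the API only once and the free quota lasts longer. This is a minimal in-memory sketch under that assumption; `CachedSearch` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever function actually calls the search API.

```python
from typing import Callable


class CachedSearch:
    """Wrap a search function so identical queries hit the API only once."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch  # hypothetical: any function that calls the search API
        self._cache: dict[str, dict] = {}

    @staticmethod
    def _normalize(query: str) -> str:
        # Treat queries that differ only in case or spacing as the same query.
        return " ".join(query.lower().split())

    def search(self, query: str) -> dict:
        key = self._normalize(query)
        if key not in self._cache:
            # Cache miss: this is the only place that spends quota.
            self._cache[key] = self._fetch(query)
        return self._cache[key]
```

The cache here lives only as long as the process; persisting it to a file or database would be the natural next step for a deployed site, but that is left out to keep the sketch short.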

atharva434 commented 3 months ago

Just out of curiosity, did you guys go through DuckDuckGo search?

surajgajul commented 3 months ago

Yes sir, I do remember seeing it in the code base somewhere, so I tried it out earlier. Had to use an LLM for processing the query, though. You can check this out; it works well but might increase the dependency on the LLM. (Link)

The query given was cancer. Results: According to WHO, cancer is the second leading cause of death globally, and is responsible for an estimated 9.6 million deaths in 2018.

Also ran the query using duckduckgosearch.run(). Results:

In 2022, there were an estimated 20 million new cancer cases and 9.7 million deaths. The estimated number of people who were alive within 5 years following a cancer diagnosis was 53.5 million. About 1 in 5 people develop cancer in their lifetime, approximately 1 in 9 men and 1 in 12 women die from the disease. An individual's cancer risk has a lot to do with other factors, such as age. For instance, an American woman's lifetime risk of developing colon and rectal cancer is about 4%, or about 40 out of every 1,000 women. But her risk of developing colon and rectal cancer before the age of 50 is 0.4%, or about 4 out of every 1,000 women. The National Cancer Institute estimated that in 2020, 1,806,590 new cancer cases would be diagnosed and that 606,520 new deaths from cancer would occur. The rate of new cases of any type of cancer was 442.4 per 100,000 people per year, and the death rate was 155.5 per 100,000 people per year. The NCI breaks down these statistics to determine ... Around one in five people develop cancer in their lifetimes, the International Agency for Research on Cancer (IARC) said in a statement, and one in nine men and one in 12 women die from the ... For most types of tumour, their increase in people under 50 has been relatively modest so far.
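For anyone wanting to reproduce this, a hedged sketch of the call, assuming the `run()` above is LangChain's `DuckDuckGoSearchRun` tool (which needs the `langchain-community` and `duckduckgo-search` packages). The `search_duckduckgo` wrapper, its `tool` parameter and the `max_chars` cut-off are hypothetical additions; the cut-off is just a crude way to shorten the long output without an LLM.

```python
import textwrap


def search_duckduckgo(query: str, tool=None, max_chars: int = 300) -> str:
    """Run a DuckDuckGo search and cut the combined snippets to `max_chars`."""
    if tool is None:
        # Imported lazily so the helper can be exercised with a stub tool
        # when the optional dependency is not installed.
        from langchain_community.tools import DuckDuckGoSearchRun

        tool = DuckDuckGoSearchRun()
    text = tool.run(query)
    # shorten() collapses whitespace and truncates at a word boundary,
    # appending the placeholder only when something was cut off.
    return textwrap.shorten(text, width=max_chars, placeholder=" ...")
```

Truncating this way keeps only the first snippet or two, so it does not pick the most relevant sentence; that part is what the LLM was being used for.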

OmKhare commented 3 months ago

DuckDuckGo search itself is not as advanced as Google's, so the results we get may contain vague answers. Also, the community support for the DuckDuckGo API does not look that good.

atharva434 commented 3 months ago

Oh okay, so DuckDuckGo is still not very useful for us. Even the last time I tried it, it was giving long answers, which is pretty useless. Cool, just wanted to confirm.

surajgajul commented 3 months ago

Yeah, it does work fine when used with an LLM, but that is kind of unnecessary if Custom Search is getting the job done.

VictorUmunna commented 3 months ago

@atharva434, I also want to apply for this. How do I go about it? Interesting discussion.

surajgajul commented 3 months ago

@VictorUmunna Hey, you need to write a proposal and submit it before the 2nd of April. Go to GSoC's official site and search for INCF. You will find a contributor proposal template and the project list there.