Adds built-in Selenium support for Javascript website scraping

emirsahin1 / llm-axe

A simple, intuitive toolkit for quickly implementing LLM powered applications.

MIT License


Adds built-in Selenium support for Javascript website scraping #8

Closed yoni13 closed 3 months ago

yoni13 commented 3 months ago

Adds built-in Selenium support for Javascript website scraping.

This fixes the "The website encountered an error and could not provide the requested data." error that appeared when the bot crawled JavaScript-heavy sites like Quora.

yoni13 commented 3 months ago

Before:

After:

yoni13 commented 3 months ago

Just a reminder that I haven't bumped the version number.

emirsahin1 commented 3 months ago

This is a great addition. However, I think it would be best if it takes advantage of the OnlineAgent's ability to receive a custom website reader rather than making it a parameter on the OnlineAgent.

Basically, the Selenium website reader you wrote should be a standalone website reader in core.py that the user can import and pass to the OnlineAgent.

Example use case with this new approach:

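from llm_axe import OnlineAgent, selenium_reader  # selenium_reader is the proposed new reader
# llm is an already-configured LLM client
agent = OnlineAgent(llm, custom_site_reader=selenium_reader)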

Let me know what your thoughts are. Thanks again!
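For illustration, a standalone reader of the kind suggested here might look roughly like the sketch below. It assumes that a custom site reader is simply a function that takes a URL and returns the page text, that Selenium 4 and a local Chrome install are available, and that a short fixed wait is enough for the page's scripts to run. The `wait_seconds` parameter is hypothetical and is not taken from the PR.

```python
def selenium_reader(url: str, wait_seconds: float = 2.0) -> str:
    """Return the visible text of a page after a real browser has run its JavaScript.

    Assumption: the caller only needs a function that maps a URL to text.
    """
    # Imported lazily so Selenium is only required when this reader is used.
    import time

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Crude fixed wait (hypothetical default) so scripts can populate the page.
        time.sleep(wait_seconds)
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()  # always release the browser process
```

Starting a browser for every page is slow, so a reader like this is probably best reserved for sites that cannot be read any other way.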

yoni13 commented 3 months ago

Hello, I agree that making it a built-in reader function is a better option.

However, my original idea was to use Selenium only when a JavaScript website is detected. If I make a custom selenium_reader and it is used every time, response times will be significantly longer.

Or maybe I should make two functions, selenium_reader and selenium_and_soap_reader?

Let me know what you think, thanks!

emirsahin1 commented 3 months ago

Oh okay. I see what you are saying.

In that case, I think your suggestion of making two functions seems like the best approach. It gives more options to the user which is always great.

Maybe we can call the second one selenium_hybrid_reader?
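The hybrid idea discussed above (try the fast reader first, fall back to the browser only when the page seems to need JavaScript) can be sketched as follows. This is only one possible shape, not the code from the PR: `make_hybrid_reader`, `looks_js_rendered`, the hint phrases, and the 200-character threshold are all hypothetical, and the two readers are passed in as plain functions that map a URL to a string.

```python
import re
from typing import Callable

# Hypothetical heuristics: phrases and a length threshold suggesting that the
# fast reader only received an empty JavaScript shell.
_JS_HINTS = ("enable javascript", "javascript is required", "requires javascript")
_MIN_TEXT_CHARS = 200


def looks_js_rendered(content: str) -> bool:
    """Guess whether a page needs a real browser to show its content."""
    lowered = content.lower()
    if any(hint in lowered for hint in _JS_HINTS):
        return True
    # Drop script/style bodies and remaining tags, then measure what is left.
    without_scripts = re.sub(r"(?s)<(script|style)\b.*?</\1>", " ", lowered)
    visible = re.sub(r"<[^>]+>", " ", without_scripts)
    return len(" ".join(visible.split())) < _MIN_TEXT_CHARS


def make_hybrid_reader(
    fast_reader: Callable[[str], str],
    browser_reader: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a reader that only pays the browser cost when it seems necessary."""

    def hybrid_reader(url: str) -> str:
        try:
            content = fast_reader(url)
        except Exception:
            # The fast path failed outright, so let the browser try.
            return browser_reader(url)
        if looks_js_rendered(content):
            return browser_reader(url)
        return content

    return hybrid_reader
```

Under these assumptions, the returned function could be passed wherever a custom site reader is accepted, with the existing reader as `fast_reader` and a Selenium-based one as `browser_reader`. The detection heuristic is deliberately simple and would likely need tuning against real sites.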

yoni13 commented 3 months ago

> Oh okay. I see what you are saying.
>
> In that case, I think your suggestion of making two functions seems like the best approach. It gives more options to the user which is always great.
>
> Maybe we can call the second one selenium_hybrid_reader?

That's obviously a better name, thanks 👍