Closed lielran closed 6 years ago
Hi @lielran,
This library is using two different scraping approaches, only one of them is using (puppeteer) browser. Visa cal for example inherit from BaseScraper
which doesn't require browser at all is it does direct calls to the server api. The purpose of createScraper
factory function is to hide the detail implementation of the chosen scraper so I'm not sure it is advised to ask the browser instance as part of the library public api.
I assume you want this feature for a specific reason? if so we can easily extend the options argument to include other properties that will be propogated to puppeteer if needed. Can you please elaborate about the use-case you try to achieve?
Thanks! Eran.
I see...I wasn't aware of the API approach.
I still think since you are using createScraper
for each scraper(or bank vendor) it should be possible to init the ones the using puppeteer outside createScraper
function.
puppeteer has a lot of configurations(see this one: https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions) I don't want to expose all of them via the options + keep them all up to date ext...
another point is to reuse the same browser
instance.
the use case is to use israeli-bank-scrapers with the existing flow where I had an instance of the browser(of puppeteer).
puppeteer has a lot of configurations ... don't want to expose all of them via the options + keep them all up to date ext...
in BaseScraperWithBrowser.initialize() the library currently propagate only one property but we can easily extend it to propagate the complete options
instance or even options.puppetter
dedicated object. Since this library is not based on typed languages (like Typescript), it should affect future updates of puppeteer. @eshaham wdyt?
another point is to reuse the same browser instance.
I see, let's wait for @eshaham feedback about this one.
@esakal cool, let's wait to @eshaham.
once we all agree I can work on a PR for that.
@lielran I totally agree with the need for getting better control over the instantiation of the browser instance, both for the reasons you've mentioned, and also for being able to control the version of puppeteer being used. For the first reason, we can easily send any additional props the user adds to the options object directly to the puppeteer launch function. But I think that allowing to inject the browser instance would also solve the second reason I mentioned above.
I actually wanted to take this to the next phase and allow re-using the same browser instance and tabs during a session using a pool. I need to find time to incorporate that.
Anyway, PR welcomed for injecting the browser instance.
@eshaham I'm not sure if it is possible to safely inject the browser instance from outside of the library as the internal services/utils depends heavily on the puppeteer api. A good example is the the api change in puppeteer for $$eval
(#135) that broke version 0.6.0. I can think on many other places in the code that can be affected and it will open the library to problems that will be hard to maintain as you will need to make sure the code keeps backward compatibility to all puppeteer versions somehow.
Both requests can be achieved by passing additional arguments to the puppeteer (it can be an object dedicated to puppeteer that is passed as-is by the user) and by adding extra configuration that will manage different strategies for browser instances.
@esakal - can you explain a bit more why injecting will cause issues? from what I've seen you use this.browser
which was the returns from the puppeteer.launch()
method?
I started to work on that PR and initialize puppeteer outside the lib scope and passing the browser
object to the lib and all works fine on two scrappers (Isracard, Otsar HaHayal).
@lielran try to inject old version of puppeteer, take something very old like 0.13.0 or 1.0.0. Although it seems silly choosing old version it simulates real future scenario:
v1.6.0
which is latest today.v2.0.0
with breaking apiisraeli-bank-scrapers
israeli-bank-scrapers
it will fail to work.to solve # 4 one of the following should happen:
israeli-bank-scrapers
will need to manage compatibility internally, something that is not obvious and requires architecture change (which is possible is @eshaham will decide about it).btw, I'm a bit more worried about scenarios where the break will not be noticeable or a one that will happen immediately which will cause the consumer of the library to break without knowing why and will be hard to detect the source of it.
@esakal In any given state israeli-bank-scrapers
depended on one specific version puppeteer
.
if you want to inject an old version you should align with the old version of israeli-bank-scrapers
.
you can't really expect it to work if you injecting very old version. another option would to use peer dependency - but this may be an overkill for starts...
the most important thing to say it will not affect all of israeli-bank-scrapers
users since it is an optional way to work with the library.
@eshaham I'm not sure if it is possible to safely inject the browser instance from outside of the library as the internal services/utils depends heavily on the puppeteer api. A good example is the the api change in puppeteer for $$eval (#135) that broke version 0.6.0. I can think on many other places in the code that can be affected and it will open the library to problems that will be hard to maintain as you will need to make sure the code keeps backward compatibility to all puppeteer versions somehow.
@esakal good point. One way around this problem is to extract the dependency on puppeteer, define an interface the browser object should implement, and implement PuppeteerBrowser class which implements this dependency (could be either on a different repo, or the same one). That way whoever injects any version of puppeteer / a different framework would be in charge of making sure the wrapper implements the required interface.
I don't know if you know this, but when I first started working on this repo I was using Phantom.js.
See this old commit...
It was a nightmare (funny enough I was also examining Nightmare.js 😄 )!
It is likely there would be another better framework in the future, and decoupling the module from puppeteer would be of good service to the community.
What do you guys think?
Indeed Nightmare(same goes for CasperJs). I was also my path till Google announce puppeteer, it was clear to me that it will replace all the headless browser scraping/testing tools
regarding the injecting - I think that using peer dependency and enforcing working on the same version should work and prevent issues but I wouldn't jump to that solution right now.
I think that good documentation or maybe validating puppeteer version on runtime can proper error message cover us up
@eshaham Nightmare was so buggy, I hated using it.
One way around this problem is to extract the dependency on puppeteer...
I love the suggested architecture as it will make this library more robust. But (there is always a but) I'm not sure if it worth the effort as it will complicate the maintenance and I don't think many people will use it. If you have a backlog this one should definitely be part of that list. If you still want to check this one, we can brainstorm about it.
I think that good documentation or maybe validating puppeteer version on runtime can proper error message cover us up
@lielran I'm not sure if documentation will help here as people usually just pull directly from npmjs and only open github if they have issues.
Thinking about the original request. If you just want to allow passing puppeteer instance from outside of the library, before scraping we can make sure it is the same version that is being used by this library and stop the process if it doesn't. @lielran is this what you were thinking of?
@esakal yes. question is if we can know which version of puppeteer you have in runtime and compare to the version of the library.
what do you say of the peer dependency? isn't it a good way to enforce the same version?
Hi, I didn't find a way to know at runtime what is the puppeteer version. did you find such a way? I say as this original requirement was to allow external puppeteer initialization and you achieved it with #145, it might be enough for now. It will be the responsibility of the consumer who uses this option to check for compatibility when he updates this library. I will approve your pr and will let @eshaham to merge it.
I love the suggested architecture as it will make this library more robust. But (there is always a but) I'm not sure if it worth the effort as it will complicate the maintenance and I don't think many people will use it. If you have a backlog this one should definitely be part of that list. If you still want to check this one, we can brainstorm about it.
Agreed. It's been on my list for a while, just waiting for when we really need it, or a relevant PR.
I say as this original requirement was to allow external puppeteer initialization and you achieved it with #145, it might be enough for now. It will be the responsibility of the consumer who uses this option to check for compatibility when he updates this library.
Agree with @esakal.
I will approve your pr and will let @eshaham to merge it.
what's the status of #145? seems like we're still waiting for changes by @lielran?
I found it hard to control the launch command of puppeteer. I would be super nice If I could pass
browser
object tocreateScraper(option)
function.First of all: how do you feel about it❓ as I see it there is couple of options:
option 1 :
where browser is
const browser = await puppeteer.launch(SomeParams);
option2:
createScraper(options,maybeBrowser)
both options will need to something like this:
anything else?