eshaham / israeli-bank-scrapers

Provide scrapers for all major Israeli banks and credit card companies
MIT License
589 stars 160 forks source link

launch puppeteer browser outside the lib #142

Closed lielran closed 6 years ago

lielran commented 6 years ago

I found it hard to control the launch command of puppeteer. I would be super nice If I could pass browser object to createScraper(option) function.

First of all: how do you feel about it❓ as I see it there is couple of options:

option 1 :

  const options = {
    companyId: 'isracard', 
    startDate: '01-06-2018', 
    combineInstallments: false,
    verbose: true, // include more debug info about in the output
    browser: browser,
  };

where browser is const browser = await puppeteer.launch(SomeParams);

option2: createScraper(options,maybeBrowser)

both options will need to something like this:

if(browser !=null) {
  this.browser = browser;
}else {
  this.browser  = await puppeteer.launch();
}

anything else?

esakal commented 6 years ago

Hi @lielran, This library is using two different scraping approaches, only one of them is using (puppeteer) browser. Visa cal for example inherit from BaseScraper which doesn't require browser at all is it does direct calls to the server api. The purpose of createScraper factory function is to hide the detail implementation of the chosen scraper so I'm not sure it is advised to ask the browser instance as part of the library public api.

I assume you want this feature for a specific reason? if so we can easily extend the options argument to include other properties that will be propogated to puppeteer if needed. Can you please elaborate about the use-case you try to achieve?

Thanks! Eran.

lielran commented 6 years ago

I see...I wasn't aware of the API approach.

I still think since you are using createScraper for each scraper(or bank vendor) it should be possible to init the ones the using puppeteer outside createScraper function.

puppeteer has a lot of configurations(see this one: https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions) I don't want to expose all of them via the options + keep them all up to date ext...

another point is to reuse the same browser instance.

the use case is to use israeli-bank-scrapers with the existing flow where I had an instance of the browser(of puppeteer).

esakal commented 6 years ago

puppeteer has a lot of configurations ... don't want to expose all of them via the options + keep them all up to date ext...

in BaseScraperWithBrowser.initialize() the library currently propagate only one property but we can easily extend it to propagate the complete options instance or even options.puppetter dedicated object. Since this library is not based on typed languages (like Typescript), it should affect future updates of puppeteer. @eshaham wdyt?

another point is to reuse the same browser instance.

I see, let's wait for @eshaham feedback about this one.

lielran commented 6 years ago

@esakal cool, let's wait to @eshaham.

once we all agree I can work on a PR for that.

eshaham commented 6 years ago

@lielran I totally agree with the need for getting better control over the instantiation of the browser instance, both for the reasons you've mentioned, and also for being able to control the version of puppeteer being used. For the first reason, we can easily send any additional props the user adds to the options object directly to the puppeteer launch function. But I think that allowing to inject the browser instance would also solve the second reason I mentioned above.

I actually wanted to take this to the next phase and allow re-using the same browser instance and tabs during a session using a pool. I need to find time to incorporate that.

Anyway, PR welcomed for injecting the browser instance.

esakal commented 6 years ago

@eshaham I'm not sure if it is possible to safely inject the browser instance from outside of the library as the internal services/utils depends heavily on the puppeteer api. A good example is the the api change in puppeteer for $$eval (#135) that broke version 0.6.0. I can think on many other places in the code that can be affected and it will open the library to problems that will be hard to maintain as you will need to make sure the code keeps backward compatibility to all puppeteer versions somehow.

Both requests can be achieved by passing additional arguments to the puppeteer (it can be an object dedicated to puppeteer that is passed as-is by the user) and by adding extra configuration that will manage different strategies for browser instances.

lielran commented 6 years ago

@esakal - can you explain a bit more why injecting will cause issues? from what I've seen you use this.browser which was the returns from the puppeteer.launch() method?

I started to work on that PR and initialize puppeteer outside the lib scope and passing the browser object to the lib and all works fine on two scrappers (Isracard, Otsar HaHayal).

esakal commented 6 years ago

@lielran try to inject old version of puppeteer, take something very old like 0.13.0 or 1.0.0. Although it seems silly choosing old version it simulates real future scenario:

  1. you create an application that uses specific puppeteer version v1.6.0 which is latest today.
  2. after 3-4 months puppeteer deploy a new version v2.0.0 with breaking api
  3. a contributor of this library updates all api accordingly and deploy new version of israeli-bank-scrapers
  4. once you will update israeli-bank-scrapers it will fail to work.

to solve # 4 one of the following should happen:

  1. you will need to update puppeteer as well which might not be what you want at the moment
  2. the contributor of israeli-bank-scrapers will need to manage compatibility internally, something that is not obvious and requires architecture change (which is possible is @eshaham will decide about it).

btw, I'm a bit more worried about scenarios where the break will not be noticeable or a one that will happen immediately which will cause the consumer of the library to break without knowing why and will be hard to detect the source of it.

lielran commented 6 years ago

@esakal In any given state israeli-bank-scrapers depended on one specific version puppeteer. if you want to inject an old version you should align with the old version of israeli-bank-scrapers.

you can't really expect it to work if you injecting very old version. another option would to use peer dependency - but this may be an overkill for starts...

the most important thing to say it will not affect all of israeli-bank-scrapers users since it is an optional way to work with the library.

eshaham commented 6 years ago

@eshaham I'm not sure if it is possible to safely inject the browser instance from outside of the library as the internal services/utils depends heavily on the puppeteer api. A good example is the the api change in puppeteer for $$eval (#135) that broke version 0.6.0. I can think on many other places in the code that can be affected and it will open the library to problems that will be hard to maintain as you will need to make sure the code keeps backward compatibility to all puppeteer versions somehow.

@esakal good point. One way around this problem is to extract the dependency on puppeteer, define an interface the browser object should implement, and implement PuppeteerBrowser class which implements this dependency (could be either on a different repo, or the same one). That way whoever injects any version of puppeteer / a different framework would be in charge of making sure the wrapper implements the required interface.

I don't know if you know this, but when I first started working on this repo I was using Phantom.js. See this old commit... It was a nightmare (funny enough I was also examining Nightmare.js 😄 )!
It is likely there would be another better framework in the future, and decoupling the module from puppeteer would be of good service to the community.

What do you guys think?

lielran commented 6 years ago

Indeed Nightmare(same goes for CasperJs). I was also my path till Google announce puppeteer, it was clear to me that it will replace all the headless browser scraping/testing tools

lielran commented 6 years ago

regarding the injecting - I think that using peer dependency and enforcing working on the same version should work and prevent issues but I wouldn't jump to that solution right now.

I think that good documentation or maybe validating puppeteer version on runtime can proper error message cover us up

esakal commented 6 years ago

@eshaham Nightmare was so buggy, I hated using it.

One way around this problem is to extract the dependency on puppeteer...

I love the suggested architecture as it will make this library more robust. But (there is always a but) I'm not sure if it worth the effort as it will complicate the maintenance and I don't think many people will use it. If you have a backlog this one should definitely be part of that list. If you still want to check this one, we can brainstorm about it.

I think that good documentation or maybe validating puppeteer version on runtime can proper error message cover us up

@lielran I'm not sure if documentation will help here as people usually just pull directly from npmjs and only open github if they have issues.

Thinking about the original request. If you just want to allow passing puppeteer instance from outside of the library, before scraping we can make sure it is the same version that is being used by this library and stop the process if it doesn't. @lielran is this what you were thinking of?

lielran commented 6 years ago

@esakal yes. question is if we can know which version of puppeteer you have in runtime and compare to the version of the library.

lielran commented 6 years ago

what do you say of the peer dependency? isn't it a good way to enforce the same version?

esakal commented 6 years ago

Hi, I didn't find a way to know at runtime what is the puppeteer version. did you find such a way? I say as this original requirement was to allow external puppeteer initialization and you achieved it with #145, it might be enough for now. It will be the responsibility of the consumer who uses this option to check for compatibility when he updates this library. I will approve your pr and will let @eshaham to merge it.

eshaham commented 6 years ago

I love the suggested architecture as it will make this library more robust. But (there is always a but) I'm not sure if it worth the effort as it will complicate the maintenance and I don't think many people will use it. If you have a backlog this one should definitely be part of that list. If you still want to check this one, we can brainstorm about it.

Agreed. It's been on my list for a while, just waiting for when we really need it, or a relevant PR.

I say as this original requirement was to allow external puppeteer initialization and you achieved it with #145, it might be enough for now. It will be the responsibility of the consumer who uses this option to check for compatibility when he updates this library.

Agree with @esakal.

I will approve your pr and will let @eshaham to merge it.

what's the status of #145? seems like we're still waiting for changes by @lielran?