apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Does StormCrawler follow secondary JavaScript page content loads? #639

Closed: tony-boxed closed this issue 5 years ago

tony-boxed commented 5 years ago

From looking at my scraped results for webmd.com, it seems it may not, and I guess it's too much to expect that it would, since that would be very complicated. But I figured I'd ask anyway to double-check.

So, if I have a page that uses JavaScript to load its body after the initial page load, does StormCrawler have any way to wait for this secondary content to load and then scrape the page?

I imagine no crawler does this except very sophisticated ones, like what Google or Bing might use, or maybe even they don't, since it would require browser-level intelligence and complexity. The thought of how you'd even implement behavior like this is anxiety-producing.

kkrugler commented 5 years ago

Hi @tony-boxed - as @jnioche noted on your previous issue, please ask questions on Stack Overflow rather than opening issues on GitHub. Thanks.

tony-boxed commented 5 years ago

Whoops - thanks!