algolia / docsearch-scraper

DocSearch - Scraper
https://docsearch.algolia.com/

Crawl JS generated docs #13

Closed ElPicador closed 8 years ago

ElPicador commented 8 years ago

Some documentation is generated client-side with JS (e.g. http://docs.prezly.com/, https://gns3.com/support/docs/quick-start-guide-for-windows-us).

It would be nice to be able to parse it.

ElPicador commented 8 years ago

Note to myself: Add search on gns3: https://secure.helpscout.net/conversation/154616922/4890/?folderId=696715

pixelastic commented 8 years ago

This looks like a pretty big enhancement, considering that the underlying engine we're using for scraping (Scrapy) only does static HTML parsing. For SPAs, I would rather try to hit the API directly if possible, or say that DocSearch is not compatible with their documentation.

Prezly is using readme.io, so maybe we could build something directly at the readme.io level.

proudlygeek commented 8 years ago

I'd say it's more a feature than an enhancement.

I would personally go with an optional HTTP proxy that can render JavaScript-generated documentation (PhantomJS / Selenium) and feed the resulting static page into Scrapy / Python. What do you think about this approach?
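To make the idea concrete, here is a minimal sketch of "render first, then parse as static HTML". It uses only the Python standard library; `scrape_headings`, `HeadingExtractor`, and `fake_render` are hypothetical names for illustration, and `fake_render` merely stands in for a real PhantomJS / Selenium-backed renderer. It is not how the scraper is actually implemented.

```python
from html.parser import HTMLParser
from typing import Callable, List


class HeadingExtractor(HTMLParser):
    """Collects the text of <h1>..<h3> elements from static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.headings: List[str] = []
        self._in_heading = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_heading:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3") and self._in_heading:
            self.headings.append("".join(self._buffer).strip())
            self._in_heading = False


def scrape_headings(url: str, render: Callable[[str], str]) -> List[str]:
    """Render `url` to static HTML first, then parse it like any static page."""
    html = render(url)  # assumption: a headless-browser-backed renderer is plugged in here
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


def fake_render(url: str) -> str:
    # Stand-in for a real renderer: returns the DOM as it might look after JS has run.
    return "<html><body><h1>Quick start</h1><p>...</p><h2>Install</h2></body></html>"
```

Because the renderer is just a callable, the parsing side stays unchanged whether the HTML came from a plain HTTP fetch or from a JavaScript-capable proxy.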

ElPicador commented 8 years ago

There is a Scrapy integration for JS rendering: https://github.com/scrapinghub/scrapy-splash
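For context, scrapy-splash talks to a Splash rendering service over HTTP. The sketch below bypasses the plugin and builds a request against Splash's `render.html` endpoint directly, using only the standard library. The helper names are hypothetical, and the base URL `http://localhost:8050` assumes a Splash instance running locally on its default port.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(page_url: str,
                      splash_base: str = "http://localhost:8050",
                      wait: float = 0.5) -> str:
    """Build a Splash render.html URL that returns the page's HTML after JS has run."""
    # `url` is the page to render; `wait` is how long (seconds) Splash waits after load.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(page_url: str, **kwargs) -> str:
    """Fetch the rendered HTML (requires a reachable Splash instance)."""
    with urlopen(splash_render_url(page_url, **kwargs)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_rendered_html("http://docs.prezly.com/")` would then return static HTML that a regular Scrapy-style parser can consume, provided Splash is actually running at the assumed address.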

proudlygeek commented 8 years ago

@ElPicador very cool! As far as I can see, it's basically what I suggested, just handier and already Dockerized :smile: Have you given it a try yet?

ElPicador commented 8 years ago

Never. @redox was the one who told me about it.

redox commented 8 years ago

@ElPicador @proudlygeek @pixelastic We've been thinking of making it the onboarding project of @aseure :)

proudlygeek commented 8 years ago

Awesomeness!!! :100: :+1:

aseure commented 8 years ago

I've opened a PR to address these problematic documentation sites. Please see https://github.com/algolia/documentation-scrapper/pull/46.

pixelastic commented 8 years ago

I think this can be closed.