bsgreenb / Open-Textbooks

A library for scraping nearly every college bookstore in the US
MIT License

BNCollege uses Javascript redirect to prevent scraping #9

Open ravirahman opened 8 years ago

ravirahman commented 8 years ago

It appears that BNCollege, and potentially others, use a JavaScript redirect to prevent scraping. Here is a workaround (on Android) using a WebView. First declare a field:

    WebView view;

Then in, for example, `onCreate`:

    view = new WebView(getApplicationContext());
    view.getSettings().setJavaScriptEnabled(true);
    view.getSettings().setUserAgentString("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36");
    view.getSettings().setLoadsImagesAutomatically(true);
    // Accept cookies (including third-party ones) before the first request is made.
    CookieManager.getInstance().setAcceptCookie(true);
    CookieManager.getInstance().setAcceptThirdPartyCookies(view, true);
    view.loadUrl("http://milton.bncollege.com/webapp/wcs/stores/servlet/TBWizardView?catalogId=10001&langId=-1&storeId=82238");
    start(); // begin polling for the redirected page

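Then use a timer to check when the JavaScript redirect is complete:

    private Timer timer;

    // Build a fresh task for every start(): a TimerTask cannot be scheduled a second time.
    private TimerTask newPollTask() {
        return new TimerTask() {
            @Override
            public void run() {
                // WebView methods must be called on the UI thread.
                runOnUiThread(new Runnable() {
                    @Override
                    public void run() {
                        view.evaluateJavascript("(function() { var el = document.getElementsByClassName(\"bncbOptionItem\")[0]; return el ? el.outerHTML : null; })()", new ValueCallback<String>() {
                            @Override
                            public void onReceiveValue(String value) {
                                // "null" means the element is not there yet, so keep polling.
                                if (value == null || "null".equals(value)) {
                                    return;
                                }
                                System.out.println(value);
                                stop(); // redirect finished, no need to poll again
                            }
                        });
                    }
                });
            }
        };
    }

    public void start() {
        if (timer != null) {
            return;
        }
        timer = new Timer();
        timer.scheduleAtFixedRate(newPollTask(), 0, 2000);
    }

    public void stop() {
        if (timer == null) {
            return;
        }
        timer.cancel();
        timer = null;
    }
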
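
One thing to watch for with the callback above: `evaluateJavascript` hands its result to `onReceiveValue` as a JSON value, so the `outerHTML` string typically arrives wrapped in double quotes with characters such as `<` escaped as `\u003C`, and a missing element arrives as the literal text `null`. Below is a minimal plain-Java sketch of decoding that value before parsing the HTML. `JsResultDecoder` and `decode` are hypothetical names, not part of Open-Textbooks or the Android SDK, and the sample input in `main` is made up. On Android itself, `org.json.JSONTokener` could likely do the same job.

```java
// Plain-Java sketch with no Android dependencies. JsResultDecoder is a
// hypothetical helper, not part of Open-Textbooks or the Android SDK.
final class JsResultDecoder {

    /**
     * Decodes the string passed to onReceiveValue. Returns null when the
     * script found no element yet (raw is null or the literal "null").
     * Malformed hex digits after a backslash-u escape throw
     * NumberFormatException.
     */
    static String decode(String raw) {
        if (raw == null || raw.equals("null")) {
            return null;
        }
        int end = raw.length() - 1; // index of the closing quote
        // Anything that is not a quoted JSON string is returned unchanged.
        if (raw.length() < 2 || raw.charAt(0) != '"' || raw.charAt(end) != '"') {
            return raw;
        }
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 1; i < end; i++) {
            char c = raw.charAt(i);
            if (c != '\\' || i + 1 >= end) {
                out.append(c);
                continue;
            }
            char esc = raw.charAt(++i);
            switch (esc) {
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                case 'r': out.append('\r'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case 'u':
                    if (i + 4 < end) {
                        // Four hex digits follow the 'u'.
                        out.append((char) Integer.parseInt(raw.substring(i + 1, i + 5), 16));
                        i += 4;
                    } else {
                        out.append(esc); // truncated escape, keep as-is
                    }
                    break;
                default:
                    out.append(esc); // covers escaped quote, backslash and slash
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample of what the callback might receive.
        String raw = "\"\\u003Cli class=\\\"bncbOptionItem\\\">Fall 2016\\u003C/li>\"";
        System.out.println(decode(raw));
    }
}
```

In the callback, `String html = JsResultDecoder.decode(value);` would then give either `null` (keep polling) or markup that can be handed to an HTML parser.
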
bsgreenb commented 8 years ago

Thanks, Ravi. Would love it if you or someone else could post more up-to-date scrapers. A lot of people rely on these, so it'd be nice if we could help more people out with up-to-date scrapers rather than keeping them all to ourselves.

Thanks!