asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

text parsers aren't looking for links in content thus shouldVisit is never called #317

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a simple crawler (a minimal setup sketch is shown after these steps).
2. Point the crawler at any text-based content (for example, a text/plain content type).
3. Start the crawler.
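
A minimal sketch of such a setup, assuming the pre-4.x single-argument shouldVisit(WebURL)
signature used elsewhere in this report; the storage folder and seed URL are placeholders:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class TextCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(WebURL url) {
        // With text/plain content this is never reached, which is the bug reported here.
        System.out.println("shouldVisit: " + url.getURL());
        return true;
    }

    @Override
    public void visit(Page page) {
        System.out.println("visit: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder storage folder
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://example.com/links.txt"); // placeholder text/plain seed
        controller.start(TextCrawler.class, 1);
    }
}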

What is the expected output? What do you see instead?
Links in the file should be gathered and added to the queue, with shouldVisit being called
in the process. Instead, the text parsers never look for links, so shouldVisit is never called.

What version of the product are you using?
latest from trunk/master

Please provide any additional information below.

I'm developing a crawler that gathers data from multiple sources. Since I don't
control the sources, the crawler needs to be as flexible as possible.

Currently (aside from issue 316,
https://code.google.com/p/crawler4j/issues/detail?id=316, which I fixed locally),
the links in any plain-text (or text/xml) source are never added to the queue.

I had to copy the "add to the queue" logic into the #visit(Page page) method of
my crawler:

Disclaimer: isCrawlableUrl is an internal method that uses a regex to decide
whether a page should be crawled or not (a hypothetical sketch follows the code
below).
  private static final Pattern urlPattern = Pattern.compile("(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)" + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*" + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

@Override
public void visit(Page page) {
    InputStream is = null;
    BufferedReader in = null;
    try {
        // Transparently decompress gzipped content (hardcoded; should be a parameter somewhere).
        if (page.getWebURL().getURL().endsWith(".gz")) {
            is = new GZIPInputStream(new ByteArrayInputStream(page.getContentData()));
        } else {
            is = new ByteArrayInputStream(page.getContentData());
        }

        in = new BufferedReader(new InputStreamReader(is));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher matcher = urlPattern.matcher(line);
            while (matcher.find()) {
                int matchStart = matcher.start(1);
                int matchEnd = matcher.end();
                String url = line.substring(matchStart, matchEnd);
                if (isCrawlableUrl(url)) {
                    // Build a WebURL for the extracted link and schedule it on the frontier,
                    // mirroring what the HTML parser normally does for anchors.
                    WebURL curURL = page.getWebURL();
                    WebURL webURL = new WebURL();
                    webURL.setURL(url);
                    webURL.setParentDocid(curURL.getParentDocid());
                    webURL.setParentUrl(curURL.getParentUrl());
                    webURL.setDepth(curURL.getDepth());
                    webURL.setDocid(-1);
                    webURL.setAnchor(curURL.getAnchor());
                    if (shouldVisit(webURL) && getMyController().getRobotstxtServer().allows(webURL)) {
                        webURL.setDocid(getMyController().getDocIdServer().getNewDocID(url));
                        getMyController().getFrontier().schedule(webURL);
                    }
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Close the reader if it was created (this also closes the underlying stream),
        // otherwise close the raw stream.
        if (in != null) {
            try {
                in.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        } else if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
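
For reference, a hypothetical sketch of the isCrawlableUrl helper mentioned in the
disclaimer above; the whitelist pattern is an assumption for illustration, not part of
the original report:

// Hypothetical sketch: accept only http(s) URLs under a whitelisted domain.
// The actual regex used by the reporter is not shown in this issue.
private static final Pattern crawlablePattern =
        Pattern.compile("^https?://([\\w-]+\\.)*example\\.com(/.*)?$", Pattern.CASE_INSENSITIVE);

private boolean isCrawlableUrl(String url) {
    return crawlablePattern.matcher(url).matches();
}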

Original issue reported on code.google.com by `panthro....@gmail.com` on 16 Nov 2014 at 5:21
GoogleCodeExporter commented 9 years ago
I really need a solid example - one text/plain URL that fails on your crawler, so I can
test it in my environment.

Original comment by avrah...@gmail.com on 16 Nov 2014 at 5:37

GoogleCodeExporter commented 9 years ago
Use http://dx.com/sitemap.xml with the issue 316 fix applied.

Original comment by panthro....@gmail.com on 16 Nov 2014 at 5:52

GoogleCodeExporter commented 9 years ago
Works

Original comment by avrah...@gmail.com on 16 Nov 2014 at 5:59