apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Follow refresh redirects #928

Closed · juli-alvarez closed this issue 2 years ago

juli-alvarez commented 2 years ago

Hi @jnioche! Currently we are using a custom protocol for fetching that follows redirections. So far so good, until we found a case where the redirection is done by a meta refresh tag during the parsing phase. This leads to unexpected and inconsistent results, such as two documents being created in the ES index: the original seed with status REDIRECTION and status.code 200, and the redirected URL with status FETCHED.

I want to propose a simple change in JSoupParserBolt that adds a followRefreshRedirects flag (defaulting to true) so that this behavior can be skipped, leaving just the original seed with status FETCHED and status.code 200 in our example. Please let me know what you think; if you consider it would be helpful for others with this use case, I will create the PR. Thanks!
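
Roughly, the change I have in mind is just a guard around the existing meta-refresh handling; a minimal sketch, assuming a hypothetical followRefreshRedirects field (nothing is implemented yet):

```java
// Sketch only: hypothetical followRefreshRedirects flag in JSoupParserBolt,
// defaulting to true so the current behaviour stays unchanged.
private boolean followRefreshRedirects = true;

// ... later, in the parsing code ...
if (followRefreshRedirects) {
    // existing meta[http-equiv=refresh] detection and REDIRECTION emit
} else {
    // refresh tag ignored: the seed keeps status FETCHED and status.code 200
}
```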

jnioche commented 2 years ago

hi @juli-alvarez, thanks for asking.

Can't you get the same result by setting redirections.allowed: false in the config? If your protocol follows redirections (which is often the case if you use Selenium or Playwright), then they will be handled there and only there.
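
For reference, that key is read once from the topology configuration and surfaces as the allowRedirs() check inside the parser bolts; simplified, the wiring amounts to something like this (only the key name is real, the class below is just an illustration):

```java
import java.util.Map;

import com.digitalpebble.stormcrawler.util.ConfUtils;

// Illustration only: how a bolt can pick up redirections.allowed from the
// topology configuration. allowRedirs() in the JSoupParserBolt code quoted
// below reflects this kind of flag.
public class RedirConfigExample {

    private boolean allowRedirs;

    public void configure(Map<String, Object> conf) {
        // defaults to true; set redirections.allowed: false to disable
        allowRedirs = ConfUtils.getBoolean(conf, "redirections.allowed", true);
    }

    protected boolean allowRedirs() {
        return allowRedirs;
    }
}
```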

juli-alvarez commented 2 years ago

Hi @jnioche, thanks for the quick response! Exactly. For the seeds we crawl with redirections enabled, we simply delegate this task to the OkHttp client in the fetcher bolt, and we would expect to see that seed marked not as REDIRECTION but as FETCHED or ERROR. So it's OK for us to use the redirections.allowed setting, but it should cover the whole redirection block:

```java
// redirection?
try {
    String redirection = null;

    Element redirElement = jsoupDoc
            .selectFirst("meta[http-equiv~=(?i)refresh][content]");
    if (redirElement != null) {
        redirection = RefreshTag.extractRefreshURL(redirElement
                .attr("content"));
    }

    if (StringUtils.isNotBlank(redirection)) {
        // stores the URL it redirects to
        // used for debugging mainly - do not resolve the target
        // URL
        LOG.info("Found redir in {} to {}", url, redirection);
        metadata.setValue("_redirTo", redirection);

        if (allowRedirs() && StringUtils.isNotBlank(redirection)) {
            emitOutlink(tuple, new URL(url), redirection, metadata);
        }

        // Mark URL as redirected
        collector
                .emit(com.digitalpebble.stormcrawler.Constants.StatusStreamName,
                        tuple, new Values(url, metadata,
                                Status.REDIRECTION));
        collector.ack(tuple);
        eventCounter.scope("tuple_success").incr();
        return;
    }
} catch (MalformedURLException e) {
    LOG.error("MalformedURLException on {}", url);
}
```

Currently the setting only prevents the redirection target from being emitted as DISCOVERED, and the line that emits the tuple with status REDIRECTION is the one giving us problems. What do you think? Thanks!

jnioche commented 2 years ago

thanks, @juli-alvarez, I see what you mean. It would be better to have a separate configuration for this (jsoup.ignore.meta.redirections with a default value of false?) as it is not quite the same as redirections.allowed. This would cover the whole block. Any chance you could contribute a PR for this?
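
Something along these lines, as a sketch only (key name and default are just the suggestion above; the PR can of course adjust both):

```java
// Sketch of the suggested change in JSoupParserBolt; nothing here is merged.
// The key would be read once when the bolt is prepared, for example:
//     ignoreMetaRedirections =
//             ConfUtils.getBoolean(conf, "jsoup.ignore.meta.redirections", false);

// Then, unlike redirections.allowed, it would guard the whole block quoted
// earlier, including the emit with Status.REDIRECTION:
if (!ignoreMetaRedirections) {
    // meta refresh detection, _redirTo metadata, optional outlink via
    // allowRedirs(), emit as REDIRECTION, ack, return
}
// when the flag is true the block is skipped entirely and the page is
// eventually emitted with status FETCHED
```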

juli-alvarez commented 2 years ago

Hi @jnioche! Totally agree with you. PR created: https://github.com/DigitalPebble/storm-crawler/pull/930 Thanks!!