Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Collect and commit in parallel #667

Closed: paulmilner closed this issue 4 years ago

paulmilner commented 4 years ago

Hi, sorry if I've misunderstood the configuration, but I'm using the example config to crawl data and then commit it to Azure Search. I was expecting it to crawl and commit in parallel, but it seems to do all the crawling first and only start committing once that's finished. Is that the intention, or am I doing it wrong?

LeMoussel commented 4 years ago

From my understanding, yes, collector-http does all the crawling first, and only when that's finished does it start committing. In the configuration file, you can adjust the value of the commitBatchSize tag, which sets the maximum number of documents to send to the committer at once.

paulmilner commented 4 years ago

OK, thanks for the info

essiembre commented 4 years ago

@paulmilner and @LeMoussel, a small clarification: The queueSize configuration option in the committer dictates after how many successfully processed pages it will start sending batches to Azure. As an example:

<queueSize>100</queueSize>: Once the queue reaches 100 documents, send them to Azure.
<commitBatchSize>20</commitBatchSize>: When sending the 100 queued documents, send them in batches of 20 (resulting in 5 batches of 20 documents in this example).
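
As an example, a committer section combining both options could look like the sketch below. The AzureSearchCommitter class name and the endpoint/apiKey/indexName tags are illustrative placeholders based on the Norconex Azure Search Committer, not taken from this thread; check the committer's documentation for the exact tag names.

<committer class="com.norconex.committer.azuresearch.AzureSearchCommitter">
  <!-- Assumed connection settings; replace the placeholders with real values. -->
  <endpoint>https://YOUR_SERVICE.search.windows.net</endpoint>
  <apiKey>YOUR_ADMIN_API_KEY</apiKey>
  <indexName>YOUR_INDEX</indexName>
  <!-- Queue up to 100 processed documents before sending anything. -->
  <queueSize>100</queueSize>
  <!-- Send the queued documents in batches of 20 at a time. -->
  <commitBatchSize>20</commitBatchSize>
</committer>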

paulmilner commented 4 years ago

@essiembre Thanks Pascal, that's just the info I needed: it does not do ALL the crawling followed by ALL the committing; that was just what I happened to see in my case. By varying those properties I can make it do some committing whilst still crawling.

LeMoussel commented 4 years ago

@paulmilner To test the AbstractMappedCommitter parameters, I created a CustomLoggingCommitter class that logs the Add and Delete operations. If you want, I can make it available here.

paulmilner commented 4 years ago

@LeMoussel @essiembre Thanks, I would be interested to see that, as I might have to write a committer myself: I need a way to send crawled documents to temporary storage for further processing that is not possible within the Norconex products. At the risk of widening this thread too much, is a "committer" the right component to be doing that in? I mean taking the actual crawled files (whether HTML, JPG, PDF, MP3, MP4, or whatever), not just the extracted data or metadata, and sending them to storage?

LeMoussel commented 4 years ago

I'll let @essiembre answer your question. In the meantime, here is the CustomLoggingCommitter class I mentioned:

package coweb;

import java.util.List;

import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import com.norconex.committer.core.AbstractMappedCommitter;
import com.norconex.committer.core.CommitterException;
import com.norconex.committer.core.IAddOperation;
import com.norconex.committer.core.ICommitOperation;
import com.norconex.committer.core.IDeleteOperation;
import com.norconex.commons.lang.map.Properties;

import org.apache.commons.configuration.XMLConfiguration;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

/**
 * A committer that simply logs every Add and Delete operation it receives,
 * useful for observing when batches are actually sent during a crawl.
 */
public class CustomLoggingCommitter extends AbstractMappedCommitter {
  private static final Logger LOG =
      LogManager.getLogger(CustomLoggingCommitter.class);

  public CustomLoggingCommitter() {
    super();
  }

  @Override
  protected void commitBatch(List<ICommitOperation> batch) {
    for (ICommitOperation op : batch) {
      if (op instanceof IAddOperation) {
        IAddOperation opAdd = (IAddOperation) op;
        // Crawler-provided metadata for the document being added.
        // Note: the raw document content is also available via
        // opAdd.getContentStream() if you need the actual file.
        final Properties p = opAdd.getMetadata();
        String referrer = p.getString("collector.referrer-reference");
        List<String> redirectTrail = p.get("collector.redirect-trail");

        LOG.info("Committer Add Operation: " + opAdd.getReference());
        LOG.info("\t referrer: " + referrer);
        if (redirectTrail != null && !redirectTrail.isEmpty()) {
          LOG.info("\t redirectTrail: " + redirectTrail);
        }
      } else if (op instanceof IDeleteOperation) {
        IDeleteOperation opDel = (IDeleteOperation) op;
        LOG.info("Committer Delete Operation: " + opDel.getReference());
      } else {
        throw new CommitterException("Committer Unsupported Operation: " + op);
      }
    }
  }

  @Override
  protected void loadFromXml(XMLConfiguration xml) {
    // No committer-specific settings to load; common settings such as
    // queueSize and commitBatchSize are handled by the parent classes.
  }

  @Override
  protected void saveToXML(XMLStreamWriter writer) throws XMLStreamException {
    // No committer-specific settings to save.
  }
}
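
To try it, the class can be registered in the crawler configuration like any other committer. A sketch, assuming the compiled coweb.CustomLoggingCommitter class is on the collector's classpath:

<committer class="coweb.CustomLoggingCommitter">
  <!-- Common batching settings, handled by the parent classes. -->
  <queueSize>100</queueSize>
  <commitBatchSize>20</commitBatchSize>
</committer>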

essiembre commented 4 years ago

I responded in #674. Please try not to start new threads in closed tickets as they are easy to miss.