Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

id must be less than 128 characters #323

Closed: mhaegeleD closed this issue 7 years ago

mhaegeleD commented 7 years ago

Hello, I have tried to use the collector in combination with the AWS CloudSearch Committer.

And I (still) have two problems:

1) Is there any way to commit the crawled results to AWS again in case something failed? Hours of crawling are sitting on the hard disk, but there were problems transmitting them to AWS. The first problem was that the results got too big ("Request size exceeded 20971520 bytes"; any hints on how to prevent this error?). The second problem, on the next (smaller) try, was an ID longer than 128 characters.

2) How can I ensure that the generated ID isn't longer than 128 characters? Otherwise AWS says no ("id" must be less than 128 characters) and everything fails.

Thank you very much for your help.

Best regards,
Michael

essiembre commented 7 years ago

Those are limitations of AWS CloudSearch. You can try to work around them with the following:

Problem 1: "Request size exceeded 20971520 bytes"

AWS CloudSearch restricts individual documents to 1 MB and a document batch to 5 MB. You can try to eliminate batch size errors by setting the committer <commitBatchSize> to a very low value.
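For illustration, a committer section along these lines caps how many documents are sent per batch. The class name and endpoint below are assumptions; check your CloudSearch Committer documentation and use your own domain's document endpoint:

<committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">
  <!-- Placeholder: replace with your CloudSearch domain's document endpoint. -->
  <documentEndpoint>https://doc-example.us-east-1.cloudsearch.amazonaws.com</documentEndpoint>
  <!-- Send few documents per batch to stay well under the 5 MB batch limit. -->
  <commitBatchSize>10</commitBatchSize>
</committer>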

Eliminating the individual document size problem can be more challenging. If you assume 2 bytes per character (UTF-8), you can try the following as a post-parse handler in your importer section (2,500,000 characters at 2 bytes each is about 5 MB):

<tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
  <!-- regex="true" makes fromValue a regular expression; the trailing .*
       ensures everything past the captured prefix is discarded. -->
  <replace fromField="document.reference" regex="true">
    <fromValue>^(.{1,2500000}).*</fromValue>
    <toValue>$1</toValue>
  </replace>
</tagger>

Problem 2: "id" must be less than 128 characters

In addition to limiting its length, AWS CloudSearch mandates that the ID contain only alphanumeric characters plus these: _ - = # ; : / ? @ &. The following should address both the length and character restrictions (it goes in your crawler config):

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <replacements>
    <!-- Example of dropping an unwanted parameter
         (omitting <replacement> removes the matched text). -->
    <replace><match>&amp;view=print</match></replace>
    <!-- Replace every character CloudSearch does not allow in an ID
         with an underscore. -->
    <replace>
      <match><![CDATA[[^\w\-=#;:/\?@&]]]></match>
      <replacement>_</replacement>
    </replace>
    <!-- Keep only the first 127 characters; the trailing .* drops the rest. -->
    <replace>
      <match>^(.{1,127}).*</match>
      <replacement>$1</replacement>
    </replace>
  </replacements>
</urlNormalizer>

There is a problem with this approach: you may create duplicate URLs if the truncated portion is what made a URL unique. If this is an issue and you know of a unique metadata field in the pages you crawl, it may be best to use that as the CloudSearch ID. Alternatively, you can create your own IURLNormalizer to dynamically generate a unique ID that is constant for each document and short enough.
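As a rough sketch of that last idea (the class below is hypothetical, assuming the IURLNormalizer interface from HTTP Collector 2.x), a custom normalizer could leave short URLs untouched and, for long ones, keep a readable prefix plus a SHA-256 digest of the full URL so two distinct URLs cannot collapse into the same ID:

package com.example.crawler;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import com.norconex.collector.http.url.IURLNormalizer;

// Hypothetical example; not part of the Norconex distribution.
public class ShortIdURLNormalizer implements IURLNormalizer {

    private static final int MAX_LENGTH = 127; // AWS CloudSearch ID limit

    @Override
    public String normalizeURL(String url) {
        if (url == null || url.length() <= MAX_LENGTH) {
            return url; // already short enough: keep the URL as the ID
        }
        // SHA-256 as lowercase hex is 64 characters, all CloudSearch-safe.
        String digest = sha256Hex(url);
        int prefixLength = MAX_LENGTH - digest.length() - 1;
        return url.substring(0, prefixLength) + "_" + digest;
    }

    private static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}

You would then reference it in your crawler config in place of the generic one, e.g. <urlNormalizer class="com.example.crawler.ShortIdURLNormalizer"/>. Keep in mind the normalized value is also the reference the crawler works with, and the kept prefix may still contain characters CloudSearch rejects, so you may want to combine this with a character replacement rule like the one above.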

mhaegeleD commented 7 years ago

Thank you very much!