Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or a filesystem and sending it to various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Is it possible to store a list of objects in metadata? #659

Closed. LeMoussel closed this issue 4 years ago

LeMoussel commented 4 years ago

I have a list of objects like this:

 { "url": "http://www.example.com/url1", "class": "classA" },
 { "url": "http://www.example.com/url2", "class": "classB" },
 { "url": "http://www.example.com/url3", "class": "classA" },
.....

How do I store this list in metadata?

essiembre commented 4 years ago

Can you give a bit more context? Is this list found in an HTML page, within <script> tags? Is it a .json file? In other words, where does it come from?

LeMoussel commented 4 years ago

Yes, the list is found in an HTML page. I extract it with an IDocumentTagger implementation like this:

public class customObject
{
  private String href;
  private String hrefClass;

  public customObject(String href, String hrefClass)
  {
    this.href = href;
    this.hrefClass = hrefClass;
  }

  public String toString()
  {
    return "href: " + href + " hrefClass: " + hrefClass;
  }
}

public class CustomDocumentTagger implements IDocumentTagger {
  /**
   * Tags a document with extra metadata information.
   * @param reference document reference (e.g. URL)
   * @param document document
   * @param metadata document metadata
   * @param parsed whether the document has been parsed already or not (a
   *        parsed document should normally be text-based)
   * @throws ImporterHandlerException problem tagging the document
   */
  public void tagDocument(final String reference, final InputStream document, final ImporterMetadata metadata,
      final boolean parsed) throws ImporterHandlerException {

      /* Do some stuff to extract link information .... */

      // For example
      List<customObject> listCustomObject = new ArrayList<customObject>();

      listCustomObject.add(new customObject("http://www.example.com/url1", "classA"));
      listCustomObject.add(new customObject("http://www.example.com/url2", "classB"));
      listCustomObject.add(new customObject("http://www.example.com/url3", "classA"));

      /* For debugging
      for (customObject myCustomObject : listCustomObject) {
          System.out.println(myCustomObject);
      }
      */

      // => How to store listCustomObject in metadata?
      metadata.????("collector.extend-referenced-urls", listCustomObject);
  }
}

essiembre commented 4 years ago

Metadata are stored as (multi-value) strings, so you would have to convert your list to a string equivalent (e.g. JSON) and store it in the metadata field of your choice (new or existing).
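To illustrate the conversion described above, here is a minimal sketch that turns each object into a JSON string and keeps one string per list entry. Everything in it is an assumption made for illustration: the `LinkInfo` class, the `toMetadataValues` helper, and the plain `Map<String, List<String>>` that stands in for the multi-valued metadata. In a real tagger the map would be replaced by a call on `ImporterMetadata`; its `Properties` parent class likely offers string setters such as `addString` or `setString` in the 2.x line, but check the Javadoc for your version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LinkMetadataSketch {

    /** Hypothetical value object mirroring the one in the question. */
    static final class LinkInfo {
        final String url;
        final String cssClass;

        LinkInfo(String url, String cssClass) {
            this.url = url;
            this.cssClass = cssClass;
        }

        /** Serializes this entry as a small JSON object. */
        String toJson() {
            return "{\"url\":\"" + escape(url)
                    + "\",\"class\":\"" + escape(cssClass) + "\"}";
        }

        // Minimal escaping (backslash and double quote only).
        // A JSON library is the safer choice for arbitrary input.
        private static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }

    /** Converts each link to a JSON string, one metadata value per link. */
    static List<String> toMetadataValues(List<LinkInfo> links) {
        List<String> values = new ArrayList<>();
        for (LinkInfo link : links) {
            values.add(link.toJson());
        }
        return values;
    }

    public static void main(String[] args) {
        List<LinkInfo> links = new ArrayList<>();
        links.add(new LinkInfo("http://www.example.com/url1", "classA"));
        links.add(new LinkInfo("http://www.example.com/url2", "classB"));

        // Stand-in for the multi-valued, string-only metadata store.
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("collector.extend-referenced-urls", toMetadataValues(links));

        for (String value : metadata.get("collector.extend-referenced-urls")) {
            System.out.println(value);
        }
    }
}
```

Storing one JSON string per entry, rather than one string for the whole array, keeps each link individually addressable as a separate value of the same field.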

There may be ways to do it via configuration as well. If you are dealing with JSON, the format should be fairly static. If so, you can try using existing taggers such as the TextPatternTagger:

<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
  <pattern field="collector.extend-referenced-urls">
    \{\s*"url":\s*".*?",\s*"class":\s*".*?"\s*\}
  </pattern>
</tagger>
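To see what the regular expression in the configuration above picks up, the following sketch runs the same expression with `java.util.regex` against a made-up page fragment. The `PatternSketch` class, the `findEntries` helper, and the sample HTML are hypothetical; the sketch only demonstrates the pattern itself and does not reproduce how TextPatternTagger stores its matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSketch {

    // Same expression as in the TextPatternTagger configuration,
    // with backslashes and quotes escaped for a Java string literal.
    static final Pattern ENTRY = Pattern.compile(
            "\\{\\s*\"url\":\\s*\".*?\",\\s*\"class\":\\s*\".*?\"\\s*\\}");

    /** Returns every substring of the text that matches the pattern. */
    static List<String> findEntries(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment containing the list from the question.
        String html = "<script>var links = ["
                + "{ \"url\": \"http://www.example.com/url1\", \"class\": \"classA\" },"
                + "{ \"url\": \"http://www.example.com/url2\", \"class\": \"classB\" }"
                + "];</script>";

        for (String entry : findEntries(html)) {
            System.out.println(entry);
        }
    }
}
```

The lazy quantifier `.*?` stops each value at its closing quote, so every object is matched separately. The pattern assumes the keys always appear in the order `url`, then `class`.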

LeMoussel commented 4 years ago

Great. Thanks a lot.