Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

custom metadata tag - google cloud search plugin #752

Closed ericwhiteau closed 3 years ago

ericwhiteau commented 3 years ago

Hello, I am setting up Norconex with the Cloud Search plugin.

I have had some success with a small test and am now attempting to set it up for a full site.

Standard tagging is working and covers much of the configuration (which is great).

There are some more specific data points in a script element of type "application/json".
I had a look at the DOMTagger and ScriptTagger,

but I am not sure how to reference the element and get results out.

Here is an example snippet. Assume I want to get one of the elements from the serialised JSON as a tag. What would be your suggested approach?

<script type="application/json" data-drupal-selector="drupal-settings-json">
{"path":{"baseUrl":"\/","scriptPath":null,"pathPrefix":"","currentPath":"node\/22191","currentPathIsAdmin":false,"isFront":false,"currentLanguage":"en"}, ...... }

Thanks in advance. Eric.

essiembre commented 3 years ago

Are you talking about web pages that are Ajax/JavaScript-driven? If so, you may want to use web browser capabilities for crawling. Have a look at: https://github.com/Norconex/collector-http/issues/739#issuecomment-799107309

ericwhiteau commented 3 years ago

Thanks. It's not a JavaScript-driven site; it just has one JSON object containing metadata that they read in Google Analytics.

I want several of the same metrics, and I was trying to read them without going down the headless client approach.

I may just try to regex the content that I want.

Thanks for looking.

essiembre commented 3 years ago

Have you found a way to achieve what you want?

If not, I suggest you look at the DOMTagger to first extract the <script> content and store it in a field of your choice. It could look like this (untested):

  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="script[data-drupal-selector=drupal-settings-json]" extract="data" toField="MyJson" />
  </tagger>

Then, you can use the ScriptTagger to read the JSON object and store what you want from it into new fields.
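
A possible ScriptTagger sketch for that second step (also untested). It assumes the Importer 2.x `<tagger>` syntax used above, the default JavaScript script engine, and that the "MyJson" field holds the raw JSON text. The target field name "drupalCurrentPath" is made up for illustration, and "path.currentPath" is simply one of the keys visible in the snippet posted earlier:

  <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
    <script><![CDATA[
      // "MyJson" is the field the DOMTagger is assumed to have populated.
      var raw = metadata.getString('MyJson');
      if (raw != null) {
          // Convert the Java string to a JavaScript string before parsing.
          var settings = JSON.parse(String(raw));
          // Copy one value into its own field ("drupalCurrentPath" is a hypothetical name).
          if (settings.path && settings.path.currentPath) {
              metadata.addString('drupalCurrentPath', settings.path.currentPath);
          }
      }
    ]]></script>
  </tagger>

The ScriptTagger has to come after the DOMTagger in the handler list so the field exists when the script runs. Other keys from the JSON could be copied the same way, one `metadata.addString(...)` call per field.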

ericwhiteau commented 3 years ago

Thank you. I have been working on other parts of this project, but I will test your suggestion.

All going well so far.

Cheers.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.