Closed: ericwhiteau closed this issue 3 years ago
Are you talking about web pages that are Ajax/JavaScript-driven? If so, you may want to use web browser capabilities for crawling. Have a look at: https://github.com/Norconex/collector-http/issues/739#issuecomment-799107309
Thanks. It's not a JavaScript-driven site; it just has one JSON object containing metadata that they read in Google Analytics.
I want several of the same metrics, and I was trying to read them without going down the headless-client route.
I may just try to regex the content that I want.
Thanks for looking.
Have you found a way to achieve what you want?
If not, I suggest you look at the DOMTagger to first extract the <script> content and store it in a field of your choice.
It could look like this (untested):
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="script[data-drupal-selector=drupal-settings-json]" extract="data" toField="MyJson" />
</tagger>
Then, you can use the ScriptTagger to read the JSON object and store what you want from it into new fields.
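A minimal sketch of that second step, equally untested. It assumes the DOMTagger above has already stored the raw JSON in the MyJson field. The `path.currentPath` key and the `currentPath` target field are hypothetical placeholders, since the actual JSON is not shown in this thread. `metadata.addString(...)` is the method name I would expect in Importer 2.x (the version that uses `<tagger>` elements); newer releases appear to use `metadata.add(...)` instead, so check it against the version you are running.

```xml
<tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger" engineName="JavaScript">
  <script><![CDATA[
    // Raw JSON text stored by the DOMTagger; null if the selector matched nothing.
    var raw = metadata.getString('MyJson');
    if (raw != null) {
      // String(...) turns the Java string into a JavaScript string before parsing.
      var settings = JSON.parse(String(raw));
      // Hypothetical key: replace "path.currentPath" with the value you actually need.
      if (settings.path && settings.path.currentPath) {
        // Assumed 2.x method name; newer versions may need metadata.add(...).
        metadata.addString('currentPath', settings.path.currentPath);
      }
    }
  ]]></script>
</tagger>
```

The ScriptTagger has to be declared after the DOMTagger so that MyJson is already populated when the script runs.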
Thank you. I have been working on other parts of this project, but I will test your suggestion.
All going well so far.
Cheers.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello, I am setting up Norconex with the Cloud Search plugin.
I have had some success with a small test and am now attempting to set it up for a full site.
Standard tagging is working and picks up much of the metadata (which is great).
There are some more specific data points in a script element of type "application/json".
I had a look at the DOMTagger and ScriptTagger,
but I am not sure how to reference the element and get results out.
Here is an example snippet. Assume I want to get one of the elements from the serialised JSON as a tag. What would be your suggested approach?
Thanks in advance. Eric.