About the content: taggers do not modify the content, just the fields/metadata. Looking at your config, the HTML will be parsed normally, so it is expected that you get the content. "Transformers" are what modify the content, if that is what you want.
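For illustration, here is a minimal (untested) sketch of the distinction, assuming the Importer 2.x class paths; the field names and values are just placeholders:

```xml
<!-- A tagger: adds or changes metadata fields, leaves the content alone -->
<tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
  <constant name="source">my-crawler</constant>
</tagger>

<!-- A transformer: actually rewrites the document content -->
<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
  <replace>
    <fromValue>foo</fromValue>
    <toValue>bar</toValue>
  </replace>
</transformer>
```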
About the image tag you do not have: that is also normal, since you have your DOMTagger as a post-parse handler. Once parsing has occurred, the original document is gone and you are left with plain text (it is no longer an HTML document).
To work with a DOMTagger you have to follow the advice in its documentation: "Should be used as a pre-parse handler." Give that a try, but make sure you do not put your KeepOnlyTagger after it, or that will get rid of your IMAGE field.
Also, since the config file is XML, in case it makes a difference I would suggest you escape your angle bracket: selector="a&gt;img".
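Putting it together, something along these lines should work (an untested sketch; exact options may differ in your Importer version, and the field names are just examples):

```xml
<importer>
  <preParseHandlers>
    <!-- Runs before parsing, while the document is still HTML -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="a&gt;img" toField="IMAGE" extract="outerHtml" />
    </tagger>
    <!-- If you keep a KeepOnlyTagger, list IMAGE too or the field is dropped -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
      <fields>title,IMAGE</fields>
    </tagger>
  </preParseHandlers>
</importer>
```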
The metadata/fields are stored in a .meta file when you use the FileSystemCommitter. Are you using the FileSystemCommitter just to troubleshoot for now? Once you are satisfied with your crawler config, it is highly recommended that you use your own committer instead (to avoid needing a separate process to read those files).
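For troubleshooting, the declaration is as simple as this (a sketch; the directory path is just an example):

```xml
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <!-- Each committed document becomes a .cntnt/.meta/.ref file set here -->
  <directory>./crawledFiles</directory>
</committer>
```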
Do you still have questions/issues related to this ticket or can we close?
FYI, a new snapshot release of HTTP Collector was made with an updated Importer module in it. It gives DOMTagger more options.
Hi,
Sorry, I had to switch to another task and am only coming back to this one now.
Thanks for the explanations. It appears I am still missing a lot about the inner workings of the tool (particularly the pre/post-parse handlers, when HTML vs. plain text is handled, and the tagger/transformer distinction). That is what comes with learning while doing. In any case, I quite like the way Norconex's collector is designed (particularly compared to Heritrix): it relies on the composition of simple elements, which is closer to the KISS/Unix way of designing tools and allows more flexibility and modularity while keeping config files small and human-readable.
Yes, I'm using the FileSystemCommitter for the moment to get to grips with the tool, but I will of course implement a specific committer when I need to put our crawlers in production.
Thanks for the good feedback! While sometimes difficult to apply, KISS, flexibility, and modularity are indeed very important design drivers for us. I am glad you recognise that.
I just ran the following simple crawler:
Output was:
There are no errors in the logs, even though the extract field is wrong?!
By the way, even after correcting this mistake, the crawler does not seem to work: crawledFiles/xxx/xxx.cntnt contains the whole page as text instead of the IMAGE tag, while xxx.meta and xxx.ref look OK (xxx.ref contains the document reference). Do you have a hint as to where I am going wrong?