Norconex / committer-core

Norconex Committer is a Java library and command-line application used to route content to local or remote target repositories, such as a search engine index.
http://www.norconex.com/collectors/committer-core
Apache License 2.0

Change XML output file name #13

Closed by fleitonSearch 7 years ago

fleitonSearch commented 7 years ago

Hi Pascal,

I was taking a look at the XMLFileCommitter in this webpage: https://www.norconex.com/collectors/committer-core/latest/apidocs/com/norconex/committer/core/impl/XMLFileCommitter.html

and I was wondering whether there is any way to change the name of the XML output file, instead of getting an XML file named with the timestamp of the crawl?

Something like this: if we are crawling the "www.example.com/A1" website and it has a title like "Example 1", I'd like the XML output file to be named after the title or the URL of the website, for example "Example 1-.xml" or "A1-.xml". Basically, any way to change the output file name to something more representative, so files are easier to find when I'm crawling thousands of pages.

Thanks for your time! :)

essiembre commented 7 years ago

I am afraid this is not simple. Depending on the configuration, a single XML file could contain any number of documents from various websites. How could we give the file a unique name then?

The latest committer-core snapshot now lets you specify an optional fileNamePrefix and fileNameSuffix. This is not what you described, but it can get you closer. If you have multiple crawlers, each crawling a unique site, you could prefix the file names with the name of the site each crawler targets.
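
As a rough sketch of the prefix idea, a crawler dedicated to a single site might declare its committer along these lines. This assumes the two options are exposed as XML configuration elements with the same names, that the committer is declared with the class name from the API docs linked above, and that the output location is set with a directory element. The path and the prefix/suffix values are made up for illustration, so check the XMLFileCommitter documentation for the snapshot you are running.

```xml
<!-- Sketch only: element names other than fileNamePrefix/fileNameSuffix
     are assumptions; the path and values below are hypothetical. -->
<committer class="com.norconex.committer.core.impl.XMLFileCommitter">
  <!-- Hypothetical output directory for this crawler's XML files -->
  <directory>/path/to/committed-xml</directory>
  <!-- Hypothetical prefix naming the site this crawler targets -->
  <fileNamePrefix>example-com_</fileNamePrefix>
  <!-- Hypothetical suffix; this option is optional as well -->
  <fileNameSuffix>_A1</fileNameSuffix>
</committer>
```

With one such block per crawler, files from different sites would be distinguishable by name, although each file could still hold many documents from that site.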

If you have thousands of files and have a hard time locating specific ones, I would consider using a different Committer, such as one that stores documents in a search engine or SQL database. There is also grep (or its Windows equivalent). :-)

fleitonSearch commented 7 years ago

Yeah, it's not what I described, but it will help. I'm going to store it in Solr, but first I'm exploring the functionality.

Thanks Pascal! 👍

essiembre commented 7 years ago

No problem.