Norconex / committer-core

Norconex Committer is a Java library and command-line application used to route content to local or remote target repositories, such as a search engine index.
http://www.norconex.com/collectors/committer-core
Apache License 2.0

Custom MySQL committer implementation #3

Closed AntonioAmore closed 10 years ago

AntonioAmore commented 10 years ago

Hello!

I'm writing my own committer implementation to put collected pages into a MySQL database.

As an example I've taken SolrCommitter - is that the right decision?

So I inherited from AbstractMappedCommitter, and while implementing the commitBatch(List list) method I have the following questions:

  1. Do I have to insert all of the list's items into the MySQL DB in that method? (In my opinion, it receives an accumulated portion of data to put into storage.)
  2. How can I get metadata from an item, such as the URL, date of crawling, and text of the page?
  3. What did you use the factory for in SolrCommitter? I'm trying to figure out whether I need it too.
  4. How can I set the batch size, for example in the config? There were some methods in SolrCommitter, but they are marked as deprecated. What is the right way to do it?

I tried to figure it out myself, but got lost in the code - I'm not experienced in Java yet. Thank you a lot.

essiembre commented 10 years ago

Hello! First, congrats on writing your own Committer! If you feel it can be generic enough when you are done, I can link to it as a third-party contribution if you like.

Now some answers:

0 - Is using AbstractMappedCommitter the right decision? It depends. What that class does is take care of document queuing for you so you can commit in batches, and the "Mapped" in the name means it offers configuration options to map the ID and content fields from crawled document metadata to the ID and content fields in your target repo (MySQL in your case). If you do not care about submitting in batches, or you do your own metadata-fields-to-table-fields associations in code, you can opt for implementing the ICommitter interface directly, where in the queueXXX methods you insert into MySQL, and in the commit() method you commit the database (just a suggestion).

  1. You are right, it keeps files on the local file system, in a queue, until you process them all in that method. That's for batching, since target repositories are often more efficient when doing batch inserts. With your MySQL instance, you can check whether you benefit from batching. If not, consider simply implementing ICommitter to push directly to MySQL as documents are ready (setting the batch size to zero in AbstractMappedCommitter may also do this, I suppose).
  2. Unless you restricted them using something like a KeepOnlyTagger in your config, all extracted metadata is kept and available in the Properties metadata variable passed to the committer methods. What I recommend you do is write yourself a bit of code that will print the content of that variable, or temporarily use the FileSystemCommitter and open up a generated *.meta file to get a list of all fields attached to your documents. You can then pick and choose, and even rename or manipulate these fields before sending (either using existing taggers, transformers, etc. in your config, or programmatically).
  3. You do not need to create such a factory, especially if your code is not meant to be generic. For Solr, there are a few different classes you can use to connect to it, and the factory is just so people can create custom ways to connect while using the rest of the Solr Committer as is (the factory is passed in the constructor). It is not something mandated by the API at all.
  4. You'll find the configuration details in the javadoc (http://www.norconex.com/product/committer/apidocs/com/norconex/committer/AbstractMappedCommitter.html). There are two configuration options for size using that committer:
<commitBatchSize>
    (max number of documents to send to target repository at once)
</commitBatchSize>
<queueSize>
    (max queue size before committing)
</queueSize>
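If you instead go the direct-ICommitter route suggested in answer 0, the shape might look something like the sketch below. To keep it compilable on its own, it does not actually implement the Norconex interface (the queueAdd/queueRemove/commit methods only mirror its shape), a plain Map stands in for the Properties metadata, and the table and column names (pages, url, title) are invented for illustration; real code should also prefer JDBC PreparedStatement batches over string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a real committer would declare
// "implements ICommitter" and receive Norconex's Properties metadata;
// a plain Map stands in here so the sketch compiles alone.
class MySqlCommitterSketch {

    // SQL queued between commits. With JDBC you would typically
    // queue PreparedStatement batches instead of raw strings.
    private final List<String> pending = new ArrayList<>();

    // Mirrors the shape of ICommitter's queue-add method.
    public void queueAdd(String reference, Map<String, String> metadata) {
        // Table and column names are made up for the example.
        pending.add("INSERT INTO pages (url, title) VALUES ('"
                + escape(reference) + "', '"
                + escape(metadata.get("title")) + "')");
    }

    // Mirrors the shape of ICommitter's queue-remove method.
    public void queueRemove(String reference) {
        pending.add("DELETE FROM pages WHERE url = '" + escape(reference) + "'");
    }

    // Mirrors ICommitter#commit(): flush everything at once. Returns
    // the flushed SQL so the sketch is easy to inspect; real code would
    // execute it on a java.sql.Connection with setAutoCommit(false)
    // followed by connection.commit().
    public List<String> commit() {
        List<String> flushed = new ArrayList<>(pending);
        pending.clear();
        return flushed;
    }

    // Minimal quoting for the demo; use PreparedStatement parameters
    // in real code instead.
    static String escape(String s) {
        return s == null ? "" : s.replace("'", "''");
    }
}
```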

Please let me know if that answers your question or if you have more. Thanks.

AntonioAmore commented 10 years ago

Thanks a lot for your answer.

I've chosen AbstractMappedCommitter because I really want to map metadata to different fields of the database table and make it highly configurable. Please tell me:

  1. From the link you provided I get the following info:
    • idSourceField - the name of the metadata field representing the data source (the URL from which the sample was collected)
    • idTargetField - the name of the MySQL field to store the URL in
    • contentSourceField - the metadata field containing the collected page
    • contentTargetField - the MySQL field to put the page's content in
      And those fields should be described in the committer's XML configuration. Am I right?
  2. commitBatch() gets a list containing commitBatchSize (or fewer) elements?
  3. Let's suppose the commitBatch() method contains lines like these:
for (ICommitOperation op : list) {
    if (op instanceof IAddOperation) {
        // insert/update the page in MySQL
    } else if (op instanceof IDeleteOperation) {
        // remove the page from MySQL
    } else {
        throw new CommitterException("Unsupported operation: " + op);
    }
}

I viewed the metadata files and can recognize field names there, but I am still unable to figure out how to read the metadata, or the mapped fields, inside the method in order to write the MySQL query. Could you provide a link to a file from the project that I could use as an example? The IAddOperation methods didn't help me. Sorry for asking such elementary things.

essiembre commented 10 years ago

Don't be sorry for asking!

  1. While it is perfectly fine to set those values programmatically, you are right: those are typically configured via XML. Keep in mind only the ID and content are dealt with in this case. This is for convenience, since those two are often required by target systems, while other fields can be anything. If you have many fields you would like to remap, I would not necessarily try to reinvent the wheel when there are importer handlers to do so, like the RenameTagger.
  2. You are correct. Basically the quantity of operations being passed will be no larger than that configuration value. The idea is for you to send all of it to MySQL at once.
  3. In your case, like pretty much any case using the batch committer, your operations will implement either IAddOperation or IDeleteOperation. IAddOperation has a getMetadata() method on it. It returns a Properties object, which holds all the fields detected so far. So you would retrieve them by calling get methods on that Properties object, such as getString("myTextFieldName") or getInt("myNumericFieldName"). If you call the keySet() method on it, you will get all metadata field names present.
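To tie that back to the commitBatch() loop from the earlier comment, here is a self-contained sketch of the branching-plus-metadata-reading part. The Norconex types are replaced with minimal stand-ins so it compiles without the library; in the real committer you would branch on IAddOperation/IDeleteOperation and call getMetadata().getString(...) instead, and the field name "title" is only an example - dump keySet() once to see what your crawl actually captured.

```java
import java.util.List;
import java.util.Map;

// Stand-ins for Norconex's ICommitOperation/IAddOperation/IDeleteOperation
// so this sketch compiles without the library on the classpath.
class CommitBatchSketch {

    interface Op { String getReference(); }

    static class AddOp implements Op {
        private final String reference;
        private final Map<String, String> metadata;
        AddOp(String reference, Map<String, String> metadata) {
            this.reference = reference;
            this.metadata = metadata;
        }
        public String getReference() { return reference; }
        // Stands in for IAddOperation#getMetadata() returning Properties.
        public Map<String, String> getMetadata() { return metadata; }
    }

    static class DeleteOp implements Op {
        private final String reference;
        DeleteOp(String reference) { this.reference = reference; }
        public String getReference() { return reference; }
    }

    // Branch on the operation type, then pull fields out of the
    // metadata. "title" is an example field name only.
    static String handle(Op op) {
        if (op instanceof AddOp) {
            AddOp add = (AddOp) op;
            String title = add.getMetadata().getOrDefault("title", "");
            // Real code: build and run an INSERT from these values.
            return "ADD " + add.getReference() + " [title=" + title + "]";
        } else if (op instanceof DeleteOp) {
            // Real code: DELETE FROM ... WHERE url = reference.
            return "DELETE " + op.getReference();
        }
        throw new IllegalArgumentException("Unsupported operation: " + op);
    }

    // Mirrors the commitBatch(List) loop from the question.
    static void commitBatch(List<Op> batch) {
        for (Op op : batch) {
            System.out.println(handle(op));
        }
    }
}
```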

Any clearer?

AntonioAmore commented 10 years ago

Thank you!

That's clear - typecasting to more specialized interfaces like IAddOperation works and shows me the picture.

It's a pity I don't yet have enough experience to offer my committer to the community - it is still too specialized. I hope that in the future I can write it in a more generic manner.

essiembre commented 10 years ago

No worries, you have to start somewhere and you seem on the right track. Keep it up! :-)