Norconex / committer-core

Norconex Committer is a Java library and command-line application used to route content to local or remote target repositories, such as a search engine index.
http://www.norconex.com/collectors/committer-core
Apache License 2.0

Committer queue is not fully processed #16

Open jsteggink opened 6 years ago

jsteggink commented 6 years ago

The committer queue is not fully processed because each commit is capped by the queueSize property. Since the queue can grow larger than queueSize and is only processed when a commit is called, the file queue grows and grows.

https://github.com/Norconex/committer-core/blob/master/norconex-committer-core/src/main/java/com/norconex/committer/core/AbstractFileQueueCommitter.java#L175

essiembre commented 6 years ago

When using AbstractFileQueueCommitter directly, commit will only be called at the end, as you mention, unless you call it yourself more frequently. If you want it to be called after X number of documents, have a look at the AbstractBatchCommitter subclass, which does it for you.

Does that address your issue?

jsteggink commented 6 years ago

Thanks for your reply, Pascal. I came across this issue while using the Solr Committer. Since it builds on AbstractFileQueueCommitter (Committer Core), I'm posting the issue here.

The commit is also called by commitIfReady() (AbstractCommitter), which in turn is called by the "add" and "remove" methods in AbstractCommitter. commitIfReady() also checks the queue size, which means commit is only run from there. However, the queue size limits in commitIfReady() and commit() are the same, and multiple threads can call these methods concurrently. Items keep being added to the queue while a commit takes some time to process, so the queue grows and grows. A quick solution would be to remove the queue limit in the commit() method.
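To illustrate the mechanism described above, here is a minimal, single-threaded simulation. It is not the committer-core code: the class name, method names, and the `arrivalsDuringCommit` parameter are assumptions made up for this sketch, and concurrent adds are modeled simply as a fixed number of items arriving before each commit runs.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; none of these names exist in committer-core.
class CappedCommitSketch {

    /** Commits at most {@code cap} items per call, like a commit limited by queueSize. */
    static int commitCapped(Deque<String> queue, int cap) {
        int committed = 0;
        while (committed < cap && !queue.isEmpty()) {
            queue.poll(); // stand-in for sending one document to the target
            committed++;
        }
        return committed;
    }

    /** Keeps committing in batches of {@code cap} until the queue is empty. */
    static int commitUntilEmpty(Deque<String> queue, int cap) {
        int total = 0;
        while (!queue.isEmpty()) {
            total += commitCapped(queue, cap);
        }
        return total;
    }

    /**
     * Simulates a crawl of {@code totalDocs} documents and returns the
     * number of items still queued when the crawl ends.
     */
    static int simulate(int totalDocs, int cap, int arrivalsDuringCommit,
            boolean drainFully) {
        Deque<String> queue = new ArrayDeque<>();
        int added = 0;
        while (added < totalDocs) {
            queue.add("doc-" + added++);
            if (queue.size() >= cap) { // same threshold the commit uses
                // Assumption: other threads add more items before the commit finishes.
                for (int i = 0; i < arrivalsDuringCommit && added < totalDocs; i++) {
                    queue.add("doc-" + added++);
                }
                if (drainFully) {
                    commitUntilEmpty(queue, cap);
                } else {
                    commitCapped(queue, cap);
                }
            }
        }
        return queue.size();
    }
}
```

With a capped commit, `simulate(100, 10, 15, false)` ends with a backlog, because each commit removes fewer items than arrive around it. With `drainFully` set to `true`, the same run ends with an empty queue.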

The commit() method takes time because many things happen in it, mainly rereading the complete directory of the file queue and iterating over filesToCommit.

I would suggest building a more robust commit queue, maybe based on events. Something like RxJava could help make committer-core more pluggable so people can implement their own queues.
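As a rough idea of what "pluggable" could look like, here is a minimal sketch in plain Java. It does not use RxJava, and every type in it (`CommitQueue`, `InMemoryCommitQueue`, `PluggableCommitter`) is hypothetical and not part of committer-core.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

/** Hypothetical queue abstraction that a committer could be given. */
interface CommitQueue<T> {
    void enqueue(T item);

    /** Removes and returns up to {@code max} items; empty list if nothing is queued. */
    List<T> dequeueBatch(int max);

    boolean isEmpty();
}

/** In-memory implementation; a file- or database-backed queue could implement the same interface. */
class InMemoryCommitQueue<T> implements CommitQueue<T> {
    private final ConcurrentLinkedQueue<T> items = new ConcurrentLinkedQueue<>();

    @Override
    public void enqueue(T item) {
        items.add(item);
    }

    @Override
    public List<T> dequeueBatch(int max) {
        List<T> batch = new ArrayList<>();
        T item;
        while (batch.size() < max && (item = items.poll()) != null) {
            batch.add(item);
        }
        return batch;
    }

    @Override
    public boolean isEmpty() {
        return items.isEmpty();
    }
}

/** Hypothetical committer that works with any CommitQueue implementation. */
class PluggableCommitter<T> {
    private final CommitQueue<T> queue;
    private final int batchSize;
    private final Consumer<List<T>> target; // stand-in for the target repository

    PluggableCommitter(CommitQueue<T> queue, int batchSize,
            Consumer<List<T>> target) {
        this.queue = queue;
        this.batchSize = batchSize;
        this.target = target;
    }

    void add(T item) {
        queue.enqueue(item);
    }

    /** Sends batches to the target until the queue is empty; returns the item count. */
    int commit() {
        int committed = 0;
        List<T> batch;
        while (!(batch = queue.dequeueBatch(batchSize)).isEmpty()) {
            target.accept(batch);
            committed += batch.size();
        }
        return committed;
    }
}
```

The batch size here limits only how many items are sent per request, not how many are processed per commit, so a single `commit()` call leaves nothing behind.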

What do you think?

essiembre commented 6 years ago

If it can't keep up right now, you may have to slow it down, unfortunately. I agree the queue could be improved, and I am already sold on the idea of being able to supply your own queue. I am marking this as a feature request.

The Committers will be seriously revisited in the next major version and something like RxJava will be given consideration. Have you used RxJava in a few projects yourself? Any examples?

jsteggink commented 6 years ago

I have some experience with Reactor, which is another Reactive Streams framework. It's also used by the Spring framework. It would be my first choice, as it targets Java 8 and integrates easily with Kafka and RabbitMQ.

truezjz commented 5 years ago

Hi Pascal,

I'm also facing this issue. After crawling 174K files, there are 12,000 files left unprocessed in the committer-queue folder. I understand the issue will be addressed in the next version; is there a workaround for the time being?

jsteggink commented 5 years ago

Last year I fixed it. I can come up with a pull request tomorrow.

truezjz commented 5 years ago

Thanks for the prompt response, Jeroen. I will check it out. That will be in committer-core, right?

jsteggink commented 5 years ago

Yes, it's committer-core. I need a little extra time to write some unit tests, and I have some changes in my fork that I haven't committed yet. In the meantime you can take a look at what I did: https://github.com/jsteggink/committer-core I have added a reactive committer and a persistent queue based on RocksDB. This makes committer-core more stable, much faster, and potentially more scalable. In the future, different persistent queue implementations could be added.

truezjz commented 5 years ago

Hi Jeroen, thanks for the update. As you mentioned, the current code in your fork is not your final submission, so I'm looking forward to the final one. Meanwhile, as you mentioned, as a temporary solution I can set the queueSize in the committer to an unlimited number, but the disadvantage is that it needs a lot of storage to hold the committer queue, correct?

truezjz commented 5 years ago

@jsteggink Jeroen, any update? If needed, I can help with the tests.