Closed: nuliknol closed this issue 8 years ago
This is the page I use as the start URL:
<html>
<a href="http://www.zzzzzzzz.com/ukr/">one</a>
<br>
<a href="http://www.xxxxxxxx.nz/">two</a>
</html>
<!-- Execution starts at 1456861631.4395 (1456861631)
Time total: 0.018090009689331 (1456861631.4576) diff=0
-->
So, it gets this page, processes it, and loops.
I have been testing it, and I found that the bug is not always reproducible: in about 3 out of 10 runs it loops, and the other 70% of the time it runs fine. No web traffic is observed on the ethernet card during the loop, so I suspected a race condition in thread communication.

I changed to 5 threads (in my example I have 100 threads, and my page "two" has about 5,000 links). With 5 threads I wasn't able to reproduce the bug no matter how many times I ran the tests. But wait: I have a 1,000-millisecond delay, which would hide any thread locking issue! So I changed the delay to 0 and increased the number of threads to 200. With this configuration I was able to lock up the crawler within the first 10 seconds of the run in about 50% of attempts. Apparently it doesn't matter what filters are in place or what pages you download; just running many threads with zero delay creates the locking issue, and it loops forever. I observe 100% CPU usage with no network traffic detected on the ethernet interface.

Race conditions between threads are very difficult to find: you have to check all the places where the threads take locks and analyze whether a deadlock between them is possible. In my tests, lockups start at around 10 threads with zero delay, but on another machine it may be a different number. It also doesn't mean that running, say, 5,000 threads will lock up, because by the time thread #4999 starts, the first thread may already be releasing all its resources, so the race may not occur.

A good test page for this case would be a dummy site with a lot of links and no images or other objects. In my case I download nothing other than html/text, which makes the crawler process the data fast and makes it more prone to a race condition between threads. I am available for debugging if you tell me what to do.
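The two settings being toggled above are just two elements in the crawler configuration. This is a hedged sketch following the Norconex HTTP Collector XML config conventions (the `repro` crawler id is made up; check the element names against your version's docs):

```xml
<crawler id="repro">
  <!-- High thread count plus zero delay maximizes contention -->
  <numThreads>200</numThreads>
  <delay default="0" />
</crawler>
```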
Looks like you uncovered a nasty one. I had people use it very aggressively before without triggering that condition. I will try my best to reproduce.
I am running it right now with 200 threads but a 5,000 ms delay; no problems so far. Once it starts crawling, I suspect it will be difficult to enter the lock. To reproduce, you have to direct the crawler to a starting page with hundreds of links.
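The repro recipe (many threads, zero delay, one shared structure hammered at full speed) can be sketched as a standalone harness. This is illustrative JDK-only code, not Norconex internals, and it only demonstrates the contention pattern, not the MapDB lockup itself:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal stress harness: release all worker threads at once and have them
// hit a shared structure with no delay between operations.
public class StressHarness {
    public static int run(int threads, int opsPerThread) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // all threads begin at the same instant
                } catch (InterruptedException e) {
                    return;
                }
                for (int i = 0; i < opsPerThread; i++) {
                    processed.incrementAndGet(); // zero delay: maximum contention
                }
            });
        }
        start.countDown(); // fire the starting gun
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(200, 1000)); // prints 200000
    }
}
```

With a correctly synchronized store every operation is accounted for; a store with a locking bug would instead hang or lose updates under exactly this load.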
To make it easier for you I made a video of the bug. We crawl sensitive information so if you give me an email address I can send you the link to download the video and also configuration files so you can reproduce it on your side and make tests.
You can use the email address in my github profile.
I see in your stacktrace the issue comes from MapDB. I can try to upgrade to the latest version of MapDB to see if that fixes the problem but before I do so, can you try with another database to store the URL? Can you try this inside your crawler configuration:
<crawlDataStoreFactory class="com.norconex.collector.core.data.store.impl.mvstore.MVStoreCrawlDataStoreFactory" />
Nope. Not reproducible with MVStore. Tried about 25 times, runs perfectly. Switched back to MapDB and got it locked on the first one. It is a MapDB issue then.
Have you been able to reproduce in your environment?
I was about to spend time to reproduce but the moment I saw it was MapDB I stopped. You confirmed it is indeed the issue so I suggest you stick with MVStore for now. I will leave this ticket open as a bug. I will upgrade the MapDB library when I get a chance and have you test again if you do not mind.
perfect!
I made a new snapshot release with the latest MapDB library version. I suggest you install it in a new location to avoid duplicate jars with different versions.
Let me know if that version fixes the issue. Otherwise, I may make MVStore the default URL store if it works just fine for you (I also updated that library version in that snapshot release).
From @nuliknol:
Nope, same bug, even worse. I ran it twice and both times it froze.
Keep using the MVStore implementation then. Since nobody ever reported issues after using MVStore, I will make it the default implementation in the next release.
Marking it as fixed since the default crawl data store in the latest snapshot release has been changed to the more stable MVStore.
Please confirm the snapshot release (with MVStore as default crawl store) fixes your issue.
@nuliknol: Have you witnessed any thread locking issues with the snapshot release (using the new MVStore default)? Can we close?
The fix was released in 2.5.0, which now uses MVStore as the default crawl data store and does not have this issue.
I made it get into an infinite loop with these rules:
This is the log file:
The crawler did everything correctly, since it was just a simple 2-URL page for testing purposes, and it should have exited without any problem, but it stalled. After a couple of minutes of watching the "java" process consuming 90% of the CPU, I decided to Ctrl-C to interrupt it, and this is what I got on the console:
This is my config file. Real URLs have been replaced for privacy reasons:
I can debug it if you tell me what to do.
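One low-effort way to see where the looping threads are stuck is a thread dump, either with the JDK's `jstack <pid>` tool (or `kill -QUIT <pid>` to print it on the process console) or programmatically via `ThreadMXBean`. A minimal JDK-only sketch, not Norconex code:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Print the state and top stack frames of every live thread: the same
// information `jstack` shows, without killing the stalled process.
public class ThreadDump {
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        // Pass true/true to include locked monitors and ownable synchronizers,
        // which is what reveals which thread holds the contested lock.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                .dumpAllThreads(true, true)) {
            sb.append(info); // name, state, stack trace, and held locks
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump().contains("main")); // the "main" thread is always listed
    }
}
```

In a spinning-at-100%-CPU hang, the dump would show the looping threads as RUNNABLE inside the same (here, MapDB) frames on every capture, rather than BLOCKED as in a classic deadlock.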