Omertron commented 9 years ago

Original issue 2546 created by Omertron on 2012-12-16T13:47:50.000Z:

What steps will reproduce the problem?

Start fresh scan with very large data set (2000+ dirs)
Let it progess...

What is the expected output? What do you see instead? Expect: Scan to complete and files generated Actual: The scan starts ok, but seems to slow down. Memory usage seen as increasing in windows task manager. At some point, for me after about 550 dirs scanned, the 'Finished: <film title>' will stop appearing, while the 'Processing new video: <film title>' will continue. Eventually, for me around 700 dirs, the processing will slow to a complete crawl (or might actually be stopping completely). At this point task manager reports memory usage at around 850MB (I have 8GB total so is not running out of memory. Plenty is free). Ctrl-C'ing at this point takes 10-30 seconds to be registered.

What version/revision of YAMJ are you using? On what operating system? Tried using these snapshots from Omertrons site (both fail): yamj-2.8-SNAPSHOT-r3342 yamj-2.8-SNAPSHOT-r3330 Last version used successfully, also from Omertrons (it's been a while): yamj-2.6-SNAPSHOT-r2924

What skin are you using? Default

What device are you using YAMJ on? Networked Media Tank, PlayonHD, Other? Popcorn A-210

Please provide any additional information below. My data set is very large, around 2000+ dirs and 11000+ files. It includes both individual films plus a significant amount of tv series. With the 2.6 version, it would take around 4-5 hours to complete, but with the newest version it simply halts around 'G' for me even on an overnight run.

moviejukebox.properties:

This is the skin property, you will need to change this if you want to change the look of YAMJ

mjb.skin.dir=./skins/default
# Add other properties here
mjb.libraryRoot=library.xml
mjb.playerRootPath=file:///opt/sybhttpd/localhost.drives/SATA_DISC_A1/
mjb.jukeboxRoot=P:/
mjb.forceNFOEncoding=AUTO
poster.scanner.coverArtDirectory=JPG
poster.scanner.SearchPriority.tv=thetvdb,cdon
mjb.filter.genres=true
mjb.newdays.movie=0
mjb.newcount.movie=0
mjb.newdays.tv=0
mjb.newcount.tv=0
mjb.forceIndexOverwrite=true
mjb.forceHTMLOverwrite=true
mjb.recheck.Days=0

Omertron commented 9 years ago

Comment #1 originally posted by Omertron on 2012-12-16T15:50:41.000Z:

Additional information: I've just tried JRE 7 too with the same result (had an old JRE 6 installed before). I'm running Windows 7 - 64 bit. When the problem manifests itself CPU usage starts going up. When the scan starts out, I see CPU usage at around 1-2%, but this goes up, and when it finally stalls it is using a full CPU core 100%.

Omertron commented 9 years ago

Comment #2 originally posted by Omertron on 2012-12-16T21:11:23.000Z:

Do you have any multi-threading properties set? Have you changed the moviejukebox.bat file to increase memory usage? You can try the following command in the properties file to force a better clearing of the memory used by YAMJ: mjb.showMemory=true

Omertron commented 9 years ago

Comment #3 originally posted by Omertron on 2012-12-16T23:02:44.000Z:

Hi Stuart. Thanks for replying :-) I haven't got multi-threading set, and will try a run tonight with more memory params in the .bat file (2gb for both). I've also added the showMemory parameter to the properties file. Will report back tomorrow.

Omertron commented 9 years ago

Comment #4 originally posted by Omertron on 2012-12-17T06:43:37.000Z:

I changed the .bat command to: java -Xms2048m -Xmx2048m -classpath .;resources;lib/* com.moviejukebox.MovieJukebox %* and added the showMemory to the properties file.

The problem is still there, with memory usage topping out at 2GB, probably around 'P'. This is where the 'Finished: ' stops appearing. From that point on, the only output for films are 'Processing new video: '. However, for TV series, the 'Finished' line still occurs.

This goes on up until 'W', where the following java exception is raised 4 times: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Unknown Source) at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source) at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source) at java.lang.AbstractStringBuilder.append(Unknown Source) at java.lang.StringBuffer.append(Unknown Source) at java.io.StringWriter.write(Unknown Source) at com.moviejukebox.tools.WebBrowser.request(WebBrowser.java:139) at com.moviejukebox.tools.WebBrowser.request(WebBrowser.java:85) at com.moviejukebox.plugin.ImdbPlugin.updateImdbMediaInfo(ImdbPlugin.java:194) at com.moviejukebox.plugin.ImdbPlugin.scan(ImdbPlugin.java:137) at com.moviejukebox.plugin.DatabasePluginController.scan(DatabasePluginController.java:100) at com.moviejukebox.MovieJukebox.updateMovieData(MovieJukebox.java:1898) at com.moviejukebox.MovieJukebox$3.call(MovieJukebox.java:917) at com.moviejukebox.MovieJukebox$3.call(MovieJukebox.java:900) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) at com.moviejukebox.tools.ThreadExecutor$ScheduledThread.run(ThreadExecutor.java:95)

After that the output shows a lot of these (around 600+): java.lang.OutOfMemoryError: Java heap space and a few of these sprinkled between them: java.lang.OutOfMemoryError: GC overhead limit exceeded

Then scanning of the files resume, with one of the 'Java heap space' exceptions per movie. When it gets to the indexing phase, the process just hangs there.... Note that with this setup the process doesn't slow to a crawl, and it did complete in around the time I would normally see with v2.6...

Hope this helps you track things down :-)

Omertron commented 9 years ago

Comment #5 originally posted by Omertron on 2012-12-17T08:25:57.000Z:

DO NOT change the first value, you WILL get memory issues that way. Java works with both those values, the Xms value determines how much to "chunk" up the memory allocated with Xmx. A rule of thumb should be to have Xms 1/8th of Xmx. so 256 in your case.

Once java reaches the allocation of 256Mb, it will do some clean-up (which has a processing and performance overhead) and then will allocate the next chunk of memory.

Therefore if you have 2048/2048 it will never clean up the memory before it runs out and hence the "GC overhead limit exceeded".

Omertron commented 9 years ago

Comment #6 originally posted by Omertron on 2012-12-17T11:33:23.000Z:

Okie... I usually work in C/C++ and have limited java experience :-) Since 'java -X' lists Xms as being the starting heap size, I thought it would be safe to up it...

I started another run after the last one, this one with Xmx set to 6144m. This run finished successfully in about the same time as the v2.6 runs (around 4.5 hours). Memory usage started out at around 100mb at start, around 500mb after directory scanning finishes, and ends up at about 4.4gb at the end.

Some of the (maybe) relevant output from the run:

Jukebox output goes to

Scanning directory # 1584, 5440 files scanned Scanning directory # 2438, 12012 files scanned Found 2336 videos in your media library Stored 12015 files in the info cache Searching for information on the video files... ... Writing Indexes XML... Indexing Other (1/7) contains 9 indexes Indexing Genres (2/7) contains 12 indexes Indexing Title (3/7) contains 27 indexes Indexing Certification (4/7) contains 19 indexes Indexing Year (5/7) contains 14 indexes Indexing Library (6/7) contains 2 indexes Indexing Set (7/7) contains 113 indexes Writing Category XML... Indexing Categories... Writing Library data... Writing Indexes HTML... Categories... Copying new files to Jukebox directory... Copying directory Jukebox (22123/22123) Copying skin files to Jukebox directory (Skin is newer)... Copying the html directory... Copying directory pictures (23/23) Copying directory Watched (3/3)

Omertron commented 9 years ago

Comment #7 originally posted by Omertron on 2012-12-17T12:05:48.000Z:

Actually, I'd really like the "jukebox_statistics.xml" file from the jukebox folder.

Also, increase your threading and you will see that processing time drop considerably.

Omertron commented 9 years ago

Comment #8 originally posted by Omertron on 2012-12-17T12:11:46.000Z:

<?xml version="1.0" encoding="UTF-8"?>

2336 2130 206 113 2336 0 64 122 10 1 82 597 2 0 0 0 0 0 0 0 2 2012-12-17 06:49:08 2012-12-17 07:02:11 2012-12-17 10:54:19 2012-12-17 10:54:25 2012-12-17 10:54:38 2012-12-17 11:17:41 2012-12-17 11:19:09 2012-12-17 11:24:01 2012-12-17 11:24:21 04:35:12.0 I know threading would speed things up, but I didn't want to hit IMDB too hard :-p The directory scanning takes a while for me too since I have quite a large chunk of the files on a NAS, and the network connection is a powerplug one, meaning about 190mbps transfer speed. I rarely do full scans, so just leave those running overnight. Update scans doesn't take long, with the majority of the time used by the directory scan. Thanks for the tip though :-)

Omertron commented 9 years ago

Comment #9 originally posted by Omertron on 2012-12-17T13:04:33.000Z:

There's an inbuilt limiter to stop you doing more than (default) 2 requests to IMDb at a time, so I wouldn't worry about that too much. The threading will help with the downloading of images as well as the scanning of directories. Even bumping it to 2 threads will help.

I have mine set to 8 threads with 8 downloaders and no issues with imdb

Omertron commented 9 years ago

Comment #10 originally posted by Omertron on 2012-12-17T14:09:18.000Z:

The downloading of images is not a problem as I by default choose one and store it locally :-p While by and large the automatic selection is good, it just irritated me to see the odd 'wrong' one (for example with Russian text), so I spent a weekend selecting and downloading pictures for everything :-) I'll up it now since you say IMDB doesn't care.

Anyway, back to the problem. I still think you have a resource leak somewhere as even for a large data set like mine 4.5GB seems a bit excessive (especially compared with v2.6)... Without any knowledge of the code, I would also hazard the wild-ass-guess that it is somewhere in the IMDB scraping. I base that on the fact that there doesn't appear to be any problems with TV series, and that information is scraped from TVDB. I may of course be completely wrong :-p

Thanks for your help and thanks for the good work you guys do on YAMJ. It is very much appreciated :-)

Omertron commented 9 years ago

Comment #11 originally posted by Omertron on 2012-12-17T14:50:46.000Z:

Indexing is the memory hog really. The indexes get exponentially larger the more movies you have.

Yes, IMDb scraping is not the most efficient method, it downloads the webpages one at a time and processes them.

If you don't need the extra data that imdb provides, and you have time to test, I'd be interested to see what effect themoviedb plugin has on your memory usage/scan times

Omertron commented 9 years ago

Comment #12 originally posted by Omertron on 2012-12-17T15:51:52.000Z:

What puzzles me is the HUGE difference between 2.6 and 2.8...

Now, I don't know in detail what you've put in there in the two intervening versions, but given that the memory usage seems to be about 2MB per movie on 2.8, where it can't have been more than 1/8 of that on 2.6 given the java executable Xmx settings (if I assume the memory used for dir scan is the same).

2MB per movie just sounds... wrong... If it can be explained with new features, all well and good, but if it can't I think it warrants a bit of looking into.

I'll see if I can get a run in with themoviedb before I head off for x-mas. I'd not use it myself since it has too little information.

YAMJ / yamj-v2

Probable resource leakage leading to slowing, then stopping scan (large data set) #2545

This is the skin property, you will need to change this if you want to change the look of YAMJ