DOI-USGS / ISIS3

Integrated Software for Imagers and Spectrometers v3. ISIS3 is a digital image processing software package to manipulate imagery collected by current and past NASA and International planetary missions.
https://isis.astrogeology.usgs.gov

qmos - can't handle lots of images #1103

Closed ascbot closed 5 years ago

ascbot commented 5 years ago

Author Name: Lynn Weller (Lynn Weller)

Original Assignee: Steven Lambright


I've tried several times to import an image list into the isis3.3.1 version of qmos (as well as the isis3.3.0 version), but after about 50% progress I get I/O errors in my shell saying that qmos is unable to open many of the images. Yet once qmos has finished loading what it can from the list, I can add any of the images that produced I/O errors individually with no problem. Is there some limit on the number of images it can load? There are 23,578 images in the list, and there will be more to load in the future (Messenger is an ongoing mission acquiring tens of thousands of images). How do I work around this in the meantime?

Steps to reproduce:

You will have to be a part of the messenger-df group to be able to read this image list. See Kris Becker if you need to be added.

I typically work via astrovm1, but I also tried this via deet to see if the system might have something to do with it. I got similar results.

I also reduced the number of threads to 4 to see if that would help, but it didn't. Anything else I can try?

ascbot commented 5 years ago

Original Redmine Comment Author Name: Debbie Cook (Debbie Cook) Original Date: 2011-11-16T01:22:39Z


Lynn, I tried to follow the steps you listed, but I was not able to reproduce the problem. I tried running qmos on my own system and on deet. I was running isis3.3.1 and in both systems I tried, the cubes all loaded successfully. I tried zooming in and that was slow, but it worked. I left the threads setting to the default value to use all available. I will come by tomorrow afternoon or Thursday to see if we can track down the problem.

Debbie

ascbot commented 5 years ago

Original Redmine Comment Author Name: Lynn Weller (Lynn Weller) Original Date: 2011-11-16T16:23:07Z


I work via astrovm1 and was sitting on /work/projects/messenger/. Could that possibly have something to do with it? Even when I was on deet, I was still sitting on /work/projects/messenger/ since that is where my file is. I wonder if the tier 2 structure has anything to do with it.

Update: I also tried this from astrovm3 while sitting on /work/users/lweller/ and using the file on shareall and got similar results.

Another note: Kris Becker just tried this on deet and had no problems, except that the program became extremely unresponsive once it had loaded all of the images. Afterwards I tried deet again (from /home/lweller/) and got I/O errors. Could the programmers have settings that others don't that could be causing this?

ascbot commented 5 years ago

Original Redmine Comment Author Name: Lynn Weller (Lynn Weller) Original Date: 2011-11-16T22:31:18Z


Still trying to figure out what the problem is. There is a qmos/ directory in my home area that I moved to qmos_sav/ so that qmos could recreate config files that were there. This also resulted in I/O errors, and in some cases qmos complained about footprints not being on images (though a quick view of the label showed there was a polygon group) and also complaints about Mapping Group not being available which didn't make sense (especially when images could be loaded in qview and lat/lon info read off of them).

I also moved my IsisPreference file so the system default would be used, in case there was something going on with that - same errors. At a loss.

My latest test involved opening a list with only 9000 images (a subset of the one on shareall), and I still got errors. I also tried fewer than 9000 (6000 and fewer) and there were no complaints. Could it be a communication problem between the program and the disk?

ascbot commented 5 years ago

Original Redmine Comment Author Name: Lynn Weller (Lynn Weller) Original Date: 2011-11-16T22:40:06Z


FYI, I have a saved qmos project on /usgs/shareall/lweller/Q_095to228_ControlVsUncontrol.mos that has over 10,000 images in it and it opens with no errors. That was created a month or so ago.

ascbot commented 5 years ago

Original Redmine Comment Author Name: Kris Becker (Kris Becker) Original Date: 2011-11-17T08:06:31Z


Lynn, Jeff and I found what appears to be the problem. The maximum number of open files allowed by the system is being exceeded. This limit appears to be set to 1024.

I think if you run the "limit" command, the "descriptors 1024" line indicates this limit. A more definitive source is the /proc filesystem (Linux). The /proc directory for your process (the $$ shell variable contains your process ID) contains a file called "limits". To inspect this file (and other interesting information), type "more /proc/$$/limits". This will show, for example:

deet[27]: more /proc/$$/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            10485760     unlimited    bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             128000       128000       processes
Max open files            1024         1024         files
Max locked memory         32768        32768        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       128000       128000       signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
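
The "Max open files" row can also be read programmatically. The following is a minimal sketch in Python (not part of ISIS) that uses the standard resource module to query the same soft and hard limits for the current process. The function names open_file_limits and describe are made up for illustration.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file-descriptor limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(value):
    """Render a limit value the way /proc/<pid>/limits does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


soft, hard = open_file_limits()
print(f"Max open files: soft={describe(soft)} hard={describe(hard)}")
```

On a system configured like the deet example above, this would be expected to report 1024 for both values, though the actual numbers depend on the machine and the shell that launched the process.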

We then ran qmos and monitored the number of open files as the cube list was loaded, by executing the command "lsof -u kbecker | grep qmos | wc -l" repeatedly and rapidly. The count fluctuates wildly as qmos progresses, showing 100 to 750 open files while processing. Towards the end of the cube load it rather suddenly jumps past 1024 and I/O errors start appearing in the terminal window.
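
A similar count can be taken without lsof by listing the /proc/<pid>/fd directory on Linux. The sketch below is an assumption-laden alternative, not the method used in this ticket. The function count_open_files and its suffix parameter are hypothetical names, and it counts file descriptors only, whereas lsof also lists memory-mapped files such as shared libraries, so the two numbers will not match exactly.

```python
import os


def count_open_files(pid="self", suffix=""):
    """Count open file descriptors of a process via /proc (Linux only).

    pid    -- process ID, or "self" for the calling process
    suffix -- if given, count only descriptors whose target path ends with it
              (for example ".cub" to count cubes only)
    """
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listing and inspection.
            continue
        if target.endswith(suffix):
            count += 1
    return count


print("open descriptors in this process:", count_open_files())
```

To watch another process, pass its process ID, for example count_open_files(pid=12345, suffix=".cub"). Reading another user's /proc/<pid>/fd entries normally requires the same user or root.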

See http://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux for a discussion on how to raise this limit. NOTE I am not advocating this as a solution - only for informational purposes! However, this could be considered as a temporary solution until it can be fixed (Steven is not back till next week).
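
For information only, in the same spirit as the link above: a process can raise its own soft limit up to (but not beyond) the hard limit without administrator help. This is a minimal Python sketch using the standard resource module. The function name raise_soft_open_file_limit is hypothetical, and raising the hard limit itself still requires a system-level change by IT.

```python
import resource


def raise_soft_open_file_limit(wanted):
    """Raise the soft open-file limit toward `wanted`, capped at the hard limit.

    Never lowers the limit. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Cap the request at the hard limit unless the hard limit is unlimited.
    ceiling = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft != resource.RLIM_INFINITY and ceiling > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (ceiling, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

On a machine where soft and hard are both 1024, as in the output above, this call changes nothing, which is why the hard limit had to be changed at the system level.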

The other thing we are seeing when this occurs appears to be a race condition. The GUI remains mostly unresponsive when this happens. Looking at the CPU usage, it is really high - greater than 200% or so at times - when all the cubes are loaded and the errors stop. Perhaps this is related to mishandling of exceptions thrown in the threads but that is just a guess. This should be investigated as well.

ascbot commented 5 years ago

Original Redmine Comment Author Name: Lynn Weller (Lynn Weller) Original Date: 2011-11-17T18:11:50Z


IT increased the number of max open files for astrovm1 to 4096 making it possible for qmos to load all of my 20,000+ images without error. While qmos was loading images I repeatedly ran the lsof command to count the number of qmos files open and saw it max at 2201 (though it's possible it could have been higher), and it was not uncommon to see 1224 files open at any point. Also, there are a number of files that seem to remain open even after qmos is done loading the list. Even 5-10 minutes later there are still 98 open images for some reason. As for CPU usage, it seems that value has come down to practically 0% once qmos is done loading images, unlike the 200+% seen when only 1024 files at a time were allowed. I'm not sure I understand why that is.

One more thing, I just checked the number of open files again for qmos after it finished loading (it's been 10-15 minutes now) and the count is still at 98, but when I drop the wc -l command to see the files, they are not cubes. I see mostly "mem" and library descriptions. Not really sure if this is expected or not, but thought I would point it out.

I reran everything and counted *cub files only while running lsof and saw a max of 2114 - still a bunch.

ascbot commented 5 years ago

Original Redmine Comment Author Name: Steven Lambright (Steven Lambright) Original Date: 2011-11-21T18:07:49Z


qmos opens files in a very multi-threaded way. It accumulates a list of filenames for the other threads to open. When it reports about 500 are done opening, it asks the other threads nicely to not do any new files. This resulted in roughly 600 max open cubes that I ran into, tested, and expected it to obey. Apparently I was being too nice if this many files are opening - so I'm going to put a hard limit in and see if that fixes the problem.
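
To illustrate the difference between asking threads nicely and enforcing a hard limit, here is a minimal Python sketch. It is not the qmos implementation (qmos is C++/Qt, and the actual fix is not shown in this ticket). It assumes a counting semaphore is an acceptable way to cap concurrently open files. The names OpenFileGate and load_all are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class OpenFileGate:
    """Hard cap on how many files worker threads may hold open at once."""

    def __init__(self, max_open):
        self._slots = threading.BoundedSemaphore(max_open)
        self._lock = threading.Lock()
        self.current = 0  # files open right now
        self.peak = 0     # highest number of files open at the same time

    def __enter__(self):
        # Blocks until a slot is free, so the cap can never be exceeded.
        self._slots.acquire()
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.current -= 1
        self._slots.release()
        return False  # do not swallow exceptions raised while loading


def load_all(paths, gate, workers=8):
    """Open every path on a thread pool, never exceeding the gate's cap."""

    def load_one(path):
        with gate:
            with open(path, "rb") as handle:
                return len(handle.read(16))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_one, paths))
```

With a gate such as OpenFileGate(max_open=500), a worker that wants to open a file while 500 are already open waits for one to close, instead of proceeding and pushing the process past the system limit.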

ascbot commented 5 years ago

Original Redmine Comment Author Name: Steven Lambright (Steven Lambright) Original Date: 2011-11-22T22:45:27Z


A fix for this should be available at: /work/users/slambright/prog/isis3/issue577_trunk/isis

If this works, we should talk about the solution some more to get some details nailed down.

ascbot commented 5 years ago

Original Redmine Comment Author Name: Lynn Weller (Lynn Weller) Original Date: 2011-11-23T00:09:28Z


This worked. I ran it on astrovm3 where the "Max open files" is 1024 (they quadrupled that for me on astrovm1 as a work around). Stop by on Wed. for further discussion. Thanks!

ascbot commented 5 years ago

Original Redmine Comment Author Name: Steven Lambright (Steven Lambright) Original Date: 2011-12-05T17:35:29Z


So here's what I wanted to talk about: I fixed it for your case, but I didn't fix it for every case. If you have detached labels, have not run footprintinit, or have old spice data (there are probably other reasons I'm not thinking of...) then opening a cube may entail opening many files instead of just one. I've added a setting that will decrease the number of open cubes to a far lower number which should work for every case.
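
The reasoning behind a lower cap can be sketched as simple arithmetic: if each cube may open several files, the number of cubes allowed open at once has to shrink accordingly. The Python function below is only an illustration. The name max_open_cubes, the reserved default of 100 descriptors, and the files-per-cube figures are assumptions, not values taken from the qmos setting.

```python
def max_open_cubes(fd_limit, files_per_cube, reserved=100):
    """Estimate how many cubes can be open at once under a descriptor limit.

    fd_limit       -- the process's "Max open files" soft limit
    files_per_cube -- worst-case files opened per cube (assumed figure)
    reserved       -- descriptors kept back for libraries, GUI, etc. (assumed)
    """
    if files_per_cube < 1:
        raise ValueError("files_per_cube must be at least 1")
    usable = max(fd_limit - reserved, 0)
    return usable // files_per_cube


print(max_open_cubes(1024, 1))  # one file per cube
print(max_open_cubes(1024, 4))  # e.g. detached labels plus extra files
```

Under these assumed numbers, a limit of 1024 leaves room for 924 cubes at one file each, but only 231 if each cube needs four files.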

The updated version with this setting is now available (in the same place).

ascbot commented 5 years ago

Original Redmine Comment Author Name: Lynn Weller (Lynn Weller) Original Date: 2011-12-05T20:59:56Z


Well I tested it again, but the same way as before - on my big set that has footprints and attached labels. I don't have these other situations you mentioned readily available to test. Everything works the same as before - better.

ascbot commented 5 years ago

Original Redmine Comment Author Name: Steven Lambright (Steven Lambright) Original Date: 2011-12-05T21:08:31Z


Please close the ticket so that I can put this into the beta build.

Thank you.

ascbot commented 5 years ago

Original Redmine Comment Author Name: Lynn Weller (Lynn Weller) Original Date: 2011-12-05T21:16:12Z


ok. But will I have to test it under Beta again? Then what about production when it ultimately goes there. I think this is the only reservation I have about this new process - the user is burdened with testing multiple times.

ascbot commented 5 years ago

Original Redmine Comment Author Name: Steven Lambright (Steven Lambright) Original Date: 2011-12-05T21:25:20Z


No you won't. Stuart is still going through the process of explaining everything, and things completing today is not in his best interest.

Basically you ONLY test before we put it in the system. We test it from there on.

ascbot commented 5 years ago

Original Redmine Comment Author Name: Lynn Weller (Lynn Weller) Original Date: 2011-12-12T17:03:36Z


So I will keep my fingers crossed and look for these changes under production today?

ascbot commented 5 years ago

Original Redmine Comment Author Name: Steven Lambright (Steven Lambright) Original Date: 2011-12-12T18:17:37Z


It will make it into beta tomorrow (Tuesday) and production the day after (Wednesday). If it's an emergency, I can put the fix into production sooner (as early as today), but this erodes our ability to provide a stable Isis version to you (I would be much more likely to break production than if I followed the process). If you need this immediately, I believe the person to contact is Debbie.