jorichie closed this issue 1 year ago.
The network is likely too large/connected to solve with the available memory. What system did you run this on?
@blandoplanet Is this a products concern? Seems like it to me.
Yes. The Products tag is appropriate.
> The network is likely too large/connected to solve with the available memory. What system did you run this on?

As per my conversation with Jesse, astrovm4 and astrovm2. Also noted at the beginning of the report.
I think it was a mistake that this post was closed via the last reply. Re-opened.
This may become more information than needed (so do not feel the need to read everything), but I would like to document for any future related efforts.
Final description of Problem:
The initial error is caused by a memory issue in the call cholmod_analyze(m_cholmodNormal, &m_cholmodCommon)
on line 1556 of BundleAdjust.cpp, where m_cholmodNormal is the reduced camera normal matrix used in the bundle solution and m_cholmodCommon carries information needed repeatedly by cholmod functions.
cholmod_analyze uses various methods to order the passed-in matrix for easy factorization and, if it fails, returns null. The function can fail if the initial memory allocation fails, if all ordering methods fail, or if the supernodal analysis (if requested) fails. Based on the ‘problem too large’ error, I am guessing the memory allocation is failing (https://github.com/PetterS/SuiteSparse/blob/master/CHOLMOD/Cholesky/cholmod_analyze.c).
I checked which methods were being used by cholmod_analyze by running a successful bundle, printing the m_cholmodCommon variable, and looking at its ‘nmethods’ field. This indicated that AMD or COLAMD is used to order m_cholmodNormal, depending on whether the matrix passed in is symmetric or non-symmetric. If a symmetric matrix is passed to cholmod_analyze, only its upper or lower triangle is accessed and ordered, which saves memory. If a non-symmetric matrix A is passed in, the whole matrix is brought in and AA’ is ordered, effectively using twice the memory.
To evaluate the symmetry of m_cholmodNormal I went through the columns and checked whether, for every column, the largest stored row index equaled the current column index (upper triangular) or the smallest stored row index did (lower triangular). In this way I evaluated whether the sparsely stored m_cholmodNormal was an upper or lower triangular matrix.
In BundleAdjust::loadCholmodTriplet():
bool ut = true;
bool lt = true;
for (int columnIndex = 0; columnIndex < m_sparseNormals.size(); columnIndex++) {
  int lastKey = m_sparseNormals[columnIndex]->lastKey();
  int firstKey = m_sparseNormals[columnIndex]->firstKey();
  ut = ut && (columnIndex == lastKey);
  lt = lt && (columnIndex == firstKey);
}
std::cout << "Upper Triangular: " << ut << std::endl;
std::cout << "Lower Triangular: " << lt << std::endl;
Using this method I confirmed that the network passed in during the failure (/scratch/jrichie/STechnique/SouthPole_2017Merged_Lidar2Image1_cnetedit.net) is stored as a sparse upper triangular matrix, meaning it should be treated as a symmetric matrix and the less memory-intensive AMD ordering method would be used. However, a larger network of the same LROC data is able to run through the same process without failure, so I am genuinely at a loss as to why this particular network is failing.
My only theory is that cholmod_analyze is taking in m_cholmodNormal as a non-symmetric matrix for this network, therefore requiring twice the memory typically needed. That would require my upper-triangular analysis to somehow be incorrect, but it is the only thing I can think of that would cause cholmod_analyze to act differently (especially in terms of memory) for two very similarly sized networks.
What Else Was Tried: Since it was a memory issue, I first attempted to max out the memory allocation request on the big-mem queue, using the same jigsaw call as in the ticket. This allowed for a maximum of 375 GB of memory, but the problem only used 26 GB and resulted in the same failure.
Next I tried running the jigsaw with fewer parameters by stepping down CAMSOLVE to velocities and keeping everything else consistent. This successfully ran, using 45 GB of memory. Since this bundle (with fewer parameters) used more memory than the bundle that fails because the ‘problem is too large’, it is very likely cholmod_analyze is failing during the memory allocation step: there is not enough memory for every allocation, so the function fails, but the memory is never actually used.
I wanted to double check that the error had to do with the SIZE of the bundle and not the fact that acceleration was one of the solve parameters, so I reran the bundle with the same number of parameters (12 total) but no acceleration solves. This was done by switching CAMSOLVE from accelerations (9 parameters) to velocities (6 parameters) and switching SPSOLVE from positions (3 parameters) to velocities (6 parameters). This bundle failed with the same ‘problem too large’ error. Therefore, there is not necessarily an error with how jigsaw handles the acceleration parameter, and it is indeed a size issue.
We then thought that perhaps the connectivity of the matrix was causing the reduced normal camera matrix (m_cholmodNormal) to have enough off diagonal elements to make it significantly larger than what jigsaw could handle. I began creating a memory calculator for the various bundle matrices (located /home/ladoramkershner/projects/notebooks/JigsawSizeCheck_Prototype.ipynb; it is rough and needs to be reconfigured to account for the sparse storage of some of the matrices).
I was pointed at a network created in previous years that bundled (/work/projects/laser_a_work/lweller/SPoleNet/2018MayJune_Network/SouthPole_2017Merged_SP_and_Lidar2Image3.net; new_final_jig_00to350.lis). I reran that bundle with the same parameters as the one in this ticket and it was successful. Then I compared the number of graph nodes and edges in each network. In a graph diagram, nodes represent images and edges represent a shared point between images in a pair-wise fashion. Therefore, edges are a good way to evaluate the connectivity and number of off-diagonal elements in the reduced normal camera matrix for a network.
| Ticket Network | Archived Network |
|---|---|
| vertices: 18675 | vertices: 18929 |
| edges: 1034327 | edges: 1165021 |
| npoints: 1425791 | npoints: 1649017 |
| nmeas: 9469089 | nmeas: 13752809 |
The archived network has more images, edges, points, and measures, so connectivity could not explain the memory issue. To verify what cholmod_analyze was seeing, I printed out the size and non-zero elements of m_cholmodNormal:
| Ticket Network | Archived Network |
|---|---|
| m_cholmodNormal nrow: 224100 | m_cholmodNormal nrow: 227148 |
| m_cholmodNormal ncol: 224100 | m_cholmodNormal ncol: 227148 |
| m_cholmodNormal nzmax: 150399738 | m_cholmodNormal nzmax: 169239486 |
| m_cholmodNormal xtype: 1 | m_cholmodNormal xtype: 1 |
Again, the archived network has slightly more elements and therefore would require more memory to solve. This leads me to believe the ticket network is not just barely exceeding the memory requirements.
Thanks Lauren! Do you think we should see if Ken Edmundson has any ideas on how to resolve this?
Working directly with Ken has some ethics issues around how people can work with Astro after they leave. I'm also not sure if he'd be able to work on this without access to the cluster and scratch.
Where should we go from here?
The message that is output by this error is not descriptive enough to be helpful, and I am still not sure why jigsaw is erroring. Jigsaw operates as expected on a network of the same size using the same amount of memory. So I am not sure if this is a bug, but it does concern me that we cannot isolate the difference between the handling of the two networks tested.
@blandoplanet What is the status on this issue? I believe we should close based on email conversation, but I do not want to close prematurely! Either way, I believe this is off the developers' plate?
Jay,
I wish that we could investigate further into why the SP network will not solve for acceleration rather than close the post, but it is not up to me. Whereas NP is similar, SP has some numbers that are greater, and I suspect there is a parameter limiting the effort. The software reports that the bundle is too large, which I believe has merit. Below is a comparison of the numbers for SP versus NP (the SP values were highlighted in red in the original email). Of course I originally opened the post, but Brent Archinal and Mike Bland make such decisions as to what to do next.
Images: 9687 18673
Points: 405532 1425784
Total Measures: 3128764 9472039
Total Observations: 6257528 18944078
Good Observations: 6257528 18944078
Rejected Observations: 0 0
Constrained Point Parameters: 438144 1438252
Constrained Image Parameters: 116244 168057
Unknowns: 1332840 444540
Degrees of Freedom: 5479076 16104978
Convergence Criteria: 1e-05(Sigma0) 1e-05(Sigma0)
Iterations: 4 5
-Janet
From: jlaura notifications@github.com Sent: Monday, August 17, 2020 11:07 AM To: USGS-Astrogeology/ISIS3 ISIS3@noreply.github.com Cc: Richie, Janet O jrichie@usgs.gov; State change state_change@noreply.github.com Subject: [EXTERNAL] Re: [USGS-Astrogeology/ISIS3] LROC SouthPole network cannot solve for acceleration in jigsaw (#3871)
@jorichie Thanks for the post! I agree 100% that finding out what is going on has value. Right now, though, the development team has put two weeks of debugging effort into this. Check out the lengthy report post from @ladoramkershner above, where she concludes:
> Again, the archived network has slightly more elements and therefore would require more memory to solve. This leads me to believe it is not just barely exceeding the memory requirements.
At this point, I do not believe that the development team has other avenues to explore on this as the problem is not constrained well enough to have us aim in any particular direction. We have an internal email chain (@blandoplanet, @ladoramkershner, and Brent) discussing some other options that do not include devs.
Some additional information from @ladoramkershner about a possible next step in the future...
"the next task would be to install a custom build of cholmod and extracting more specific information from choldmod_analyze (the function that is failing). During my troubleshooting I was checking things upstream of the command and cross-referencing the documentation to predict which methods choldmod_analyze would use. However, more information may come of actually printing variables and status from inside the actual function. "
@ladoramkershner, @jorichie, and I came back to this and did some more work last week; here's what we did and found.
We tested running the bundle with some slightly different parameters.
First, we ran the bundle with rectangular coordinates. The hope was that this would help eliminate any errors from longitude domain or the pole. It would also slightly change the math because we would be solving for different ground point coordinates. One change that was required for this was converting the ground point sigmas from lat/lon/rad to x/y/z. In the latitudinal bundle only the radius was constrained, so we decided to constrain the Z point by that much. The data set is close to the pole so, the vast majority of radius variation would be in the Z direction. This test still failed in the same place with the same error.
Next, we ran the bundle without the overhermite setting enabled. This could help check for errors in the polynomial setup portion of the bundle. Unfortunately, this also failed in the same way.
We wanted to see if a subregion of the network would successfully bundle with acceleration. This could help us narrow down whether there are any specific images, points, or measures causing this error.
We identified regions that contained issues in the mosaic when solving for velocities and then extracted those subregions of the network. Unfortunately, extracting the subregions compromised the integrity of the network and additional work would have been required to make the subregions bundle by themselves.
We ultimately decided that attempting to make various subregions bundle by themselves would take too long and that the potential value was not worth it.
It is possible that the normal matrix is ill-conditioned and CHOLMOD could be running into issues when it tries to analyze it. Computing the normal matrix requires computing partial derivatives and these can sometimes run into discontinuities resulting in extremely large numbers. When this happens, the resulting normal matrix could have extremely large values in some places that will result in a failure to solve the iterations.
To check for this, we inserted a small bit of code to compute some statistics on the non-zero values in the normal matrix. We then ran the debug code on the network described in this issue (the Active Net) and on an older network of LRO NAC images of the North Pole (the Archived Net), which is able to bundle with accelerations. Here are the results:
| Stat | Archived Net | Active Net | Difference |
|---|---|---|---|
| Minimum | -15495184677439.5 | -4855168835499.75 | 10640015841939.75 |
| Maximum | 37370058443140.8 | 16392435737669.1 | 20977622705471.695 |
| Average | 445321714.549715 | 307498754.542037 | 137822960.00767797 |
| Standard Deviation | 56499889942.841 | 38040249205.63005 | 18459640737.210503 |
| Non-zero Elements | 169239486 | 150399738 | 18839748 |
None of the values stand out as too large to work with.
After some discussion it was found that a previous version of the network could be bundle adjusted solving for acceleration on old hardware at the ASC, but when we moved to new hardware, it could only solve for velocities. This could help narrow down any changes in the network or code that caused this.
We looked for old processing and log files to determine exactly which version of ISIS and network were used on the old hardware and which were used on the new hardware.
We found a log that successfully solved for acceleration. Here is the version info:
IsisVersion = "3.5.00.7260 beta | 2016-01-25"
ProgramVersion = 2014-02-13
Here is the network:
CNET = SouthPole_2017Merged_SP_and_Lidar2Image2_cnetedit.net
The print file can be found at /scratch/jrichie/SOUTHPOLE.old/NEW/print.prt
We could not find a log file from just after the transition that attempted to solve for acceleration on new hardware. The closest we found was a successful solve for velocities on the new hardware. Here is the version info:
IsisVersion = "3.5.2.8306 beta | 2017-11-04"
ProgramVersion = 2017-08-09
Here is the network:
CNET = SouthPole_5test_velocity_not_updated_cnetedit.net
The print file can be found at /usgs/shareall/FOR_SP_ISIStest/onelasttest.prt
This gives us a rough bound between February 2014 and August 2017, ISIS 3.5.0 to ISIS 3.5.2.
We still have access to ISIS 3.5.0 and later on hardware at the ASC, so we decided to test and see if we could narrow this down further. We attempted to run the bundle solving for acceleration under versions 3.5.0, 3.5.1, 3.5.2, and 3.6.0. Unfortunately, for 3.5.0 and 3.5.1 we ran into an error:
**ERROR** Unable to create camera for cube file /work/users/elee/jrichie/LROC_UPDATED_LEVELs/M111241245RE.lev1.cub in ControlNet.cpp at 1617.
**ERROR** Unable to initialize camera model from group [Instrument] in CameraFactory.cpp at 97.
**I/O ERROR** Unable to open [/work/users/elee/jrichie/LROC_UPDATED_LEVELs/M111241245RE.lev1.cub] in Blob.cpp at 278.
There is some sort of issue reading the Table blobs that contain the SPICE data. We may be able to work around this by re-running spiceinit on the images using the version of ISIS we plan to bundle with. Unfortunately, we ran out of time and also ran into some issues with our processing cluster that need to be resolved before this can continue.
We also looked at all of the changes to jigsaw between 3.5.0 and 3.5.2. Here are the changes to the bundle adjust during that period, the changes that we think could impact this issue are in bold:
The most promising lead is narrowing down when this worked and when this stopped working. Checking each ISIS version is a good idea but will require duplicating the data and then re-processing. Investigating suspected code changes will require careful examination of the code at the time they were made and going over the execution path.
Compiling CHOLMOD with DEBUG flags enabled is also still an option.
Thank you for your contribution!
Unfortunately, this issue hasn't received much attention lately, so it is labeled as 'stale.'
If no additional action is taken, this issue will be automatically closed in 180 days.
Still waiting for a good test case here
Unresolved, keep open.
Jigsaw not only fails to solve for acceleration but now won't solve for velocity. Solves for camera angles and position okay. Still getting the same cholmod error as when I opened the post:
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job45368968/slurm_script: line 39: 29769 Segmentation fault (core dumped)
This network was the most recent network successfully used in jigsaw to solve for velocity before it failed. /usgs/shareall/FOR_ISIS/SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis /usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo12.net
This network follows the successful run, but jigsaw failed to solve for velocity.
/usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net
new_SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis (updated)
Note: Only points and measures had been added to the network. (A miscommunication happened here.)
@jorichie Did the network that failed use the same image list (SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis)?
Hi @jorichie, what was the error message for the failure when you ran jigsaw against /usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net?
I ran cnetstats for each of the networks mentioned in your last post using the only image list you mentioned (SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis) and saw that SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net has 3 images in the list that are NOT in the network, and that the newer network has fewer points and measures than redo12 (rather than added points and measures, as your last sentence suggested). Maybe you used a different image list for redo13?
I'm just wondering if jigsaw was complaining about passing it a list where some of the images were not in the input network. I also noticed the redo13 network no longer has any of the lidar2image ground control points. That would not necessarily cause an error with jigsaw, but I wondered if maybe you meant to run jigsaw on a different network than the one that failed.
Both networks have many images having only 2 or 3 points which might not be helping the solution to be stable. I would consider removing those. I also recommend running cnetedit to remove ignored points and measures from the network before running jigsaw if you are not keeping those around for some reason. Jigsaw will run fine with them, but a clean network might make it easier to track problems.
I'm not suggesting any of the above will fix the problem you originally posted about, but you might be able to solve for camera velocities for redo13 with an updated list.
It should be. I loaded it in qnet and it didn't report missing any files.
qnet will not complain unless there are images in the network that are not in the list. If there are extra images in the list that's ok. cnetstats will show images having 0 points and cnetcheck will make a NoControl list if it encounters images that are in the list but not in the network. jigsaw will definitely complain if there is any sort of mismatch between the input list and network.
Just to be clear, is the network named above the first version that failed, or is it the most recent version of the network? I just want to make sure we all understand what we are looking at. Thanks!
This one fails to solve for velocity. /usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net
I should mention these are old files from around May/June 2021, when jigsaw first failed to solve for velocity. These files are iterations 12 and 13. Our current network is iteration 60. I thought the programmers would want a copy of a network from before it failed and then a copy of the one nearest to when it failed, for easier tracking.
If this isn't what they want and they want a clean network, then I am still working on it.
FYI: I updated the list. Next I ran jigsaw and here is the result:
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job45401251/slurm_script: line 38: 29172 Segmentation fault (core dumped)
jigsaw fromlist= new_SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis cnet= new_SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net onet= new2_SouthPole_2020_Merged_Lidar2Image_redo13edit13_angles_outlier.net update=no sigma0=1.0e-5 maxits=6 errorpropagation=no radius=yes camsolve=velocities twist=yes overexisting=yes outlier_rejection=no spsolve=position overhermite=yes camera_angles_sigma=1.0 camera_angular_velocity_sigma=0.5 camera_angular_acceleration_sigma=0.25 spacecraft_position_sigma=250 point_radius_sigma=150 file_prefix=SouthPole_2020_MergedLidar2Imageredo13
Thank you for suggestions.
Thanks for clearing that up Janet. You are correct. I think the developers would like a before and after version of the network to work with as you have indicated.
Yes, we would like the last successful run and the first failed run using the same set of parameters; that gives us the smallest set of changes to the networks.
A little more clarification:
/usgs/shareall/FOR_ISIS/SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis /usgs/shareall/FOR_ISIS/SouthPole_2020_Merged_Lidar2Image_redo12.net
Does this network successfully solve for velocity when it is run right now in ISIS6.0.0?
Can you also provide the parameters you used to solve for velocities so we can reproduce this? We have the attempted acceleration solve parameters, but not the velocity solve parameters.
Yes, the network listed solves for velocity. I have not run jigsaw in the current ISIS version. Do you want me to do that first? I have consistently used ISIS3.10.2.
I use Lynn's bash script to run jigsaw. I modify the parameters. See it here: /usgs/shareall/FOR_ISIS/sbatch3_jigsaw.bsh
Note: I run jobs almost always on astrovm5.
Can you test solving for velocity with iteration 12, the last one to solve successfully with velocity, and iteration 13, the first one to fail with velocity, using ISIS6.0.0?
The jobs might be launched from astrovm5, but they are being run on the cluster.
Synopsis: Both iterations, 12 and 13, failed using ISIS6.0.0.
Note: After I ran the requested tests using ISIS6.0.0, I reset the ISIS version back to ISIS3.10.2 and reran 12: it solved for velocity as expected. All work here was launched from astrovm5.
Here is the output from the failed iterations.
Under ISIS6.0.0, iteration 13:
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job45401521/slurm_script: line 38: 23098 Segmentation fault (core dumped) jigsaw fromlist= new_SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis cnet= new_SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net onet= new2_SouthPole_2020_Merged_Lidar2Image_redo13edit13_angles_outlier.net update=no sigma0=1.0e-5 maxits=6 errorpropagation=no radius=yes camsolve=velocities twist=yes overexisting=yes outlier_rejection=no spsolve=position overhermite=yes camera_angles_sigma=1.0 camera_angular_velocity_sigma=0.5 camera_angular_acceleration_sigma=0.25 spacecraft_position_sigma=250 point_radius_sigma=150 file_prefix=SouthPole_2020_Merged_Lidar2Imageredo13
Time to completion: Started: 2022-03-30T13:01:40 Finished: 2022-03-30T13:39:28
The redo12 jigsaw error via 6.0.0 seems like the problem I have recently encountered while trying to run things on the cluster and solving for spacecraft position - post #4770.
You could try running directly on astrovm4, which should have enough memory. astrovm5 never seems to have memory available: running "free -h" on astrovm5 shows it only has 12G available, which is not enough for this network. You need about 35G or so of memory to run without swapping.
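As a pre-flight sanity check, the available-memory figure that "free -h" reports can also be read programmatically from /proc/meminfo before launching a big bundle. A minimal sketch (the function names are hypothetical, and the 35 GiB threshold is just the estimate quoted above):

```python
def parse_mem_available(meminfo_text):
    """Extract MemAvailable (in GiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # /proc/meminfo reports values in kB
            return kib / (1024 * 1024)
    raise ValueError("MemAvailable not found")

def enough_memory_for_bundle(meminfo_text, required_gib=35):
    """Rough check against the ~35 GiB estimate for this network."""
    return parse_mem_available(meminfo_text) >= required_gib

if __name__ == "__main__":
    # On Linux you would read the real file; here is a snapshot resembling
    # astrovm5's state above (values are illustrative, not measured):
    sample = "MemTotal:       65011712 kB\nMemAvailable:   12582912 kB\n"
    print(enough_memory_for_bundle(sample))  # ~12 GiB available < 35 GiB needed
```

On a live system you would pass in `open("/proc/meminfo").read()`; the check refuses to launch when the run would be forced into swap.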
A couple of notes that may or may not be helpful for additional work on this problem. I spent some time trying to understand what works and what doesn't when running jigsaw. This is a naive, empirical approach and I'm not sure it led to any new insight, but here you go.
For each run of jigsaw I used the network and list provided above:
Network: new_SouthPole_2020_Merged_Lidar2Image_redo13edit10_angles_outlier.net
List: new_SouthPole_2017Merged_SP_and_Lidar2Image4_updated_image.lis
I ran under ISIS6.0.0, and I ran everything on astrovm5 so that I could watch it with top. I only ran a single iteration. All of the runs can be found in /scratch/mbland/LROC_SP/jigsaw-tests/new_version_Mar30_2022/
Summary:
The failures are consistent with previous results - it seems to be the number of parameters solved for, rather than solving for velocity itself, that causes the solution to fail (i.e., you CAN solve for velocity but not velocity + spacecraft).
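A back-of-envelope calculation supports the parameter-count observation: each added solve option multiplies the per-image parameter count, and the reduced normal matrix grows with the square of the total. This is a hypothetical sketch - the per-angle coefficient counts are assumed, not taken from the jigsaw source, and CHOLMOD stores a sparse factor rather than a dense matrix, so these are upper bounds, not predictions:

```python
def camera_params_per_image(camsolve, twist=True, spsolve_position=True):
    """Rough per-image parameter count for a jigsaw run.
    Assumed mapping: 1 coefficient per angle for angles, 2 for
    velocities, 3 for accelerations (an assumption for illustration)."""
    coeffs = {"angles": 1, "velocities": 2, "accelerations": 3}[camsolve]
    n_angles = 3 if twist else 2      # RA, DEC, and optionally TWIST
    n = coeffs * n_angles             # pointing parameters
    if spsolve_position:
        n += 3                        # X, Y, Z spacecraft position
    return n

def dense_upper_bound_gib(n_images, camsolve):
    """Worst-case (fully dense, double-precision) size of the reduced
    camera normal matrix; the sparse factor is smaller but scales the
    same way with parameter count."""
    n = n_images * camera_params_per_image(camsolve)
    return n * n * 8 / 1024**3

if __name__ == "__main__":
    # Illustrative image count only; the real network size is not stated here.
    for solve in ("angles", "velocities", "accelerations"):
        print(solve, round(dense_upper_bound_gib(10000, solve), 1), "GiB")
```

The point of the sketch is the scaling: going from velocities to accelerations raises the per-image count by a third, but the (dense upper-bound) matrix footprint by roughly 78%, which matches the observed pattern of velocity succeeding while velocity + spacecraft or acceleration fails.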
One last note - I didn't dig into the network too far, but there are a lot of images with a VERY small convex hull (from cnetcheck). This suggests there are only one or two points on those images. Some of those images were very noisy. We may need to be more selective about what images are included. I don't see why that would cause jigsaw to fail, though. There are always "bad" points in a network this size.
there are a lot of images with VERY small convex hull (from cnetcheck). This suggests there is only one or two points on those images.
cnetcheck will list images with convex hull < 1.0 by default (lowcoverage=true tolerance=1.0), which means it will list all images. cnetstats, on the other hand, will list the convex hull for every image, but that information is only available when create_image_stats=true create_stats_file=output.csv. The convex hull information is in column 9.
Convex hull is a measure of how spread apart the points are on an image, not of the number of points on an image. You could actually have a high convex hull with only four points, one at each corner of the image. I like to look at the combination of the total number of points and the convex hull. Any image with a low number of points and/or a low convex hull gets my attention.
I noticed the very small convex hull values and the high number of images with only a few points (2-3) when I ran cnetstats on both redo12 and redo13.
Edit: I never bothered to look before... but cnetcheck does provide the actual convex hull info per image in the LowCoverage.txt output file. It does not, however, include the total number of points per image, which is also useful to know.
Redo 12 and 13 are not a good use case for evaluating the current network because those are interim networks from May/June, best used by the developers to assess the velocity issue. Much progress has been made on the networks; for example, we have had to leave in images that have only a few points because they are the only images we have to link to other images in the network, and many others have been removed. The current network is 60 and can be found on /scratch/jrichie/S-Technique.
We can no longer load the SouthPole network on /scratch. It starts to load, but then it throws an error (killedprocess) and dies. We get the same error whether we are working from home or the office. We need to figure out whether this is an IT problem or an ISIS problem and would appreciate your help. We access astrovm5 to load the network in qnet.
To recreate the error, please find the list and control network at /usgs/shareall/FOR_ISIS.
List: SouthPole_2020Merged_SP_and_Lidar2Image4_updated6_image.lis Control network: SouthPole_2020_Merged_Lidar2Image_redo68.net
re:#3871
I can confirm problems with qnet while it opens the list of images Janet indicated in her post. The program doesn't even get to asking for a network to load and it dies while loading the images.
I am running isis7.0.0 and launched qnet from astrovm-lynn which is a special IT build I work on that was built to match astrovm5. I was running from my work users area and the images in the list are also on /work. I'm not sure /scratch is involved at all unless Janet is running things from a /scratch directory.
However, I can load the image and the network in qnet running from astrovm4! So oddly, there is an issue with 5 (and my system) that 4 doesn't have. I don't know why this would be the case and don't have a solution, but I hope this test helps to better inform others who might look into the problem. @jorichie, in the meantime you might try to use astrovm4 and see if that works for you.
I can't activate conda on astrovm4. I get command not found. Any clue why that would be? Ella also gets command not found.
It sounds like it is running out of memory while loading the network. I wonder if there's something wrong with the VM or scratch itself.
I can't activate conda on astrovm4. I get command not found. Any clue why that would be? Ella also gets command not found.
I don't know why that would be. Maybe something about the version you were trying to use? I copied your image list and network to my work users area, set conda to a recent version of isis, and launched qnet:
conda activate isis7.0.0
qnet
I left before the network was fully loaded, but it did load and I can load points, etc. Maybe try that version of isis.
@jessemapel, qnet is up and running on astrovm4 for me and is currently using about 34G of memory. I'm not sure it requires more than that when it is loading the images, but I suppose lack of memory could be a problem for astrovm5. vm5 has maybe 65G of memory, but vm4 has a bit over 100G. Maybe there were too many other things running on vm5 when Janet had her problem, but wouldn't it just run slower and maybe use swap?
Update: No memory available on astrovm5 - IT needs to know about this because it doesn't look like much is actively running over there, so maybe zombie processes or something in the background is using it up.
ast{104}> free -h
              total        used        free      shared  buff/cache   available
Mem:            62G         58G        613M        328K        3.7G        3.7G
Swap:           12G        5.4G        7.4G
Yeah, the "error: killedprocess" failure indicates that the OS abruptly stopped the application for some reason. The most common cause for this is memory issues. Hopefully IT can get the VMs reset and good to use again.
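On POSIX systems, a process terminated by a signal reports a negative return code to its parent, which is how an OOM kill (SIGKILL) or the segfault above would surface in a wrapper script. A small sketch of distinguishing the two cases (the helper name is made up; the signal semantics are standard POSIX behavior as surfaced by Python's subprocess module):

```python
import signal
import subprocess
import sys

def describe_exit(returncode):
    """Translate a subprocess return code into a readable note.
    On POSIX, a negative code means 'killed by signal -returncode'."""
    if returncode >= 0:
        return f"exited with status {returncode}"
    sig = signal.Signals(-returncode)
    return f"killed by {sig.name}"  # e.g. SIGKILL (OOM killer) or SIGSEGV

if __name__ == "__main__":
    # Simulate a process killed the way the kernel OOM killer would kill it.
    proc = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"])
    print(describe_exit(proc.returncode))
```

Logging this from the batch script would make it immediately clear whether a jigsaw run exited cleanly, segfaulted, or was killed by the OS.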
I put in a ticket for IT to have a look and see if they see anything unusual. sinteractive on the cluster is a good alternative to use if astrovm4 has problems too.
Lynn's sinteractive suggestion allows the network to load properly. IT plans to resolve "some stuck processes and excess memory consumption" on Aug. 13 at 5:00 PM. We are now able to run jobs and load qnet after the IT changes without having to use sinteractive.
@jorichie, if jigsaw still can't solve for accelerations for your network, you probably want to keep this post open.
I have been able to reproduce this with a Mars network when adding a single ground control point with a computed covariance matrix.
I have networks available with and without ground points that illustrate the problem. I have linked those internally only at a code.chs.usgs.gov repository and provided access to the folks that are going to be working the problem.
@jorichie Can you post the most recent jigsaw command lines that you ran? We're investigating some issues potentially related to not setting sigmas for every parameter. For example, your original post doesn't set point latitude/longitude sigmas.
Just curious @jessemapel - do you think something might have changed in how point latitude/longitude sigmas are being used over the past several years? Those sigmas were not used for the LROC NAC north pole network or any of the Themis IR bundles (including the global), both of which had ground points and solved for radius, camera accelerations, and spacecraft position without problems. All of that work ended around 2017/2018.
Since that time, I have found I need to add point lat/lon sigmas for Europa, Titan, and now Phobos (but not Kaguya, though maybe it would help with difficult quads). I figured that for the global data sets with limb and disk images having no or few ground points, it was necessary to help keep things from leaping all over the place, and the extra constraints have generally helped. But those are not settings we've been encouraged to use in the past or had a need for. I've tried to pick sigmas that make some sense for the data (considering mostly resolution and how good/bad the spice is), so values ranging from 1500-5000 meters or more have helped for my problem projects.
We've made several changes to things in the bundle over the last several years. The big ones that could impact point sigmas are the rectangular, XYZ, bundle and the lidar support. They were supposed to preserve the existing functionality, but could have introduced a bug. We also haven't 100% confirmed that not setting sigmas is the problem, it's just our best lead right now. Jay is doing a bunch more testing today. @lwellerastro your comments also help back up this being the potential issue.
Also, I agree we should not be setting point lat/lon sigmas in most cases. It seems the logic that is supposed to leave them free has a bug.
Setting them is a workaround for right now.
ISIS version(s) affected: isis3.10.2 on astrovm4, previously astrovm2
Description
Jigsaw will not solve for acceleration. Solves for camera angles and velocity okay. Error:
Validation complete!... starting iteration 1
CHOLMOD error: problem too large. file: ../Supernodal/cholmod_super_symbolic.c line: 683
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_factorize.c line: 121
CHOLMOD error: argument missing. file: ../Cholesky/cholmod_solve.c line: 1062
/var/spool/slurmd/job6627021/slurm_script: line 39: 16235 Segmentation fault (core dumped)
How to reproduce
Here are the parameters used for jigsaw (*also /scratch/jrichie/S-Technique/sbatch4_jigsaw.bsh; all pertinent files can be found at /scratch/jrichie/STechnique):
jigsaw fromlist= new11_updated_jig_00to350.lis cnet= SouthPole_2017Merged_Lidar2Image1_cnetedit.net \
onet= SouthPole_2017Merged_Lidar2Image2.net \
update=no sigma0=1.0e-5 maxits=3 errorpropagation=no \
radius=yes camsolve=accelerations twist=yes overexisting=yes \
outlier_rejection=no \
spsolve=position overhermite=yes \
camera_angles_sigma=1.0 \
camera_angular_velocity_sigma=0.5 \
camera_angular_acceleration_sigma=0.25 \
spacecraft_position_sigma=250 \
point_radius_sigma=150
Possible Solution
Check that the memory requirement is sufficient, and check for any "number of" limitations (if any) in the code. Weller was able to solve for acceleration for the north pole. Differences: the South Pole network is twice as large and has a significant amount of shadows.
Additional context