OpenDroneMap / odm-benchmarks

Benchmark data index for OpenDroneMap and WebODM

Processing big datasets #10

Open smathermather opened 4 years ago

smathermather commented 4 years ago

I don't know if this needs to be part of the odm-benchmarks project, but I have a particularly large dataset to process, so I am doing some monitoring of individual stages so that I can do a better job predicting processing time over the life of the project. I thought I would document that here in case it is useful to see.

smathermather commented 4 years ago

Each stage has a last written file. We can use that last written file to understand when the stage completed. It's possible to do something similar for each sub-stage of OpenSfM as well, but for now we will restrict ourselves to the overall stage times. The following is with a ~7400 image dataset on a 20-core machine with 768GB of RAM:

ls -lhtr odm_dem/dtm.tif mve/mve_dense_point_cloud.ply odm_filterpoints/point_cloud.ply \
    odm_georeferencing/odm_georeferenced_model.laz odm_georeferencing_25d/odm_georeferencing_log.txt \
    odm_meshing/odm_25dmesh.ply odm_meshing/odm_mesh.ply odm_orthophoto/odm_orthophoto.tif \
    odm_texturing/odm_textured_model_geo.mtl odm_texturing_25d/odm_textured_model_geo.mtl \
    opensfm/geocoords_transformation.txt odm_report/shots.geojson img_list.txt
-rw-r--r-- 1 useruser useruser 220K Jun 25 16:20 img_list.txt
-rw-r--r-- 1 useruser useruser  192 Jun 27 00:18 opensfm/geocoords_transformation.txt
-rw-r--r-- 1 useruser useruser  49G Jul  3 15:40 mve/mve_dense_point_cloud.ply
-rw-r--r-- 1 useruser useruser  49G Jul  4 02:05 odm_filterpoints/point_cloud.ply
-rw-r--r-- 1 useruser useruser 7.3M Jul  4 02:23 odm_meshing/odm_mesh.ply
-rw-r--r-- 1 useruser useruser 7.3M Jul  4 05:50 odm_meshing/odm_25dmesh.ply
-rw-r--r-- 1 useruser useruser  36K Jul  4 08:13 odm_texturing_25d/odm_textured_model_geo.mtl
-rw-r--r-- 1 useruser useruser 2.3K Jul  4 09:45 odm_georeferencing_25d/odm_georeferencing_log.txt
-rw-r--r-- 1 useruser useruser  36K Jul  4 10:56 odm_texturing/odm_textured_model_geo.mtl
-rw-r--r-- 1 useruser useruser 4.9G Jul  4 12:26 odm_georeferencing/odm_georeferenced_model.laz
-rw-r--r-- 1 useruser useruser 1.8G Jul  4 18:43 odm_dem/dtm.tif
-rw-r--r-- 1 useruser useruser  16G Jul  4 21:27 odm_orthophoto/odm_orthophoto.tif
-rw-r--r-- 1 useruser useruser  845 Jul  4 21:29 odm_report/shots.geojson
smathermather commented 4 years ago
If we do a bit of calculation against these dates, we can see the progression (a small shell sketch to automate this follows the table):

| date | stage | stage length (h) | overall time (d) | stage percentage | overall percentage |
|---|---|---|---|---|---|
| 6/25/2020 16:20 | start | 0.00 | 0.00 | 0.00% | 0.00% |
| 6/27/2020 0:18 | opensfm | 31.97 | 1.33 | 14.45% | 14.45% |
| 7/3/2020 15:40 | mve | 159.37 | 7.97 | 72.06% | 86.52% |
| 7/4/2020 2:05 | odm_filterpoints | 10.42 | 8.41 | 4.71% | 91.23% |
| 7/4/2020 2:23 | odm_meshing | 0.30 | 8.42 | 0.14% | 91.36% |
| 7/4/2020 5:50 | odm_meshing | 3.45 | 8.56 | 1.56% | 92.92% |
| 7/4/2020 8:13 | odm_texturing_25d | 2.38 | 8.66 | 1.08% | 94.00% |
| 7/4/2020 9:45 | odm_georeferencing_25d | 1.53 | 8.73 | 0.69% | 94.69% |
| 7/4/2020 10:56 | odm_texturing | 1.18 | 8.78 | 0.54% | 95.23% |
| 7/4/2020 12:26 | odm_georeferencing | 1.50 | 8.84 | 0.68% | 95.91% |
| 7/4/2020 18:43 | odm_dem | 6.28 | 9.10 | 2.84% | 98.75% |
| 7/4/2020 21:27 | odm_orthophoto | 2.73 | 9.21 | 1.24% | 99.98% |
| 7/4/2020 21:29 | odm_report | 0.03 | 9.21 | 0.02% | 100.00% |
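
For reference, here is a small shell sketch of that calculation (the file list and paths are assumptions based on the listing above; run it from the project directory):

#!/bin/bash
# Rough sketch: print hours elapsed between consecutive stages, using the
# modification time of each stage's last written file.
previous=""
for f in img_list.txt opensfm/geocoords_transformation.txt \
         mve/mve_dense_point_cloud.ply odm_filterpoints/point_cloud.ply \
         odm_meshing/odm_mesh.ply odm_meshing/odm_25dmesh.ply \
         odm_texturing_25d/odm_textured_model_geo.mtl \
         odm_georeferencing_25d/odm_georeferencing_log.txt \
         odm_texturing/odm_textured_model_geo.mtl \
         odm_georeferencing/odm_georeferenced_model.laz \
         odm_dem/dtm.tif odm_orthophoto/odm_orthophoto.tif odm_report/shots.geojson
do
    current=`stat -c %Y "$f"`        # mtime as a Unix timestamp
    if [ -n "$previous" ]
    then
        hours=`echo "($current - $previous) / 3600" | bc -l`
        printf "%-45s %6.2f h\n" "$f" "$hours"
    fi
    previous=$current
done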
smathermather commented 4 years ago

Honestly, I expected OpenSfM to be a more expensive piece, and props to @pierotofy for bringing down that odm_dem number so much.

smathermather commented 4 years ago

Now I am testing with the same 7400 image dataset, but resized using https://github.com/pierotofy/exifimageresize: the images are resized to 1280 first, and the orthophoto and elevation model resolutions are set pretty low. The plan is to do hydrological modeling on these data, and orthos have already been generated by other means, so a good low-res DEM is the starting product I need. Why waste the memory and processing time on full resolution products?

First, log in to the docker instance:

docker run -ti -v /home/useruser/outdir:/datasets --entrypoint bash odmnolign

Then, we run the run.py script directly. This gives us some flexibility in resuming, or tweaking as we go:

./run.py --project-path /datasets --project-path /datasets datasetname --dtm --dsm --smrf-scalar 0 --smrf-slope 0.06 --smrf-threshold 1.05 --smrf-window 49 --dem-resolution 414 --depthmap-resolution 320 --ignore-gsd --orthophoto-resolution 138
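
If a later stage fails or needs tweaking, the same invocation can be resumed rather than restarted from scratch. A hedged sketch (the --rerun-from stage name here is an assumption; check ./run.py --help for the exact choices in your version):

# Resume the same project from point cloud filtering onward, instead of
# re-running feature extraction, matching, and dense reconstruction
./run.py --project-path /datasets datasetname --rerun-from odm_filterpoints \
    --dtm --dsm --dem-resolution 414 --depthmap-resolution 320 --ignore-gsd --orthophoto-resolution 138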

I expect the OpenSfM step to take about the same length of time (30+ hours), but the depthmap step to be 4x as fast, which should get this done over the weekend instead of in 9 days. 🤞

This time, I am also monitoring using Glances so I have the full curve of resource usage through the running of the project.

smathermather commented 4 years ago

I am going to keep this thread going and expand a bit to include info about how to process monster datasets. This will be messy, and riddled with contradictions, but it will ultimately probably end up as a section in docs.opendronemap.org, at least the how-to bits.

smathermather commented 4 years ago

We should probably add a flag for this, but I have removed the align portion of split-merge. For massive datasets, the combination of SfM, error distribution, and OpenDroneMap's use of DOP and other accuracy tags seems to make the alignment step not just unnecessary but a source of error. Here's my branch, which removes that step in a brutish way: https://github.com/smathermather/ODM/tree/noalign

git clone https://github.com/smathermather/ODM
cd ODM
git checkout noalign
docker build -t odmnolign .
smathermather commented 4 years ago

Also, for massive datasets, perhaps the objective (as it is in my case) is to capture elevation models, not orthophotos. So one trick we can do is to resample the data down quite a bit, and then adjust the output resolutions to save on memory usage and processing time. A good tool for that is UAV4Geo's exifimageresize (https://github.com/pierotofy/exifimageresize), which will resize an entire directory of images but keep the EXIF the same (aside from the EXIF tags for image dimensions).

I resized my dataset to 1280 which should leave enough features for matching and enough for depthmaps as well.
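
If the linked tool isn't handy, a rough alternative (an assumption on my part, not what I actually ran) is a plain ImageMagick pass; it keeps the EXIF profile, though unlike exifimageresize it won't rewrite the image dimension tags:

# Work on a copy; mogrify edits images in place.
cp -r images images_resized
cd images_resized
mogrify -resize 1280x1280\> *.JPG    # '>' only shrinks images larger than 1280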

When I ran my 7400 image subset at default depthmap resolution of 640, it took 9 days, 5 hours to complete. Reducing that resolution to 320, we get completion in 3 days, 18 hours.

| date | stage | stage length (h) | overall time (d) | stage percentage | overall percentage |
|---|---|---|---|---|---|
| 7/10/2020 3:40 | start | 0.00 | 0.00 | 0.00% | 0.00% |
| 7/11/2020 14:55 | opensfm | 35.25 | 1.47 | 39.07% | 39.07% |
| 7/13/2020 15:57 | mve | 49.03 | 3.51 | 54.34% | 93.41% |
| 7/13/2020 16:47 | odm_filterpoints | 0.83 | 3.55 | 0.92% | 94.33% |
| 7/13/2020 16:52 | odm_meshing | 0.08 | 3.55 | 0.09% | 94.42% |
| 7/13/2020 17:27 | odm_meshing | 0.58 | 3.57 | 0.65% | 95.07% |
| 7/13/2020 19:51 | odm_texturing_25d | 2.40 | 3.67 | 2.66% | 97.73% |
| 7/13/2020 20:12 | odm_georeferencing_25d | 0.35 | 3.69 | 0.39% | 98.12% |
| 7/13/2020 20:31 | odm_texturing | 0.32 | 3.70 | 0.35% | 98.47% |
| 7/13/2020 21:20 | odm_georeferencing | 0.82 | 3.74 | 0.91% | 99.37% |
| 7/13/2020 21:43 | odm_dem | 0.38 | 3.75 | 0.42% | 99.80% |
| 7/13/2020 21:52 | odm_orthophoto | 0.15 | 3.76 | 0.17% | 99.96% |
| 7/13/2020 21:54 | odm_report | 0.03 | 3.76 | 0.04% | 100.00% |
smathermather commented 4 years ago

Also, docker is cool, but when things fail (as they often do with big datasets), reconnecting to the docker machine to inspect things can be difficult, since the ODM docker image is meant to shut down automatically when done (whether it failed or not). So, instead of the typical docker run, we can start the instance and run the run.py command from within the container. This ensures we can reconnect, poke around, and do all the needful things:

docker run -ti -v /home/gisuser/outdir:/datasets --entrypoint bash odmnolign

Now we are logged in and can run our OpenDroneMap command:

./run.py --project-path /datasets --project-path /datasets resize --dtm --dsm --smrf-scalar 0 --smrf-slope 0.06 --smrf-threshold 1.05 --smrf-window 49 --dem-resolution 413 --depthmap-resolution 320

If we need to reconnect, we can look for our docker container:

docker ps

And then use the container ID or name to connect:

docker exec -it <docker hash or nickname> bash

We can also use docker logs to check in on the process:

docker logs <docker hash or nickname> | tail
smathermather commented 4 years ago

For particularly large datasets, like the 80,000+ image dataset I am currently attempting, we need to invoke split-merge by using the --split option and we probably also want to use hybrid bundle adjustment to increase the efficiency of our structure from motion step:

./run.py --project-path /datasets --project-path /datasets resize --dtm --dsm \
--smrf-scalar 0 --smrf-slope 0.06 --smrf-threshold 1.05 --smrf-window 49 \
--dem-resolution 413 --depthmap-resolution 320 --split 7000 --use-hybrid-bundle-adjustment
smathermather commented 4 years ago

A lot of folks don't necessarily understand the relationship between depthmap resolution and ground sampling resolution, so I should explain:

Suppose we have a 6400x4200 image with a ground sampling distance (pixel size) of 10cm. The maximum depthmap resolution we would want to use is half the linear image resolution, i.e. 3200x2100, which could be specified with --depthmap-resolution 3200. This means the finest sensible elevation model resolution is twice the ground sampling distance, or 20cm.

I often aim for 1/4 the resolution as it is a bit more robust to noise, especially on specular reflectors like metal rooftops, and saves a lot of computation time. In this case, I would set --depthmap-resolution 1600 and --dem-resolution 40.
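
To make the arithmetic concrete, here is a tiny sketch of that rule of thumb (the numbers are just the example values above, not ODM defaults):

#!/bin/bash
# Rule-of-thumb helper: given image width (px) and ground sampling distance
# (cm/px), print the maximum and the more conservative 1/4 settings.
width=6400   # image width in pixels
gsd=10       # ground sampling distance in cm per pixel

echo "max:    --depthmap-resolution `echo "$width / 2" | bc` --dem-resolution `echo "$gsd * 2" | bc`"
echo "robust: --depthmap-resolution `echo "$width / 4" | bc` --dem-resolution `echo "$gsd * 4" | bc`"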

In our case above, I have also resized the input images as well. Since I am not too interested in the orthophotos, this saves on memory usage in the texturing step, and as I want this to run as fast as possible, I have optimized for a depthmap resolution of 320.

Most of the time, OpenDroneMap uses good defaults, good optimization, and chooses these settings for us. But for massive datasets, we probably need to be more thoughtful than using the defaults. At least, until we have good standards for these larger datasets and can embody some of this rationale into the code base.

smathermather commented 4 years ago

A nearly useless (but occasionally useful) utility to track feature extraction progress, written in shell:

#!/bin/bash
# Find out how many images there are
totallines=`wc -l ~/outdir/znz/resize/img_list.txt | awk '{print $1}'`

previouspermillage=0
while true
do
    # Count the number of files in the OpenSfM features directory
    num1=`find ~/outdir/znz/resize/opensfm/features/ -type f | wc -l `

    # Calculate percentage of images that have had features extracted
    percentage=`echo $num1 / $totallines \* 100| bc -l`

    # We calculate permillage for the sake of determining whether we are 
    # going to display a new number
    permillage=`echo $num1 / $totallines \* 1000| bc -l`

    # If our rounded permillage number has increased, then display a new number
    if [ `printf "%.0f\n" $permillage` -gt `printf "%.0f\n" $previouspermillage` ]
    then
#        echo
        echo -n `printf "%.1f\n" $percentage`   
    fi

    # Speak less. Smile more.
    echo -n "."
    sleep 5

    # Check if we are done
    if [ `printf "%.0f\n" $percentage` -gt 99 ]
    then
        # Bing!
        echo -ne '\007'
        echo "All done with feature extraction!"
        break              
    fi 

previouspermillage=$permillage

done
coreysnipes commented 4 years ago

Just wanted to pop in and say I'm glad you're including this info here @smathermather .

smathermather commented 4 years ago

Hey thanks! I was hoping I wasn't creating unneeded noise. But, I need to track this, and I figure it can become a platform for further use and understanding.

smathermather commented 4 years ago

I have stopped using the above approach of logging into the docker instance and running the script. It seems enough to just remove the --rm flag when using docker run; then it's possible to log in to the machine if needed. I have also modified my branch a little more to default to BOW matching, in the hopes of getting faster matching:

docker run -ti -v /home/gisuser/outdir/znz:/datasets odmnolign --project-path /datasets resize --dtm --dsm --smrf-scalar 0 --smrf-slope 0.06 --smrf-threshold 1.05 --smrf-window 100 --dem-resolution 413 --depthmap-resolution 320 --split 7000 --use-hybrid-bundle-adjustment
smathermather commented 4 years ago

One of the big challenges with massive datasets is catching issues before they become issues. In my case, that means finding small pockets of bad data. We probably need more robust error checking in the chain. For example, I had 4 files with some issue (they probably didn't fully come back from glacier storage before being transferred). That's a pretty reasonable error rate, but the challenge is that each restart after finding bad data is a setback of a few days of processing:

2020-07-23 18:04:15,105 DEBUG: Matching deduped2016-10-11_09.23.08.JPG and deduped2016-10-11_09.21.04.JPG.  Matcher: WORDS (symmetric) T-desc: 0.165 T-robust: 0.057 T-total: 0.223 Matches: 447 Robust: 430 Success: True
2020-07-23 18:04:15,120 DEBUG: No segmentation for geoscleanup2016-10-11_09.29.52.JPG, no features masked.
2020-07-23 18:04:15,388 DEBUG: Matching deduped2016-10-11_09.23.08.JPG and geoscleanup2016-10-11_09.29.52.JPG.  Matcher: WORDS (symmetric) T-desc: 0.163 T-robust: 0.105 T-total: 0.269 Matches: 543 Robust: 531 Success: True
Traceback (most recent call last):
  File "/code/SuperBuild/src/opensfm/bin/opensfm", line 34, in <module>
    command.run(args)
  File "/code/SuperBuild/src/opensfm/opensfm/commands/match_features.py", line 29, in run
    pairs_matches, preport = matching.match_images(data, images, images)
  File "/code/SuperBuild/src/opensfm/opensfm/matching.py", line 43, in match_images
    return match_images_with_pairs(data, exifs, ref_images, pairs), preport
  File "/code/SuperBuild/src/opensfm/opensfm/matching.py", line 67, in match_images_with_pairs
    matches = context.parallel_map(match_unwrap_args, args, processes, jobs_per_process)
  File "/code/SuperBuild/src/opensfm/opensfm/context.py", line 41, in parallel_map
    return Parallel(batch_size=batch_size)(delayed(func)(arg) for arg in args)
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 934, in __call__
    self.retrieve()
  File "/usr/local/lib/python2.7/dist-packages/joblib/parallel.py", line 862, in retrieve
    raise exception.unwrap(this_report)
joblib.my_exceptions.JoblibIOError: JoblibIOError
___________________________________________________________________________
...........................................................................
/code/SuperBuild/src/opensfm/bin/opensfm in <module>()
     29 args = parser.parse_args()
     30 
     31 # Run the selected subcommand
     32 for command in subcommands:
     33     if args.command == command.name:
---> 34         command.run(args)
...........................................................................
/code/SuperBuild/src/opensfm/opensfm/commands/match_features.py in run(self=<opensfm.commands.match_features.Command instance>, args=Namespace(command='match_features', dataset='/datasets/resize/opensfm'))
     24     def run(self, args):
     25         data = dataset.DataSet(args.dataset)
     26         images = data.images()
     27 
     28         start = timer()
---> 29         pairs_matches, preport = matching.match_images(data, images, images)
        pairs_matches = undefined
        preport = undefined
        data = <opensfm.dataset.DataSet object>
        images = [u'DSC00034_2978.jpg', u'DSC00179_4743.jpg', u'DSC00180_4703.jpg', u'DSC00183_4739.jpg', u'DSC00184_4720.jpg', u'DSC00190_4051.jpg', u'DSC00209_4338.jpg', u'DSC00211_4349.jpg', u'DSC00213_4333.jpg', u'DSC00216_4301.jpg', u'DSC00221_3845.jpg', u'DSC00225_3831.jpg', u'DSC00229_4355.jpg', u'DSC00230_3827.jpg', u'DSC00230_4335.jpg', u'DSC00235_4346.jpg', u'DSC00240_4160.jpg', u'DSC00241_3752.jpg', u'DSC00241_4176.jpg', u'DSC00244_4182.jpg', ...]
     30         matching.save_matches(data, images, pairs_matches)
     31         end = timer()
     32 
     33         with open(data.profile_log(), 'a') as fout:
...........................................................................
/code/SuperBuild/src/opensfm/opensfm/matching.py in match_images(data=<opensfm.dataset.DataSet object>, ref_images=[u'DSC00034_2978.jpg', u'DSC00179_4743.jpg', u'DSC00180_4703.jpg', u'DSC00183_4739.jpg', u'DSC00184_4720.jpg', u'DSC00190_4051.jpg', u'DSC00209_4338.jpg', u'DSC00211_4349.jpg', u'DSC00213_4333.jpg', u'DSC00216_4301.jpg', u'DSC00221_3845.jpg', u'DSC00225_3831.jpg', u'DSC00229_4355.jpg', u'DSC00230_3827.jpg', u'DSC00230_4335.jpg', u'DSC00235_4346.jpg', u'DSC00240_4160.jpg', u'DSC00241_3752.jpg', u'DSC00241_4176.jpg', u'DSC00244_4182.jpg', ...], cand_images=[u'DSC00034_2978.jpg', u'DSC00179_4743.jpg', u'DSC00180_4703.jpg', u'DSC00183_4739.jpg', u'DSC00184_4720.jpg', u'DSC00190_4051.jpg', u'DSC00209_4338.jpg', u'DSC00211_4349.jpg', u'DSC00213_4333.jpg', u'DSC00216_4301.jpg', u'DSC00221_3845.jpg', u'DSC00225_3831.jpg', u'DSC00229_4355.jpg', u'DSC00230_3827.jpg', u'DSC00230_4335.jpg', u'DSC00235_4346.jpg', u'DSC00240_4160.jpg', u'DSC00241_3752.jpg', u'DSC00241_4176.jpg', u'DSC00244_4182.jpg', ...])
     38     # Generate pairs for matching
     39     pairs, preport = pairs_selection.match_candidates_from_metadata(
     40         ref_images, cand_images, exifs, data)
     41 
     42     # Match them !
---> 43     return match_images_with_pairs(data, exifs, ref_images, pairs), preport
        data = <opensfm.dataset.DataSet object>
        exifs = {u'DSC00034_2978.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473250145.0, u'focal_ratio': 0, u'gps': {u'altitude': 74.435, u'latitude': -6.050773555555556, u'longitude': 39.21183877777778}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00179_4743.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251824.0, u'focal_ratio': 0, u'gps': {u'altitude': 88.705, u'latitude': -6.061974111111111, u'longitude': 39.207787083333336}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00180_4703.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251828.0, u'focal_ratio': 0, u'gps': {u'altitude': 86.979, u'latitude': -6.061924805555555, u'longitude': 39.20743983333334}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00183_4739.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251839.0, u'focal_ratio': 0, u'gps': {u'altitude': 87.518, u'latitude': -6.06194775, u'longitude': 39.20635772222222}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00184_4720.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251843.0, u'focal_ratio': 0, u'gps': {u'altitude': 87.861, u'latitude': -6.061936972222222, u'longitude': 39.205986055555556}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00190_4051.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251820.0, u'focal_ratio': 0, u'gps': {u'altitude': 79.314, u'latitude': -6.059792916666667, u'longitude': 39.21130225}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00209_4338.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251954.0, u'focal_ratio': 0, u'gps': {u'altitude': 88.492, u'latitude': -6.060617416666666, u'longitude': 39.21422186111111}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00211_4349.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251961.0, u'focal_ratio': 0, u'gps': {u'altitude': 89.117, u'latitude': -6.060633027777778, u'longitude': 39.21359697222223}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00213_4333.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251969.0, u'focal_ratio': 0, u'gps': {u'altitude': 87.932, u'latitude': -6.060612166666666, u'longitude': 39.21292638888889}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', u'orientation': 1, u'projection_type': u'brown', ...}, u'DSC00216_4301.jpg': {u'band_name': u'RGB', u'camera': u'v2 sony dsc-wx220 1280 960 brown 0 rgb', u'capture_time': 1473251980.0, u'focal_ratio': 0, u'gps': {u'altitude': 86.411, u'latitude': -6.060592916666667, u'longitude': 39.212014583333335}, u'height': 960, u'make': u'SONY', u'model': u'DSC-WX220', 
u'orientation': 1, u'projection_type': u'brown', ...}, ...}
        ref_images = [u'DSC00034_2978.jpg', u'DSC00179_4743.jpg', u'DSC00180_4703.jpg', u'DSC00183_4739.jpg', u'DSC00184_4720.jpg', u'DSC00190_4051.jpg', u'DSC00209_4338.jpg', u'DSC00211_4349.jpg', u'DSC00213_4333.jpg', u'DSC00216_4301.jpg', u'DSC00221_3845.jpg', u'DSC00225_3831.jpg', u'DSC00229_4355.jpg', u'DSC00230_3827.jpg', u'DSC00230_4335.jpg', u'DSC00235_4346.jpg', u'DSC00240_4160.jpg', u'DSC00241_3752.jpg', u'DSC00241_4176.jpg', u'DSC00244_4182.jpg', ...]
        pairs = [(u'deduped2016-08-02_14.23.18.JPG', u'all2017-05-23_11.29.11.JPG'), (u'deduped2017-02-10_08.52.13.JPG', u'deduped2017-02-10_08.51.39.JPG'), (u'deduped2016-09-08_12.17.18.JPG', u'deduped2016-09-08_14.10.05.JPG'), (u'deduped2000-01-01_00.47.13.JPG', u'geoscleanup2000-01-01_00.47.24.JPG'), (u'deduped2016-09-07_10.42.44.JPG', u'deduped2016-09-07_09.33.17.JPG'), (u'deduped2016-09-27_09.01.55.JPG', u'deduped2016-09-27_09.02.02.JPG'), (u'deduped2016-08-31_09.57.52.JPG', u'deduped2016-08-31_09.59.03.JPG'), (u'geoscleanup2000-01-01_00.23.35.JPG', u'geoscleanup2000-01-01_00.24.35.JPG'), (u'deduped2016-09-21_11.45.01.JPG', u'deduped2016-08-31_09.12.46.JPG'), (u'deduped2017-06-16_08.50.19.JPG', u'deduped2017-06-16_08.50.25.JPG'), (u'deduped2017-06-17_09.04.08.JPG', u'deduped2017-06-17_09.04.03.JPG'), (u'deduped2017-06-24_09.36.32.JPG', u'deduped2017-06-24_09.18.45.JPG'), (u'deduped2016-09-01_09.52.58.JPG', u'deduped2016-09-01_10.10.46.JPG'), (u'deduped2016-09-02_08.38.28.JPG', u'deduped2016-09-02_13.37.31.JPG'), (u'deduped2016-08-03_14.58.03.JPG', u'geoscleanup2016-07-30_14.22.00.JPG'), (u'deduped2016-09-06_11.11.15.jpg', u'deduped2016-09-06_11.11.11.jpg'), (u'deduped2016-09-21_09.43.26.JPG', u'deduped2016-09-21_09.43.31.JPG'), (u'deduped2016-09-01_09.58.39.JPG', u'deduped2016-09-01_09.58.43.JPG'), (u'deduped2016-09-07_10.06.38.jpg', u'deduped2016-09-07_09.32.13.jpg'), (u'deduped2016-09-07_09.3.

I still need to take a look at the 4 bad files and find out why they are failing, but it seems we have uncaught errors in the feature extraction step, which then stop the toolchain at the matching stage.
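
One mitigation (a sketch, not something the toolchain does for us) is a pre-flight integrity check of the inputs before kicking off a multi-day run. This assumes jpeginfo is installed and that the images live under the project's images directory:

# Flag any JPEG that does not fully decode, so it can be re-transferred
# before processing starts.
find /datasets/resize/images -iname '*.jpg' -print0 \
    | xargs -0 jpeginfo -c \
    | grep -Ei 'warning|error'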

smathermather commented 4 years ago

Now we are in the matcher stage for these data. As noted above, I forced bag-of-words (BOW) matching to get a speed bump. So, how do we track progress here? We can look at the logs and see how far through the matching we are:

docker logs great_golick | grep "Matching" | wc -l
102155

But that just gives us some relative progress. How many matches should we have? The total number of images helps inform this:

wc -l ~/outdir/znz/resize/img_list.txt | awk '{print $1}'
82087

We can assume the total number of matches will be the number of images (82,087) times the number of matcher images (8 here, rather than the default of 20), for 656,696.

Oof. We've got a way to go.

smathermather commented 4 years ago

We can wrap the above in a monitoring script which displays a number each time we progress by 1/10th of a percentage:

#!/bin/bash
# Find out how many images there are
totallines=`wc -l ~/outdir/znz/resize/img_list.txt | awk '{print $1}'`

previouspermillage=0
while true
do
    # Count the number of matches
    num1=`docker logs great_golick | grep "Matching" | wc -l`

    # Calculate the percentage of image matching completed
    percentage=`echo $num1 / $totallines / 8 \* 100| bc -l`

    # We calculate permillage for the sake of determining whether we are 
    # going to display a new number
    permillage=`echo $num1 / $totallines / 8 \* 1000| bc -l`

    # If our rounded permillage number has increased, then display a new number
    if [ `printf "%.0f\n" $permillage` -gt `printf "%.0f\n" $previouspermillage` ]
    then
#        echo
        echo -n `printf "%.1f\n" $percentage`   
    fi

    # Speak less. Smile more.
    echo -n "."
    sleep 5

    # Check if we are done
    if [ `printf "%.0f\n" $percentage` -gt 99 ]
    then
        # Bing!
        echo -ne '\007'
        echo "All done with matching!"
        break              
    fi 

previouspermillage=$permillage

done
smathermather commented 4 years ago

👆🏻 This doesn't really work, btw. I don't know how to estimate how many matches we are supposed to have.

smathermather commented 4 years ago

Buying instead of Renting

Usually if we do split-merge, we are renting some infrastructure, and the autoscaler in ClusterODM is the way to go. It's a fantastic tool. But sometimes we have infrastructure we own, so I will document a bit of the management of owned infrastructure here. It will eventually make its way into docs.opendronemap.org as it matures, but I want to capture it here while it's fresh.

smathermather commented 4 years ago

Parallel shell use:

parallel-ssh is what I am using to manage all the VMs I have across our infrastructure. I have 3 sizes of machine: one monstrous 20-core, 768GB machine; a single 12-core, 192GB machine that is no slouch; and 4 x 8-core, 96GB machines. So they need a bit of different care and feeding in use.

Connecting to hosts

Logins to the machines are handled in my .ssh/config:

Host proxyhost
        User useruser
        Hostname proxyhost.someaddress.com
        Port 22

Host webodm
        User loseruser
        Hostname 192.168.0.10
        Port 22
        ProxyCommand ssh -q -W %h:%p proxyhost

Host node1
        User winorloser
        Hostname 192.168.0.11
        Port 22
        ProxyCommand ssh -q -W %h:%p proxyhost

Host node2
        User bob
        Hostname 192.168.0.12
        Port 22
        ProxyCommand ssh -q -W %h:%p proxyhost
....

Clusters of hosts

Clusters of hosts that need to be controlled together are then listed in hosts files as follows:

First all the machines for when we want to run commands on all:

more .ssh/odmall
webodm
node1
node2
node3
node4
node5

Then come the groups of hosts that we want to control. I do have groups of one, just to keep my workflow uniform. Alternatively, I could use plain ssh for those instead of parallel-ssh.

The parent host file contains just one machine:

more .ssh/odmrent
webodm

The big machine is in a host file all its own:

more .ssh/odmbigs
node5

All the smaller machines are together in a file:

more  .ssh/odmsmoll
node1
node2
node3
node4

And finally, we have a hosts file with just the child nodes:

more  .ssh/odmhosts
node1
node2
node3
node4
node5

Now we can do things with these groups. Let's dive in.

Using clusters of hosts

Killing all the NodeODM Instances

We might want to kill all the NodeODM docker machines on the host when there are failures. We can do so as follows:

parallel-ssh  --timeout 15 -i -h ~/.ssh/odmall 'docker kill $(docker ps -a -q)'

[1] 17:27:42 [FAILURE] webodm Exited with error code 1
Stderr: "docker kill" requires at least 1 argument.
See 'docker kill --help'.

Usage:  docker kill [OPTIONS] CONTAINER [CONTAINER...]

Kill one or more running containers
[2] 17:27:42 [SUCCESS] node5
43b423c3aa25
[3] 17:27:42 [SUCCESS] node4
70452bf0691e
[4] 17:27:42 [SUCCESS] node2
20da8ae8b876
[5] 17:27:43 [SUCCESS] node1
62440df4f007
[6] 17:27:43 [SUCCESS] node3
b202fd6258ed

Running NodeODM on the hosts:

Since we have 3 sizes of machines, we need different commands to start NodeODM on each. We set max_concurrency based on the number of cores and max_images based on the available RAM when starting each node:

parallel-ssh --timeout 15 -i -h ~/.ssh/odmrent "docker run -p 3000:3000 opendronemap/nodeodm --max_concurrency 20 --max_images 1000000&"

[1] 17:32:14 [SUCCESS] webodm

parallel-ssh --timeout 15 -i -h ~/.ssh/odmbigs "docker run -p 3000:3000 opendronemap/nodeodm --max_concurrency 12 --max_images 10000&"

[1] 17:32:44 [FAILURE] node5 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000

parallel-ssh --timeout 15 -i -h ~/.ssh/odmsmoll "docker run -p 3000:3000 opendronemap/nodeodm --max_concurrency 8 --max_images 5000&"
[1] 17:33:23 [FAILURE] node1 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000
[2] 17:33:23 [FAILURE] node2 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000
[3] 17:33:23 [FAILURE] node3 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000
[4] 17:33:23 [FAILURE] node4 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000

I set a short timeout here, but the first time this is run we probably want a much longer timeout so that docker has time to download the images.
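
Alternatively, we can pre-pull the image on every host with a generous timeout first (a sketch using the odmall group from above), so the short-timeout start commands don't race the download:

parallel-ssh --timeout 3600 -i -h ~/.ssh/odmall "docker pull opendronemap/nodeodm"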

smathermather commented 4 years ago

Stitching the cluster together

Now we need a ClusterODM instance to use these nodes. It would be really easy to do this as a docker instance, but then we can't use our host names so easily, and ClusterODM is easy enough to run locally, so we'll do that. First we add the nodes to our hosts file:

sudo vi /etc/hosts
127.0.0.1       localhost.localdomain   localhost
::1             localhost6.localdomain6 localhost6

192.168.0.21       node1
192.168.0.22       node2
192.168.0.23       node3
192.168.0.24       node4
192.168.0.25       node5

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Then we get and install ClusterODM:

git clone https://github.com/OpenDroneMap/ClusterODM
cd ClusterODM
npm install

Then we run our ClusterODM node, in this case with a limit on upload speed to avoid packet loss. It'll still be pretty fast:

node index.js --public-address http://192.168.0.10:3000 --upload-max-speed 200000000 &

ClusterODM needs the nodes attached. For this, we telnet in:

telnet localhost 8080
Trying ::1...
Connected to localhost.localdomain.
Escape character is '^]'.
Welcome ::1:33106 ClusterODM:1.4.3
HELP for help
QUIT to quit
#>

And we need to add our hosts:

#> NODE ADD webodm 3001
OK
#> NODE ADD node1 3000
OK
#> NODE ADD node2 3000
OK
#> NODE ADD node3 3000
OK
#> NODE ADD node4 3000
OK
#> NODE ADD node5 3000
OK

We can list our available hosts:

#> NODE LIST
1) webodm:3001 [online] [0/1] <version 1.6.1>
2) node1:3000 [online] [0/1] <version 1.6.1>
3) node2:3000 [online] [0/1] <version 1.6.1>
4) node3:3000 [online] [0/1] <version 1.6.1>
5) node4:3000 [online] [0/1] <version 1.6.1>
6) node5:3000 [online] [0/1] <version 1.6.1>

And the settings we used when starting these nodes will drive the load balancing of any split-merge work. We can check this with NODE BEST <integer>, e.g.:

#> NODE BEST 5000
1) node1:3000 [online] [0/1] <version 1.6.1>
#> NODE BEST 9000
1) node5:3000 [online] [0/1] <version 1.6.1>
#> NODE BEST 50000
1) webodm:3001 [online] [0/1] <version 1.6.1>