GeoscienceAustralia / uncover-ml

Machine Learning system for Geoscience Australia uncover project
Apache License 2.0

Disc write takes majority of time during prediction #50

Closed basaks closed 5 years ago

basaks commented 5 years ago

Some prediction benchmarks using multirandomforest and only 5 covariates on a national dataset for one partition (out of 100):

Data prep, imputing and everything else before prediction takes about (10932-10904) = 28 seconds.
Prediction takes about (10974-10932) = 42 seconds.
Writing the prediction output to disc takes about (11125-10974) = 151 seconds.

Writing to disc takes about 68% of the time (151 of 221 seconds).
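As a quick check of the arithmetic, the breakdown can be recomputed from the timestamps in the partition 61 log in this comment. This is only a sketch: the variable names are made up for illustration, and the phase boundaries are the log lines noted in the comments.

```python
# Timestamps in seconds, read from the partition 61 log lines.
start = 10904          # "starting to render partition 61"
predict_start = 10932  # "Predicting targets for multirandomforest."
write_start = 10974    # "Writing partition to output file"
end = 11125            # "starting to render partition 62"

# Duration of each phase in seconds.
phases = {
    "prep": predict_start - start,
    "predict": write_start - predict_start,
    "write": end - write_start,
}
total = sum(phases.values())

# Share of the partition's wall time spent in each phase, in percent.
shares = {name: round(100 * secs / total, 1) for name, secs in phases.items()}

for name, secs in phases.items():
    print(f"{name:8s}{secs:5d} s {shares[name]:5.1f}%")
```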

Writing to disc in uncoverml currently uses only one CPU. It could be sped up by using multiple CPUs.

Here is the log for one partition during prediction:

+10904s uncoverml.scripts.uncoverml:INFO starting to render partition 61
+10911s uncoverml.geoio:INFO /g/data1a/ge3/covariates/national_albers_filled_new/albers_cropped/Dose_2016.tif: [19656796]px 11.20% missing
+10916s uncoverml.geoio:INFO /g/data1a/ge3/covariates/national_albers_filled_new/albers_cropped/IR_Grav.tif: [19663991]px 11.17% missing
+10921s uncoverml.geoio:INFO /g/data1a/ge3/covariates/national_albers_filled_new/albers_cropped/Potassium_2016.tif: [19656821]px 11.20% missing
+10924s uncoverml.geoio:INFO /g/data1a/ge3/covariates/national_albers_filled_new/albers_cropped/Rad2016K_Th.tif: [22136576]px 0.00% missing
+10928s uncoverml.geoio:INFO /g/data1a/ge3/covariates/national_albers_filled_new/albers_cropped/Rad2016K_UTH.tif: [19656796]px 11.20% missing
+10928s uncoverml.predict:INFO Applying feature transforms
+10928s uncoverml.transforms.transformset:INFO Imputing 8.96% missing data
+10929s uncoverml.transforms.transformset:INFO Imputing 0.00% missing data
+10932s uncoverml.predict:INFO Areas with mask=0 will be predicted
+10932s uncoverml.predict:INFO Loaded 0.8855GB of image data
+10932s uncoverml.predict:INFO Predicting targets for multirandomforest.
+10974s uncoverml.geoio:INFO Writing partition to output file
+11125s uncoverml.scripts.uncoverml:INFO starting to render partition 62

basaks commented 5 years ago

Maybe something like this? https://rasterio.readthedocs.io/en/stable/topics/concurrency.html Can we do an MPI equivalent of the above?
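The general shape of that approach is to split the raster into blocks, process the blocks on a worker pool, and serialise only the write into the shared output. Below is a minimal standard-library sketch of that pattern, not rasterio or uncoverml code: `row_blocks`, `render`, the flat list standing in for the output raster, and the doubling step standing in for prediction are all hypothetical. An MPI version would presumably hand the blocks to ranks instead of threads; that is not shown here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def row_blocks(n_rows, block_rows):
    """Yield (start, stop) row ranges that together cover n_rows."""
    for start in range(0, n_rows, block_rows):
        yield start, min(start + block_rows, n_rows)


def render(values, n_rows, n_cols, block_rows=2, workers=4):
    """Process a row-major raster block by block on a thread pool."""
    out = [None] * (n_rows * n_cols)  # stands in for the output file
    write_lock = threading.Lock()     # only one worker writes at a time

    def process(block):
        start, stop = block
        chunk = values[start * n_cols: stop * n_cols]
        result = [v * 2 for v in chunk]  # placeholder for model prediction
        with write_lock:
            out[start * n_cols: stop * n_cols] = result

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so any worker exception is raised here.
        list(pool.map(process, row_blocks(n_rows, block_rows)))
    return out


if __name__ == "__main__":
    print(render(list(range(12)), n_rows=4, n_cols=3))
```

Note that in this arrangement the write itself is still serialised by the lock, so it only helps when the per-block work outside the lock is the expensive part.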

basaks commented 5 years ago

Turns out this was caused by compression and tiling during the file write. New code changes make compression optional and let the file write be configured via the config file.
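For illustration only, here is one way a config could be turned into GeoTIFF write options. The option names `compress`, `tiled`, `blockxsize` and `blockysize` are standard GDAL GeoTIFF creation options, which rasterio accepts as keyword arguments to `rasterio.open`. The function `gtiff_write_options` and the config keys `compress` and `tile_size` are assumptions made for this sketch and are not uncoverml's actual config schema.

```python
def gtiff_write_options(config):
    """Build GeoTIFF creation options from a hypothetical config dict."""
    options = {"driver": "GTiff"}

    # Hypothetical key: a compression name such as "lzw" or "deflate",
    # or None / missing to write uncompressed.
    compress = config.get("compress")
    if compress:
        options["compress"] = compress

    # Hypothetical key: tile edge length in pixels, or None / missing to
    # write untiled. GDAL expects tile sizes to be multiples of 16.
    tile_size = config.get("tile_size")
    if tile_size:
        options.update(tiled=True, blockxsize=tile_size, blockysize=tile_size)

    return options


if __name__ == "__main__":
    print(gtiff_write_options({}))
    print(gtiff_write_options({"compress": "lzw", "tile_size": 256}))
```

The resulting dict could then be passed on as keyword arguments when opening the output file for writing, alongside the usual width, height, count, dtype, crs and transform.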