CartoDB / raster-loader

https://raster-loader.readthedocs.io

[BUG] Performance regression #74

Closed (francois-baptiste closed this issue 1 year ago)

francois-baptiste commented 1 year ago

Bug Description

Importing this file into BigQuery took 55 minutes using this code:

wget http://www.cec.org/files/atlas_layers/0_reference/0_03_elevation/elevation_tif.zip
unzip elevation_tif.zip
carto bigquery upload \
    --file_path Elevation_TIF/NA_Elevation/data/na_elevation.tif \
    --project cartodb-data-engineering-team \
    --dataset jgoizueta_tmp \
    --table dem \
    --overwrite

while the following Python script finishes in 4 minutes:

import pandas as pd
import pyproj
import rasterio
import rio_cogeo
from google.cloud import bigquery

client = bigquery.Client()

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def calculate_coordinate(pyproj_transformer, dataset_transform, row, col):
    """Convert a pixel (row, col) to coordinates in the destination CRS."""
    return pyproj_transformer.transform(*rasterio.transform.xy(dataset_transform, row, col))

jobs = []

geotiff_path = "D:/data/na_elevation.tif"

raster_info = rio_cogeo.cog_info(geotiff_path).dict()
max_zoom = raster_info["GEO"]["MaxZoom"]

dst_crs = pyproj.CRS.from_epsg(4326)

with rasterio.open(geotiff_path) as dataset:
    print(dataset.transform)

    src_crs = pyproj.CRS.from_wkt(dataset.crs.to_wkt())
    transformer = pyproj.Transformer.from_crs(src_crs, dst_crs)  # to compute lat and lon

    mylist = list(dataset.block_windows())

    for window_chunk in chunks(mylist, 100):  # tune the number of elements per chunk depending on your RAM
        mydf = pd.DataFrame(
            [(*calculate_coordinate(transformer, dataset.transform, window.row_off, window.col_off),
              *calculate_coordinate(transformer, dataset.transform, window.row_off, window.col_off + window.width),
              *calculate_coordinate(transformer, dataset.transform, window.row_off + window.height, window.col_off + window.width),
              *calculate_coordinate(transformer, dataset.transform, window.row_off + window.height, window.col_off),
              row_off, col_off,
              window.height, window.width,
              dataset.read(1, window=window).tobytes())
             for (row_off, col_off), window in window_chunk],
            columns=['lat_NW', 'lon_NW', 'lat_NE', 'lon_NE', 'lat_SE', 'lon_SE',
                     'lat_SW', 'lon_SW', 'block_height_idx', 'block_width_idx',
                     'block_height', 'block_width', 'band1_float32'])

        # wait for the previously submitted load job, if any, before starting the next one
        try:
            print(jobs.pop().result())
        except IndexError:
            pass

        print(mydf.size)  # progress indicator
        jobs.append(client.load_table_from_dataframe(
            mydf, 'cartodb-gcp-backend-data-team.fbaptiste.na_elevation'))

print(jobs.pop().result())

System information (output of carto info):

Raster Loader version: 0.1.1.dev8+gd70e367.d20230117
Python version: 3.10.6
Platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
System version: Linux 5.15.79.1-microsoft-standard-WSL2
Machine: x86_64
Processor: x86_64
Architecture: 64bit
francois-baptiste commented 1 year ago

I found several anti-patterns where imports are done inside a loop, which considerably affects the upload time: https://github.com/CartoDB/raster-loader/blob/fc88b407299e008c5c80b8d9c4dbdb8351af6ca7/raster_loader/io.py#L95

These issues are fixed in PR #72, in this commit: https://github.com/CartoDB/raster-loader/pull/72/commits/11f05267b2cd72f32f2d629d4211ca24c7b83037
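
For illustration, a minimal sketch of that kind of fix (hypothetical function names, not the actual io.py code): move the import out of the per-block loop so the import machinery runs once instead of on every iteration.

# Anti-pattern: the import statement is re-executed on every iteration.
def blocks_to_bytes_slow(blocks):
    out = []
    for block in blocks:
        import numpy as np  # hypothetical example of an import inside a loop
        out.append(np.asarray(block).tobytes())
    return out

# Fix: import once at module level; the loop body stays lightweight.
import numpy as np

def blocks_to_bytes_fast(blocks):
    out = []
    for block in blocks:
        out.append(np.asarray(block).tobytes())
    return out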

brendancol commented 1 year ago

@francois-baptiste Happy to help with this one. Feel free to assign me issues and I can help triage.

francois-baptiste commented 1 year ago

Thank you @brendancol. I found another anti-pattern in the code: the DataFrame to be uploaded to BigQuery is built from a list of dicts. That is much slower than building it from tuples and a column list, as I did in the original script. Can you fix this one by forking the quadbin branch?
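
For reference, a minimal sketch of the two construction styles (illustrative column names only, not the ones used in io.py):

import pandas as pd

rows = [(40.0 + i, -3.0 + i, b"\x00" * 8) for i in range(100_000)]

# Slower: one dict per row, so pandas has to inspect the keys of every record.
df_from_dicts = pd.DataFrame(
    [{"lat": lat, "lon": lon, "band1": data} for lat, lon, data in rows])

# Faster: plain tuples plus an explicit column list, as in the script above.
df_from_tuples = pd.DataFrame(rows, columns=["lat", "lon", "band1"])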

francois-baptiste commented 1 year ago

The issue was not where I thought it was. Creating a pyproj.Transformer in each loop iteration was the problem. Now we are on par with the performance of my original script 🚀
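
For context, a minimal sketch of that fix (hypothetical helpers, not the actual io.py code): build the pyproj.Transformer once and reuse it, instead of constructing it on every call.

import pyproj

src_crs = pyproj.CRS.from_epsg(3857)
dst_crs = pyproj.CRS.from_epsg(4326)

# Slow: a new Transformer (and its underlying PROJ pipeline) is built on every call.
def to_latlon_slow(x, y):
    transformer = pyproj.Transformer.from_crs(src_crs, dst_crs)
    return transformer.transform(x, y)

# Fast: create the Transformer once and reuse it for every block.
transformer = pyproj.Transformer.from_crs(src_crs, dst_crs)

def to_latlon_fast(x, y):
    return transformer.transform(x, y)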