SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Saving output to parquet scrambles points #379

Closed SmithB closed 4 months ago

SmithB commented 7 months ago

When I save sliderule output to a local file and enable the 'open_on_complete' option, the dataframe h_mean field is scrambled relative to the geometry field.

import os
import numpy as np
import matplotlib.pyplot as plt
from sliderule import icesat2

region=[{'lat': 69.95536798500007, 'lon': -27.338821302999975}, {'lat': 69.96134097100008, 'lon': -27.41378932599997}, {'lat': 70.00485198100006, 'lon': -27.94445625999998}, {'lat': 70.04489099100005, 'lon': -28.430339370999945}, {'lat': 70.04682093500003, 'lon': -28.437871333999965}, {'lat': 70.05568600200007, 'lon': -28.46451123199995}, {'lat': 70.05699893500008, 'lon': -28.467971424999973}, {'lat': 70.37890593200007, 'lon': -29.07789428799998}, {'lat': 70.38314796600008, 'lon': -29.08488241799995}, {'lat': 70.38548999600005, 'lon': -29.08836141599994}, {'lat': 70.41485498300005, 'lon': -29.109849251999947}, {'lat': 70.41633599600004, 'lon': -29.110683267999946}, {'lat': 70.43761400800008, 'lon': -29.121194383999978}, {'lat': 72.05856300000005, 'lon': -28.842372458999932}, {'lat': 72.05995099800003, 'lon': -28.841072312999984}, {'lat': 72.06101199000005, 'lon': -28.83889329799996}, {'lat': 72.09917392500006, 'lon': -28.661121289999983}, {'lat': 72.10018098500007, 'lon': -28.642110444999958}, {'lat': 71.36114495400005, 'lon': -24.69513343099993}, {'lat': 71.34807493300008, 'lon': -24.635889265999936}, {'lat': 70.85243200600007, 'lon': -22.429544190999934}, {'lat': 70.85201200100005, 'lon': -22.42873307399998}, {'lat': 70.44017048600006, 'lon': -21.66382393799995}, {'lat': 70.15249599600008, 'lon': -22.067161032999934}, {'lat': 70.13031699800007, 'lon': -22.225189091999937}, {'lat': 70.12006296400006, 'lon': -22.29837198299998}, {'lat': 70.10610897400005, 'lon': -22.449150016999965}, {'lat': 70.09590900600006, 'lon': -22.56953808299994}, {'lat': 70.08132897500008, 'lon': -22.758966229999942}, {'lat': 69.95536798500007, 'lon': -27.338821302999975}]

parms = {
    "poly": region,
    "srt": icesat2.SRT_LAND,
    "cnf": icesat2.CNF_SURFACE_LOW,
    "ats": 10.0,
    "cnt": 10,
    "len": 40.0,
    "res": 20.0,
    "maxi": 6,
}
output_dict = {
    "path": os.path.join(os.getcwd(), 'Scoresby.parquet'),
    "format": "parquet",
    "open_on_complete": True,
}

# output to file (comment this out to run without saving to geoparquet)
parms['output'] = output_dict

atl06_sr = icesat2.atl06p(parms)

atl06_sr['longitude'] = np.array(atl06_sr.geometry.x)
atl06_sr['latitude'] = np.array(atl06_sr.geometry.y)

plt.figure()
plt.scatter(atl06_sr['longitude'][::20], atl06_sr['latitude'][::20], 2, c=atl06_sr['h_mean'][::20], vmin=40, vmax=1000)

results with save enabled:

image

results with save disabled (this is what I expect the results to look like):

image

jpswinski commented 7 months ago

@SmithB, I was able to recreate the problem and confirm what you are seeing. Specifically, when the geoparquet option is enabled (which is the default), the geometry column contains duplicated x,y coordinates. When the parquet option is enabled (which is done by setting as_geo to False), then the problem goes away.
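
For anyone hitting this before picking up a fix, here is a rough sketch of the workaround and a quick check for the symptom. It assumes that as_geo goes in the same output dictionary as path and format (the comment above names the flag but not where it is set). duplicate_fraction is a hypothetical helper written for this illustration, not part of the sliderule client.

```python
import os

# Assumption: "as_geo" sits alongside "path" and "format" in the output parameters.
output_dict = {
    "path": os.path.join(os.getcwd(), "Scoresby.parquet"),
    "format": "parquet",
    "as_geo": False,  # request plain parquet instead of geoparquet
    "open_on_complete": True,
}


def duplicate_fraction(lons, lats):
    """Return the fraction of points whose (lon, lat) pair repeats another point."""
    points = list(zip(lons, lats))
    if not points:
        return 0.0
    return 1.0 - len(set(points)) / len(points)
```

On a geoparquet result, calling duplicate_fraction(atl06_sr.geometry.x, atl06_sr.geometry.y) should return a value close to 1 when the geometry column is affected, and a value near 0 otherwise.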

Upon investigation, the issue is due to a bug in the way the latitude and longitude fields are decoded when they are used for the geometry column. The fields were not being appropriately identified as "batch" fields and therefore only the first record in each batch (~256 elevations) was having its latitude and longitude read, and those values were being applied to the rest of the batch.
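
To make the failure mode concrete, here is a toy model of it. This is not the actual sliderule decoding code; decode_batch, the record layout, and the lat_lon_are_batch_fields flag are all hypothetical names chosen for illustration.

```python
def decode_batch(batch, lat_lon_are_batch_fields):
    """Return (lat, lon, h_mean) tuples for one batch of elevation records."""
    rows = []
    for rec in batch:
        # Buggy path: coordinates are taken from the first record of the batch
        # and reused for every other record; h_mean is still read per record.
        src = rec if lat_lon_are_batch_fields else batch[0]
        rows.append((src["lat"], src["lon"], rec["h_mean"]))
    return rows


# A tiny stand-in for one batch (the real batches hold ~256 elevations).
batch = [
    {"lat": 70.0 + 0.001 * i, "lon": -28.0, "h_mean": 100.0 + i}
    for i in range(4)
]

buggy = decode_batch(batch, lat_lon_are_batch_fields=False)
fixed = decode_batch(batch, lat_lon_are_batch_fields=True)

print(len({(lat, lon) for lat, lon, _ in buggy}))  # distinct coordinates, buggy path
print(len({(lat, lon) for lat, lon, _ in fixed}))  # distinct coordinates, fixed path
```

In the buggy path every row in the batch carries the first record's coordinates while h_mean still varies per record, which matches the plot above where elevations look scrambled relative to the geometry.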

This bug has been fixed in commit e94161ae. The code now correctly reads the latitude and longitude of every record in a batch.