Esri / arcgis-python-api

Documentation and samples for ArcGIS API for Python
https://developers.arcgis.com/python/
Apache License 2.0
1.9k stars 1.1k forks source link

Enhanced Logging when Row cannot be added to Spatially Enabled Dataframe #1494

Closed FeralCatColonist closed 1 year ago

FeralCatColonist commented 1 year ago

Is your feature request related to a problem? Please describe. When ingesting records into a spatially enabled dataframe from another source, a warning occurs stating:

Could not insert the row because of the error message: value #XX - unsupported type: NAType. Recheck your data.

This warning/error would be significantly more helpful if it stated the record row number and the column/field name or index. Is the value number supposed to be the column? I don't know and the end user probably doesn't either.

image

sedf = arcgis.features.GeoAccessor.from_featureclass(featureclass)

Describe the solution you'd like Enrich the error message to include:

nanaeaubry commented 1 year ago

This error doesn't seem to come from our end but we will investigate. Thanks for bringing it up

nanaeaubry commented 1 year ago

@FeralCatColonist Do you have arcpy in your environment?

Can you also provide sample data for this so we can test out?

FeralCatColonist commented 1 year ago

@nanaeaubry, yes this environment is using ArcPy in addition to the ArcGIS API for Python. I can get you that data later this evening, what's a good method for sending?

nanaeaubry commented 1 year ago

@FeralCatColonist Can you drop it here? or make a dropbox for it? Otherwise we can figure something out.

We just need a subset or some data that shows this behavior if you can't post it all

FeralCatColonist commented 1 year ago

@nanaeaubry, I was just digging into it and the issue was with to_featureclass() and not from_featureclass(); I didn't have enough signposting in my script previously and I guessed wrong on what function was to blame. It appears to be the same issue as described here: https://community.esri.com/t5/arcgis-api-for-python-ideas/support-a-pd-na-to-np-nan-conversion-inside-the/idi-p/1266352/

I'm running a series of data hygiene operations against the data and I've introduced some NA types as part of that process. Here's an export to parquet from that intermediate process-- --I'm unsure if this will preserve the error or not. dirty_addresses.zip

nanaeaubry commented 1 year ago

Ok I dug a bit into it and the error message is occurring from the arcpy call being made in the method. So I will pass the message onto them but if you want tracking on progress you have to go through support unfortunately since we don't track that here.

nanaeaubry commented 1 year ago

@FeralCatColonist I have alerted the arcpy team to this and they are aware but they suggest you post in their community page here: https://community.esri.com/t5/arcgis-pro-ideas/idb-p/arcgis-pro-ideas/label-name/python

Thanks!

For reference here is the arcpy call we are making: arcpy.CreateFeatureclass_management

FeralCatColonist commented 1 year ago

@nanaeaubry so the ArcPy would be the one to handle that NA conversion in the resources that the to_featureclass() process is calling?

FeralCatColonist commented 1 year ago

All done, thank you so much @nanaeaubry https://community.esri.com/t5/arcgis-pro-ideas/arcpy-to-handle-na-types-when-utilizing-to/idi-p/1269036

nanaeaubry commented 1 year ago

@nanaeaubry so the ArcPy would be the one to handle that NA conversion in the resources that the to_featureclass() process is calling?

Ah so you want us to fix the conversion of none types as well then? I thought this was an enhancement for the error message only. I will talk to our team and see then

achapkowski commented 1 year ago

Can you post a raw rows of your data?

achapkowski commented 1 year ago

What version of the API are you using and what is your export format?

When I do this: sdf.spatial.to_featureclass(r"c:\temp\scratch.gdb\testaddress")

I do not get an issue with the latest release. All 347,087 records are written out.

FeralCatColonist commented 1 year ago

@achapkowski, I'm assuming that the problematic rows were dropped upon conversion to_parquet(). How would you recommend I post the raw rows? Python API Version: 2.1.0.2

achapkowski commented 1 year ago

Zip the raw data and send it in the issue or try and post an example with problem data.

hannesatwork commented 1 year ago

@nanaeaubry can we now use Arrow table to accomplish this? You have tbl = <sedf>.spatial.to_arrow(), then use arcpy.management.CopyFeatures(tbl, ...) to create the featureclass. Should be a little faster too.

FeralCatColonist commented 1 year ago

Thanks for being so patient everyone; here are all the records that cause errors in the transformation process. @achapkowski the total 347,087 is correct for the raw data, you can connect the error-inducing records with the unique field ["Site_NGUID"] Here is a listing in excel and parquet: github.zip

Further, I've isolated the intermediary step to these two functions responsible for producing the NA Type:

def addresses_add_fields(sedf):
    """This subroutine adds fields from scratch or copies based on existing fields"""
    sedf["Addtl_Loc1"] = ""
    sedf["Building"] = ""
    sedf["Inc_Muni"] = ""
    sedf["NCT_Type"] = ""
    sedf["Place_Type"] = ""
    sedf["Room"] = ""
    sedf["Uninc_Comm"] = ""
    sedf["Unit"] = ""
    sedf["DiscrpAgID"] = "DENCOAREA911.DST.TX.US"
    sedf["Nbrhd_Comm"] = sedf["Alias1"].str[:100]

    return sedf

def addresses_data_hygiene(sedf):
    """This subroutine performs simple data hygiene operations"""
    sedf["Addtl_Loc2"] = (
        sedf["Alias1"]
        .str.cat(sedf[["Alias2", "Alias3", "Alias4", "Alias5"]], sep=" ", na_rep="",)
        .str.strip()
        .str[:255]
    )
    sedf["Building"].mask(
        cond=sedf["ALTUNITTYPE"] == "BLDG", other=sedf["ALTUNITID"], inplace=True,
    )
    sedf["Building"].mask(
        cond=sedf["UNITTYPE"] == "BLDG", other=sedf["UNITID"], inplace=True,
    )
    sedf["GPSX"].mask(
        cond=(sedf["GPSY"] == 0) | (sedf["GPSY"] > 40), other=0, inplace=True,
    )
    sedf["GPSX"].mask(cond=sedf["GPSX"] > 0, other=0, inplace=True)
    sedf["GPSY"].mask(cond=sedf["GPSY"] > 40, other=0, inplace=True)
    sedf["Inc_Muni"].where(
        cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
        other=sedf["MUNICIPALITY"],
        inplace=True,
    )
    sedf["Room"].mask(
        cond=sedf["UNITTYPE"] == "RM", other=sedf["UNITID"], inplace=True,
    )
    sedf["Uninc_Comm"].mask(
        cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
        other=sedf["MUNICIPALITY"],
        inplace=True,
    )
    sedf["Unit"].where(
        cond=(sedf["UNITTYPE"] == "BLDG") | (sedf["UNITTYPE"] == "RM"),
        other=sedf["UNITID"],
        inplace=True,
    )
    return sedf
achapkowski commented 1 year ago

So this was a data cleaning issue.

  1. The first problem was the Unit column. Pandas could not infer the data type.
  2. The second problem was all the null values. This is not uncommon, but as a best practice, we try not to mutate/change data that a user provides. In your case it might be helpful, but downstream workflows it could cause problems if dtypes, column names, etc... change randomly.... No one wants that.

Below is some code to get it to work:

import sys

import json
import pandas as pd

import numpy as np
from arcgis.gis import GIS

def addresses_add_fields(sedf):
    """This subroutine adds fields from scratch or copies based on existing fields"""
    sedf["Addtl_Loc1"] = ""
    sedf["Building"] = ""
    sedf["Inc_Muni"] = ""
    sedf["NCT_Type"] = ""
    sedf["Place_Type"] = ""
    sedf["Room"] = ""
    sedf["Uninc_Comm"] = ""
    sedf["Unit"] = ""
    sedf["DiscrpAgID"] = "DENCOAREA911.DST.TX.US"
    sedf["Nbrhd_Comm"] = sedf["Alias1"].str[:100]

    return sedf

def addresses_data_hygiene(sedf):
    """This subroutine performs simple data hygiene operations"""
    sedf["Addtl_Loc2"] = (
        sedf["Alias1"]
        .str.cat(
            sedf[["Alias2", "Alias3", "Alias4", "Alias5"]],
            sep=" ",
            na_rep="",
        )
        .str.strip()
        .str[:255]
    )
    sedf["Building"].mask(
        cond=sedf["ALTUNITTYPE"] == "BLDG",
        other=sedf["ALTUNITID"],
        inplace=True,
    )
    sedf["Building"].mask(
        cond=sedf["UNITTYPE"] == "BLDG",
        other=sedf["UNITID"],
        inplace=True,
    )
    sedf["GPSX"].mask(
        cond=(sedf["GPSY"] == 0) | (sedf["GPSY"] > 40),
        other=0,
        inplace=True,
    )
    sedf["GPSX"].mask(cond=sedf["GPSX"] > 0, other=0, inplace=True)
    sedf["GPSY"].mask(cond=sedf["GPSY"] > 40, other=0, inplace=True)
    sedf["Inc_Muni"].where(
        cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
        other=sedf["MUNICIPALITY"],
        inplace=True,
    )
    sedf["Room"].mask(
        cond=sedf["UNITTYPE"] == "RM",
        other=sedf["UNITID"],
        inplace=True,
    )
    sedf["Uninc_Comm"].mask(
        cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
        other=sedf["MUNICIPALITY"],
        inplace=True,
    )
    sedf["Unit"].where(
        cond=(sedf["UNITTYPE"] == "BLDG") | (sedf["UNITTYPE"] == "RM"),
        other=sedf["UNITID"],
        inplace=True,
    )
    return sedf

df = pd.read_excel(r"c:\temp\user_data\github.xlsx")
del df['Unnamed: 0'] # column to_excel creates for the index, so drop it. 
df.SHAPE = df.SHAPE.str.replace("'", "\"").apply(json.loads) # convert the string to a dict
df.spatial.set_geometry("SHAPE") # set the Geometry Column

# Run your process
df = addresses_add_fields(df)
df = addresses_data_hygiene(df)
# Clean Up Your Data.
df['Unit'] = '' # this field was causing issues because there is no data to infer, so pandas can't figure it out. 
df = df.convert_dtypes()
df = df.replace(
    {
        pd.NA: None,
        np.NaN: None,
        np.nan: None,
        np.NAN: None,
        pd.NaT: None,
    }
)
# export to feature class. 
spatial = df.spatial
fc = spatial.to_featureclass(r"c:\temp\scratch.gdb\github_xlsx")
print(fc)
FeralCatColonist commented 1 year ago

So, I think this gets to the core of the argument.

If the destination format (feature class) cannot handle an NA type but it can handle a null; what's the harm in converting that? I get that from a programming perspective they are discrete units but likely that distinction doesn't matter to a large population of the user base, i.e. people shoving data into a feature class.

df = df.replace(
    {
        pd.NA: None,
        np.NaN: None,
        np.nan: None,
        np.NAN: None,
        pd.NaT: None,
    }
)
achapkowski commented 1 year ago

The danger is that we make assumptions about your data that are incorrect and downstream we cause harm somewhere else. In your case it might not matter, but let's say you are doing calculations about pollution measurements to detect cancer causing agents. If we get the replacement values incorrect or make generalized assumptions, this can have a large impact on the results.

FeralCatColonist commented 1 year ago

@achapkowski, it seems that having a data sanitize parameter would empower users to solve both of those use-cases.

achapkowski commented 1 year ago

yes, I get what you are saying, but it would be general enough to include. Pandas includes all these tools, so it doesn't make sense to double up on everything.

FeralCatColonist commented 1 year ago

I really think we've come full circle on this; what does the user expect when they push a spatially enabled dataframe to a file-based format: to_featureclass(), to_arrow, to_parquet, etc.?

Now, ideally, should the user know better? Sure. But without additional documentation or error handling, how would the user know:

I think it's easy to say, well, be a better programmer. This is fairly analogous to Recheck your data. But the reality of your user base here is, by and large, not going to be computer science types that necessarily appreciate the difference between different types of nulls or who will have an expert understanding of dataframes generally. They're going to be self-taught, or otherwise lightly trained, and come out of the button-clicker, turned ArcPy user, to API for Python user.

In my opinion, this is a fairly compelling user story to add a few lines of error handling that has shown up in the Esri Ideas forums independently twice. This is a change that lowers the bar to entry, creates appropriate guardrails, and allows users to "stand on the shoulders of giants"-- --which is arguably the ethos of programming. We attempt to spare others the labor of re-solving problems that are well-understood by experienced programmers. People play in the Esri-verse because they just need things to work. Insisting that they become intimately familiar with a wide variety of pandas operations and fully understand the limitations of a variety of destination file types isn't congruent with the ease of use pattern.

nanaeaubry commented 1 year ago

@FeralCatColonist Sorry for taking so long to get back to you. Thank you for your thoughtful comment, we have taken this into consideration for further code enhancements. Your feedback has helped us a lot.

We put in some code for type conversions and this will be reflected at our next release. Please continue to let us know if you see other issues or have other enhancement requests that we can consider.