Enhanced Logging when Row cannot be added to Spatially Enabled Dataframe

FeralCatColonist commented 1 year ago

Is your feature request related to a problem? Please describe. When ingesting records into a spatially enabled dataframe from another source, a warning occurs stating:

Could not insert the row because of the error message: value #XX - unsupported type: NAType. Recheck your data.

This warning/error would be significantly more helpful if it stated the record row number and the column/field name or index. Is the value number supposed to be the column? I don't know and the end user probably doesn't either.

sedf = arcgis.features.GeoAccessor.from_featureclass(featureclass)

Describe the solution you'd like Enrich the error message to include:

the record row's number that could not be added to the spatially enabled dataframe
- especially an ObjectID if coming from an enterprise geodatabase or other Esri-specific format.
the field or column name and/or index number to alert the user to the location of the error.

nanaeaubry commented 1 year ago

This error doesn't seem to come from our end but we will investigate. Thanks for bringing it up

nanaeaubry commented 1 year ago

@FeralCatColonist Do you have arcpy in your environment?

Can you also provide sample data for this so we can test out?

FeralCatColonist commented 1 year ago

@nanaeaubry, yes this environment is using ArcPy in addition to the ArcGIS API for Python. I can get you that data later this evening, what's a good method for sending?

nanaeaubry commented 1 year ago

@FeralCatColonist Can you drop it here? or make a dropbox for it? Otherwise we can figure something out.

We just need a subset or some data that shows this behavior if you can't post it all

FeralCatColonist commented 1 year ago

@nanaeaubry, I was just digging into it and the issue was with to_featureclass() and not from_featureclass(); I didn't have enough signposting in my script previously and I guessed wrong on what function was to blame. It appears to be the same issue as described here: https://community.esri.com/t5/arcgis-api-for-python-ideas/support-a-pd-na-to-np-nan-conversion-inside-the/idi-p/1266352/

I'm running a series of data hygiene operations against the data and I've introduced some NA types as part of that process. Here's an export to parquet from that intermediate process-- --I'm unsure if this will preserve the error or not. dirty_addresses.zip

nanaeaubry commented 1 year ago

Ok I dug a bit into it and the error message is occurring from the arcpy call being made in the method. So I will pass the message onto them but if you want tracking on progress you have to go through support unfortunately since we don't track that here.

nanaeaubry commented 1 year ago

@FeralCatColonist I have alerted the arcpy team to this and they are aware but they suggest you post in their community page here: https://community.esri.com/t5/arcgis-pro-ideas/idb-p/arcgis-pro-ideas/label-name/python

Thanks!

For reference here is the arcpy call we are making: arcpy.CreateFeatureclass_management

FeralCatColonist commented 1 year ago

@nanaeaubry so the ArcPy would be the one to handle that NA conversion in the resources that the to_featureclass() process is calling?

FeralCatColonist commented 1 year ago

All done, thank you so much @nanaeaubry https://community.esri.com/t5/arcgis-pro-ideas/arcpy-to-handle-na-types-when-utilizing-to/idi-p/1269036

nanaeaubry commented 1 year ago

@nanaeaubry so the ArcPy would be the one to handle that NA conversion in the resources that the to_featureclass() process is calling?

Ah so you want us to fix the conversion of none types as well then? I thought this was an enhancement for the error message only. I will talk to our team and see then

achapkowski commented 1 year ago

Can you post a raw rows of your data?

achapkowski commented 1 year ago

What version of the API are you using and what is your export format?

When I do this: sdf.spatial.to_featureclass(r"c:\temp\scratch.gdb\testaddress")

I do not get an issue with the latest release. All 347,087 records are written out.

FeralCatColonist commented 1 year ago

@achapkowski, I'm assuming that the problematic rows were dropped upon conversion to_parquet(). How would you recommend I post the raw rows? Python API Version: 2.1.0.2

achapkowski commented 1 year ago

Zip the raw data and send it in the issue or try and post an example with problem data.

hannesatwork commented 1 year ago

@nanaeaubry can we now use Arrow table to accomplish this? You have tbl = <sedf>.spatial.to_arrow(), then use arcpy.management.CopyFeatures(tbl, ...) to create the featureclass. Should be a little faster too.

FeralCatColonist commented 1 year ago

Thanks for being so patient everyone; here are all the records that cause errors in the transformation process. @achapkowski the total 347,087 is correct for the raw data, you can connect the error-inducing records with the unique field ["Site_NGUID"] Here is a listing in excel and parquet: github.zip

Further, I've isolated the intermediary step to these two functions responsible for producing the NA Type:

def addresses_add_fields(sedf):
    """This subroutine adds fields from scratch or copies based on existing fields"""
    sedf["Addtl_Loc1"] = ""
    sedf["Building"] = ""
    sedf["Inc_Muni"] = ""
    sedf["NCT_Type"] = ""
    sedf["Place_Type"] = ""
    sedf["Room"] = ""
    sedf["Uninc_Comm"] = ""
    sedf["Unit"] = ""
    sedf["DiscrpAgID"] = "DENCOAREA911.DST.TX.US"
    sedf["Nbrhd_Comm"] = sedf["Alias1"].str[:100]

    return sedf

def addresses_data_hygiene(sedf):
    """This subroutine performs simple data hygiene operations"""
    sedf["Addtl_Loc2"] = (
        sedf["Alias1"]
        .str.cat(sedf[["Alias2", "Alias3", "Alias4", "Alias5"]], sep=" ", na_rep="",)
        .str.strip()
        .str[:255]
    )
    sedf["Building"].mask(
        cond=sedf["ALTUNITTYPE"] == "BLDG", other=sedf["ALTUNITID"], inplace=True,
    )
    sedf["Building"].mask(
        cond=sedf["UNITTYPE"] == "BLDG", other=sedf["UNITID"], inplace=True,
    )
    sedf["GPSX"].mask(
        cond=(sedf["GPSY"] == 0) | (sedf["GPSY"] > 40), other=0, inplace=True,
    )
    sedf["GPSX"].mask(cond=sedf["GPSX"] > 0, other=0, inplace=True)
    sedf["GPSY"].mask(cond=sedf["GPSY"] > 40, other=0, inplace=True)
    sedf["Inc_Muni"].where(
        cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
        other=sedf["MUNICIPALITY"],
        inplace=True,
    )
    sedf["Room"].mask(
        cond=sedf["UNITTYPE"] == "RM", other=sedf["UNITID"], inplace=True,
    )
    sedf["Uninc_Comm"].mask(
        cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
        other=sedf["MUNICIPALITY"],
        inplace=True,
    )
    sedf["Unit"].where(
        cond=(sedf["UNITTYPE"] == "BLDG") | (sedf["UNITTYPE"] == "RM"),
        other=sedf["UNITID"],
        inplace=True,
    )
    return sedf

achapkowski commented 1 year ago

So this was a data cleaning issue.

The first problem was the Unit column. Pandas could not infer the data type.
The second problem was all the null values. This is not uncommon, but as a best practice, we try not to mutate/change data that a user provides. In your case it might be helpful, but downstream workflows it could cause problems if dtypes, column names, etc... change randomly.... No one wants that.

Below is some code to get it to work:

import sys

import json
import pandas as pd

import numpy as np
from arcgis.gis import GIS

def addresses_add_fields(sedf):
    """This subroutine adds fields from scratch or copies based on existing fields"""
    sedf["Addtl_Loc1"] = ""
    sedf["Building"] = ""
    sedf["Inc_Muni"] = ""
    sedf["NCT_Type"] = ""
    sedf["Place_Type"] = ""
    sedf["Room"] = ""
    sedf["Uninc_Comm"] = ""
    sedf["Unit"] = ""
    sedf["DiscrpAgID"] = "DENCOAREA911.DST.TX.US"
    sedf["Nbrhd_Comm"] = sedf["Alias1"].str[:100]

    return sedf

def addresses_data_hygiene(sedf):
    """This subroutine performs simple data hygiene operations"""
    sedf["Addtl_Loc2"] = (
        sedf["Alias1"]
        .str.cat(
            sedf[["Alias2", "Alias3", "Alias4", "Alias5"]],
            sep=" ",
            na_rep="",
        )
        .str.strip()
        .str[:255]
    )
    sedf["Building"].mask(
        cond=sedf["ALTUNITTYPE"] == "BLDG",
        other=sedf["ALTUNITID"],
        inplace=True,
    )
    sedf["Building"].mask(
        cond=sedf["UNITTYPE"] == "BLDG",
        other=sedf["UNITID"],
        inplace=True,
    )
    sedf["GPSX"].mask(
        cond=(sedf["GPSY"] == 0) | (sedf["GPSY"] > 40),
        other=0,
        inplace=True,
    )
    sedf["GPSX"].mask(cond=sedf["GPSX"] > 0, other=0, inplace=True)
    sedf["GPSY"].mask(cond=sedf["GPSY"] > 40, other=0, inplace=True)
    sedf["Inc_Muni"].where(
        cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
        other=sedf["MUNICIPALITY"],
        inplace=True,
    )
    sedf["Room"].mask(
        cond=sedf["UNITTYPE"] == "RM",
        other=sedf["UNITID"],
        inplace=True,
    )
    sedf["Uninc_Comm"].mask(
        cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
        other=sedf["MUNICIPALITY"],
        inplace=True,
    )
    sedf["Unit"].where(
        cond=(sedf["UNITTYPE"] == "BLDG") | (sedf["UNITTYPE"] == "RM"),
        other=sedf["UNITID"],
        inplace=True,
    )
    return sedf

df = pd.read_excel(r"c:\temp\user_data\github.xlsx")
del df['Unnamed: 0'] # column to_excel creates for the index, so drop it. 
df.SHAPE = df.SHAPE.str.replace("'", "\"").apply(json.loads) # convert the string to a dict
df.spatial.set_geometry("SHAPE") # set the Geometry Column

# Run your process
df = addresses_add_fields(df)
df = addresses_data_hygiene(df)
# Clean Up Your Data.
df['Unit'] = '' # this field was causing issues because there is no data to infer, so pandas can't figure it out. 
df = df.convert_dtypes()
df = df.replace(
    {
        pd.NA: None,
        np.NaN: None,
        np.nan: None,
        np.NAN: None,
        pd.NaT: None,
    }
)
# export to feature class. 
spatial = df.spatial
fc = spatial.to_featureclass(r"c:\temp\scratch.gdb\github_xlsx")
print(fc)

FeralCatColonist commented 1 year ago

So, I think this gets to the core of the argument.

Is it better to drop records completely or to swap out a null-ish type for another null-ish type?

If the destination format (feature class) cannot handle an NA type but it can handle a null; what's the harm in converting that? I get that from a programming perspective they are discrete units but likely that distinction doesn't matter to a large population of the user base, i.e. people shoving data into a feature class.

What about exposing an option to sanitize those values similar to the sanitize_columns parameter? This would be a convenience function for a segment of the user base that are likely self-taught analysts turned programmer.

df = df.replace(
    {
        pd.NA: None,
        np.NaN: None,
        np.nan: None,
        np.NAN: None,
        pd.NaT: None,
    }
)

achapkowski commented 1 year ago

The danger is that we make assumptions about your data that are incorrect and downstream we cause harm somewhere else. In your case it might not matter, but let's say you are doing calculations about pollution measurements to detect cancer causing agents. If we get the replacement values incorrect or make generalized assumptions, this can have a large impact on the results.

FeralCatColonist commented 1 year ago

@achapkowski, it seems that having a data sanitize parameter would empower users to solve both of those use-cases.

achapkowski commented 1 year ago

yes, I get what you are saying, but it would be general enough to include. Pandas includes all these tools, so it doesn't make sense to double up on everything.

FeralCatColonist commented 1 year ago

I really think we've come full circle on this; what does the user expect when they push a spatially enabled dataframe to a file-based format: to_featureclass(), to_arrow, to_parquet, etc.?

I would contend that the user does not expect these functions to drop rows containing data. I certainly did not expect that.
- if this is the API's preferred default method of handling data that it does not expect, then the error reporting should be better
- in your example above, I think that an analysis would be equally affected if certain values were dropped which would also skew the results and have unintended downstream consequences
- when I first developed this data hygiene process, that data didn't contain errors that would trigger that warning, so this issue remained unnoticed as it had already been put into production and automated
there is precedent already in these functions for managing the user's data proactively to avoid limitations in the destination file type:
- sanitizing NA types would be congruent with also managing user data in response to the destination file type limitations

Now, ideally, should the user know better? Sure. But without additional documentation or error handling, how would the user know:

the term value actually means field/column
which row to check for their bad data
resolving NA types-- --or even that NA type is a limitation
the preference is to drop data that doesn't fit

I think it's easy to say, well, be a better programmer. This is fairly analogous to Recheck your data. But the reality of your user base here is, by and large, not going to be computer science types that necessarily appreciate the difference between different types of nulls or who will have an expert understanding of dataframes generally. They're going to be self-taught, or otherwise lightly trained, and come out of the button-clicker, turned ArcPy user, to API for Python user.

In my opinion, this is a fairly compelling user story to add a few lines of error handling that has shown up in the Esri Ideas forums independently twice. This is a change that lowers the bar to entry, creates appropriate guardrails, and allows users to "stand on the shoulders of giants"-- --which is arguably the ethos of programming. We attempt to spare others the labor of re-solving problems that are well-understood by experienced programmers. People play in the Esri-verse because they just need things to work. Insisting that they become intimately familiar with a wide variety of pandas operations and fully understand the limitations of a variety of destination file types isn't congruent with the ease of use pattern.

nanaeaubry commented 1 year ago

@FeralCatColonist Sorry for taking so long to get back to you. Thank you for your thoughtful comment, we have taken this into consideration for further code enhancements. Your feedback has helped us a lot.

We put in some code for type conversions and this will be reflected at our next release. Please continue to let us know if you see other issues or have other enhancement requests that we can consider.

Esri / arcgis-python-api

Enhanced Logging when Row cannot be added to Spatially Enabled Dataframe #1494