Closed FeralCatColonist closed 1 year ago
This error doesn't seem to come from our end but we will investigate. Thanks for bringing it up
@FeralCatColonist Do you have arcpy in your environment?
Can you also provide sample data for this so we can test out?
@nanaeaubry, yes this environment is using ArcPy in addition to the ArcGIS API for Python. I can get you that data later this evening, what's a good method for sending?
@FeralCatColonist Can you drop it here? or make a dropbox for it? Otherwise we can figure something out.
We just need a subset or some data that shows this behavior if you can't post it all
@nanaeaubry, I was just digging into it and the issue was with to_featureclass()
and not from_featureclass()
; I didn't have enough signposting in my script previously and I guessed wrong on what function was to blame.
It appears to be the same issue as described here:
https://community.esri.com/t5/arcgis-api-for-python-ideas/support-a-pd-na-to-np-nan-conversion-inside-the/idi-p/1266352/
I'm running a series of data hygiene operations against the data and I've introduced some NA types as part of that process. Here's an export to parquet from that intermediate process-- --I'm unsure if this will preserve the error or not. dirty_addresses.zip
Ok I dug a bit into it and the error message is occurring from the arcpy call being made in the method. So I will pass the message onto them but if you want tracking on progress you have to go through support unfortunately since we don't track that here.
@FeralCatColonist I have alerted the arcpy team to this and they are aware but they suggest you post in their community page here: https://community.esri.com/t5/arcgis-pro-ideas/idb-p/arcgis-pro-ideas/label-name/python
Thanks!
For reference here is the arcpy call we are making: arcpy.CreateFeatureclass_management
@nanaeaubry so the ArcPy would be the one to handle that NA conversion in the resources that the to_featureclass() process is calling?
All done, thank you so much @nanaeaubry https://community.esri.com/t5/arcgis-pro-ideas/arcpy-to-handle-na-types-when-utilizing-to/idi-p/1269036
@nanaeaubry so the ArcPy would be the one to handle that NA conversion in the resources that the to_featureclass() process is calling?
Ah so you want us to fix the conversion of none types as well then? I thought this was an enhancement for the error message only. I will talk to our team and see then
Can you post a raw rows of your data?
What version of the API are you using and what is your export format?
When I do this:
sdf.spatial.to_featureclass(r"c:\temp\scratch.gdb\testaddress")
I do not get an issue with the latest release. All 347,087 records are written out.
@achapkowski, I'm assuming that the problematic rows were dropped upon conversion to_parquet(). How would you recommend I post the raw rows? Python API Version: 2.1.0.2
Zip the raw data and send it in the issue or try and post an example with problem data.
@nanaeaubry can we now use Arrow table to accomplish this? You have tbl = <sedf>.spatial.to_arrow()
, then use arcpy.management.CopyFeatures(tbl, ...)
to create the featureclass. Should be a little faster too.
Thanks for being so patient everyone; here are all the records that cause errors in the transformation process. @achapkowski the total 347,087 is correct for the raw data, you can connect the error-inducing records with the unique field ["Site_NGUID"]
Here is a listing in excel and parquet: github.zip
Further, I've isolated the intermediary step to these two functions responsible for producing the NA Type:
def addresses_add_fields(sedf):
"""This subroutine adds fields from scratch or copies based on existing fields"""
sedf["Addtl_Loc1"] = ""
sedf["Building"] = ""
sedf["Inc_Muni"] = ""
sedf["NCT_Type"] = ""
sedf["Place_Type"] = ""
sedf["Room"] = ""
sedf["Uninc_Comm"] = ""
sedf["Unit"] = ""
sedf["DiscrpAgID"] = "DENCOAREA911.DST.TX.US"
sedf["Nbrhd_Comm"] = sedf["Alias1"].str[:100]
return sedf
def addresses_data_hygiene(sedf):
"""This subroutine performs simple data hygiene operations"""
sedf["Addtl_Loc2"] = (
sedf["Alias1"]
.str.cat(sedf[["Alias2", "Alias3", "Alias4", "Alias5"]], sep=" ", na_rep="",)
.str.strip()
.str[:255]
)
sedf["Building"].mask(
cond=sedf["ALTUNITTYPE"] == "BLDG", other=sedf["ALTUNITID"], inplace=True,
)
sedf["Building"].mask(
cond=sedf["UNITTYPE"] == "BLDG", other=sedf["UNITID"], inplace=True,
)
sedf["GPSX"].mask(
cond=(sedf["GPSY"] == 0) | (sedf["GPSY"] > 40), other=0, inplace=True,
)
sedf["GPSX"].mask(cond=sedf["GPSX"] > 0, other=0, inplace=True)
sedf["GPSY"].mask(cond=sedf["GPSY"] > 40, other=0, inplace=True)
sedf["Inc_Muni"].where(
cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
other=sedf["MUNICIPALITY"],
inplace=True,
)
sedf["Room"].mask(
cond=sedf["UNITTYPE"] == "RM", other=sedf["UNITID"], inplace=True,
)
sedf["Uninc_Comm"].mask(
cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
other=sedf["MUNICIPALITY"],
inplace=True,
)
sedf["Unit"].where(
cond=(sedf["UNITTYPE"] == "BLDG") | (sedf["UNITTYPE"] == "RM"),
other=sedf["UNITID"],
inplace=True,
)
return sedf
So this was a data cleaning issue.
Unit
column. Pandas could not infer the data type. Below is some code to get it to work:
import sys
import json
import pandas as pd
import numpy as np
from arcgis.gis import GIS
def addresses_add_fields(sedf):
"""This subroutine adds fields from scratch or copies based on existing fields"""
sedf["Addtl_Loc1"] = ""
sedf["Building"] = ""
sedf["Inc_Muni"] = ""
sedf["NCT_Type"] = ""
sedf["Place_Type"] = ""
sedf["Room"] = ""
sedf["Uninc_Comm"] = ""
sedf["Unit"] = ""
sedf["DiscrpAgID"] = "DENCOAREA911.DST.TX.US"
sedf["Nbrhd_Comm"] = sedf["Alias1"].str[:100]
return sedf
def addresses_data_hygiene(sedf):
"""This subroutine performs simple data hygiene operations"""
sedf["Addtl_Loc2"] = (
sedf["Alias1"]
.str.cat(
sedf[["Alias2", "Alias3", "Alias4", "Alias5"]],
sep=" ",
na_rep="",
)
.str.strip()
.str[:255]
)
sedf["Building"].mask(
cond=sedf["ALTUNITTYPE"] == "BLDG",
other=sedf["ALTUNITID"],
inplace=True,
)
sedf["Building"].mask(
cond=sedf["UNITTYPE"] == "BLDG",
other=sedf["UNITID"],
inplace=True,
)
sedf["GPSX"].mask(
cond=(sedf["GPSY"] == 0) | (sedf["GPSY"] > 40),
other=0,
inplace=True,
)
sedf["GPSX"].mask(cond=sedf["GPSX"] > 0, other=0, inplace=True)
sedf["GPSY"].mask(cond=sedf["GPSY"] > 40, other=0, inplace=True)
sedf["Inc_Muni"].where(
cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
other=sedf["MUNICIPALITY"],
inplace=True,
)
sedf["Room"].mask(
cond=sedf["UNITTYPE"] == "RM",
other=sedf["UNITID"],
inplace=True,
)
sedf["Uninc_Comm"].mask(
cond=sedf["MUNICIPALITY"].str.contains("COUNTY", regex=True),
other=sedf["MUNICIPALITY"],
inplace=True,
)
sedf["Unit"].where(
cond=(sedf["UNITTYPE"] == "BLDG") | (sedf["UNITTYPE"] == "RM"),
other=sedf["UNITID"],
inplace=True,
)
return sedf
df = pd.read_excel(r"c:\temp\user_data\github.xlsx")
del df['Unnamed: 0'] # column to_excel creates for the index, so drop it.
df.SHAPE = df.SHAPE.str.replace("'", "\"").apply(json.loads) # convert the string to a dict
df.spatial.set_geometry("SHAPE") # set the Geometry Column
# Run your process
df = addresses_add_fields(df)
df = addresses_data_hygiene(df)
# Clean Up Your Data.
df['Unit'] = '' # this field was causing issues because there is no data to infer, so pandas can't figure it out.
df = df.convert_dtypes()
df = df.replace(
{
pd.NA: None,
np.NaN: None,
np.nan: None,
np.NAN: None,
pd.NaT: None,
}
)
# export to feature class.
spatial = df.spatial
fc = spatial.to_featureclass(r"c:\temp\scratch.gdb\github_xlsx")
print(fc)
So, I think this gets to the core of the argument.
If the destination format (feature class) cannot handle an NA type but it can handle a null; what's the harm in converting that? I get that from a programming perspective they are discrete units but likely that distinction doesn't matter to a large population of the user base, i.e. people shoving data into a feature class.
sanitize_columns
parameter? This would be a convenience function for a segment of the user base that are likely self-taught analysts turned programmer.df = df.replace(
{
pd.NA: None,
np.NaN: None,
np.nan: None,
np.NAN: None,
pd.NaT: None,
}
)
The danger is that we make assumptions about your data that are incorrect and downstream we cause harm somewhere else. In your case it might not matter, but let's say you are doing calculations about pollution measurements to detect cancer causing agents. If we get the replacement values incorrect or make generalized assumptions, this can have a large impact on the results.
@achapkowski, it seems that having a data sanitize parameter would empower users to solve both of those use-cases.
yes, I get what you are saying, but it would be general enough to include. Pandas includes all these tools, so it doesn't make sense to double up on everything.
I really think we've come full circle on this; what does the user expect when they push a spatially enabled dataframe to a file-based format: to_featureclass()
, to_arrow
, to_parquet
, etc.?
Now, ideally, should the user know better? Sure. But without additional documentation or error handling, how would the user know:
value
actually means field/columnI think it's easy to say, well, be a better programmer. This is fairly analogous to Recheck your data.
But the reality of your user base here is, by and large, not going to be computer science types that necessarily appreciate the difference between different types of nulls or who will have an expert understanding of dataframes generally. They're going to be self-taught, or otherwise lightly trained, and come out of the button-clicker, turned ArcPy user, to API for Python user.
In my opinion, this is a fairly compelling user story to add a few lines of error handling that has shown up in the Esri Ideas forums independently twice. This is a change that lowers the bar to entry, creates appropriate guardrails, and allows users to "stand on the shoulders of giants"-- --which is arguably the ethos of programming. We attempt to spare others the labor of re-solving problems that are well-understood by experienced programmers. People play in the Esri-verse because they just need things to work. Insisting that they become intimately familiar with a wide variety of pandas operations and fully understand the limitations of a variety of destination file types isn't congruent with the ease of use pattern.
@FeralCatColonist Sorry for taking so long to get back to you. Thank you for your thoughtful comment, we have taken this into consideration for further code enhancements. Your feedback has helped us a lot.
We put in some code for type conversions and this will be reflected at our next release. Please continue to let us know if you see other issues or have other enhancement requests that we can consider.
Is your feature request related to a problem? Please describe. When ingesting records into a spatially enabled dataframe from another source, a warning occurs stating:
This warning/error would be significantly more helpful if it stated the record row number and the column/field name or index. Is the value number supposed to be the column? I don't know and the end user probably doesn't either.
Describe the solution you'd like Enrich the error message to include: