Esri / arcgis-python-api

Documentation and samples for ArcGIS API for Python
https://developers.arcgis.com/python/
Apache License 2.0

Reading hosted table in AGOL using .query() returns a dataframe with exactly 1000 less records than are in AGOL hosted table #2139

Closed · theisenm12 closed 2 weeks ago

theisenm12 commented 1 month ago

I am trying to read a hosted AGOL table using the script below (`.query()`). When the script runs, saves a CSV, and then publishes the table, the result has exactly 1000 fewer records than the original hosted table. I don't have any other queries on the table, and I don't think there are settings within the hosted table that would cause this. I am using this to compare the existing hosted table to a new table of AGOL member information so I can create an updated dashboard with removed accounts still present (hence comparing the two tables, or in this case, dataframes).

Below is the code piece being used:

```python
# Existing item ID
item_id = 'xyz'

# Load the existing table from AGOL to check for removed members
existing_table_item = gis.content.get(item_id)

# Initialize DataFrame either from the existing table or as an empty DataFrame with the same columns as df
existing_table_df = (existing_table_item.tables[0].query().df
                     if existing_table_item
                     else pd.DataFrame(columns=df.columns))

# Debugging: check the columns of existing_table_df
print("Columns in existing_table_df:", existing_table_df.columns.tolist())

# Save the DataFrame to a CSV file
file_SaveName = 'Test_Export_Existing_Table_DF.csv'
existing_table_df.to_csv(file_SaveName, index=False)

# Define metadata for the CSV file
csv_properties = {
    'title': 'Test Export Existing Table DF',
    'type': 'CSV',
    'tags': 'AGOL Export, CSV',
    'description': 'CSV export of the existing table from AGOL',
}

# Upload the CSV file to AGOL content
csv_item = gis.content.add(item_properties=csv_properties, data=file_SaveName)
```

For what it's worth, the following code has the same result:

```python
# Access the previous values table
table_url = "xyz"
previous_values_table = Table(table_url)

# Retrieve existing values from the previous values table
existing_records = previous_values_table.query(where="1=1", out_fields="*").features

# Convert the list of features to a list of dictionaries (attributes)
records_dict_list = [feature.attributes for feature in existing_records]

# Convert the list of dictionaries to a pandas DataFrame
existing_records_df = pd.DataFrame(records_dict_list)

# Save the DataFrame to a CSV file
file_SaveName = 'Test_Export_Existing_Table_DF_20241022.csv'
existing_records_df.to_csv(file_SaveName, index=False)

# Define metadata for the CSV file
csv_properties = {
    'title': 'Test Export Existing Table DF 20241022',
    'type': 'CSV',
    'tags': 'AGOL Export, CSV',
    'description': 'CSV export of the existing table from AGOL',
}

# Upload the CSV file to AGOL content
csv_item = gis.content.add(item_properties=csv_properties, data=file_SaveName)
```

Error: There is no error message.

Expected behavior: I would expect this piece of code to read the entire hosted table and create a DataFrame consisting of all the records.


nanaeaubry commented 3 weeks ago

@theisenm12 I will test this as well, but can you make sure you are using the latest version (2.4.0) and, instead of the ContentManager `add()` method, use the Folder `add()` method?

So instead of `gis.content.add(...)`, use `folder.add(...)`; ContentManager `add()` has been deprecated in favor of Folder `add()`.

If you want to add to the root folder, you can get it with `folder = gis.content.folders.get()`. If you want a specific folder, pass the folder name to the `get()` method.
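A minimal sketch of that pattern, assuming a 2.4.0 environment (the folder name, file name, and item properties are placeholders):

```python
from arcgis.gis import GIS

gis = GIS("home")  # or GIS(url, username, password)

csv_properties = {
    'title': 'Test Export Existing Table DF',
    'type': 'CSV',
    'tags': 'AGOL Export, CSV',
    'description': 'CSV export of the existing table from AGOL',
}

# Root folder (no argument) or a named folder ("My_Folder" is hypothetical)
folder = gis.content.folders.get()
# folder = gis.content.folders.get("My_Folder")

# Folder.add() runs asynchronously; .result() blocks until the Item exists
csv_item = folder.add(csv_properties, file='Test_Export_Existing_Table_DF.csv').result()
```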

Please let us know if you can reproduce the issue under these conditions.

theisenm12 commented 3 weeks ago

@nanaeaubry Thank you for helping out and for letting me know gis.content.add() has been deprecated.

I am using an AGOL Notebook, so I thought it would be automatically updated to whatever the latest version is, but if not, please let me know how I can go about updating. I am fairly new to this system.

Do I still put the item ID in the parenthesis?

nanaeaubry commented 3 weeks ago

@theisenm12 ArcGIS Online is currently a version behind and should be updating before the end of the year, so just hold tight and it will update on its own.

If you want to test the latest version already, you can do a local install and use local notebooks or an IDE to run your script. Here is some information if that ever interests you: https://developers.arcgis.com/python/latest/guide/install-and-set-up/anaconda/
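If it helps, a quick way to check which version a given notebook or environment is actually running:

```python
import arcgis

# Prints the installed ArcGIS API for Python version, e.g. "2.3.0" or "2.4.0"
print(arcgis.__version__)
```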

This is the part of the code that will be deprecated and I have put the updated code below it:

```python
# Upload the CSV file to AGOL content
# Deprecated:
# csv_item = gis.content.add(item_properties=csv_properties, data=file_SaveName)
# Updated:
root_folder = gis.content.folders.get()
csv_item = root_folder.add(csv_properties, file=file_SaveName).result()
```

In the meantime I will try to test and let you know if I find anything

theisenm12 commented 3 weeks ago

@nanaeaubry Thank you! I appreciate it greatly. A coworker also had this problem, but he got 1000 extra duplicate features while using this `.query()` function. Not sure how or why. His script still works, though, so I'm not going to push that too much.

nanaeaubry commented 3 weeks ago

@theisenm12 I have tried your script above with 2.4.0 and cannot reproduce the issue.

I started with this number of features: [screenshot]

When getting the dataframe using query I have: [screenshot]

And then when adding and publishing the data back to AGO I get: [screenshot]

We did do some work on the query method for 2.4.0, so it might have been fixed through that work. Your coworker getting duplicates is also odd! I will leave this issue open but encourage you to try with 2.4.0 when you can (either locally or when Online releases it). If you keep getting the same error, it would be helpful to get some sample data we can test with.

If you cannot provide the data through GitHub, then we would have to go through Esri Support. I'll make sure to post back here if I find anything else.

theisenm12 commented 3 weeks ago

@nanaeaubry That's great that you are not getting the same issue. I hope it is 2.4.0. I will get with my IT team this week to install the newest release (I don't have permissions on my computer).

Do you know why AGOL is so far behind in updating to the newest API version?

I won't be able to share my data as it is sensitive. I have a current ticket with Esri Support where they are looking into it as well, but I know there is good support here too, so I wanted to spread it out.

nanaeaubry commented 3 weeks ago

@theisenm12 Yes let's leave this open until you can test it on your end.

AGOL is 'behind' with the Python API because we follow the same release cycle as ArcGIS Enterprise and ArcGIS Pro, while ArcGIS Online has its own separate release schedule, since it is less bound by versioning.

theisenm12 commented 3 weeks ago

@nanaeaubry So if I am using an ArcGIS Online Notebook, it is on the same schedule as Enterprise and Pro?

nanaeaubry commented 3 weeks ago

@theisenm12 No, ArcGIS Online Notebooks are released with ArcGIS Online, ArcGIS Enterprise Notebooks are released with ArcGIS Enterprise, and Pro Notebooks are released with Pro. Lots of teams to make the magic happen :)

The only way you can control the version you are using yourself is by setting up a local environment with the Python API and running it in local notebooks. Otherwise the version of the Python API will depend on the ArcGIS environment you are in.

As soon as ArcGIS Online updates, the Python API will update in Online Notebooks as well.

theisenm12 commented 3 weeks ago

@nanaeaubry Gotcha. I will get my version updated locally and then alter my script to run locally rather than from AGOL Notebooks.

theisenm12 commented 3 weeks ago

```python
root_folder = gis.content.folders.get()
csv_item = root_folder.add(csv_properties, file=file_SaveName).result()
```

Hey there,

For the above code, how do I get a specific AGOL folder to work in the `.get()` function? I plugged in the name of my folder and it spit out the following error: `NameError: name 'Account_Management' is not defined`

nanaeaubry commented 3 weeks ago

@theisenm12 Normally if the folder exists you should be able to get it back.

For example, I have this Folder: [screenshot]

So I can do:

```python
gis.content.folders.get("Feature_Layer")
```

And it returns: [screenshot]

theisenm12 commented 3 weeks ago

@nanaeaubry I am so sorry. The quotes strike again... I completely forgot about using quotation marks. I am pretty new to coding and APIs, so this is all very helpful. This worked for accessing the specific folder.

I am still running into the same issue with 1000 fewer records, but I have not updated to 2.4.0 yet. I have a meeting set with IT to get it installed tomorrow!

As for running locally, is there anything I should be aware of? I know I will need to provide my username and password to log in, but besides that?

nanaeaubry commented 3 weeks ago

@theisenm12 No worries it happens even to the most experienced :)

The only difference is that in a Notebook in ArcGIS Online you can use `gis = GIS("home")`, while locally, as you said, it will be `gis = GIS(<url>, <username>, <password>)`.

If you want to avoid doing this each time, you can assign a profile to your GIS object; this will save the url, username, and password to your local computer using a third-party library (it comes already installed with the Python API).

So what we do is: `gis = GIS("my_org_url.com", "my_username", "secret_password", profile="my_online_profile")`

You execute this once, and the next time you want to log in you can simply do: `gis = GIS(profile="my_online_profile")`

To check that you are correctly logged in, in both cases you can always run: `gis.users.me`

For more information on authentication: https://developers.arcgis.com/python/latest/guide/working-with-different-authentication-schemes/
For more information on profiles specifically: https://developers.arcgis.com/python/latest/guide/working-with-different-authentication-schemes/#storing-your-credentials-locally
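Putting that together, a minimal sketch (the URL, username, password, and profile name are all placeholders):

```python
from arcgis.gis import GIS

# First run: credentials are saved locally under the given profile name
gis = GIS("https://my_org.maps.arcgis.com", "my_username", "secret_password",
          profile="my_online_profile")

# Later runs: the stored profile is enough
gis = GIS(profile="my_online_profile")

# Check that you are logged in as the expected user
gis.users.me
```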

theisenm12 commented 3 weeks ago

@nanaeaubry What IDE, or where, are you running local scripts from? I was unable to get my test script to run from Python IDLE, so I went ahead and ran it from a notebook within ArcGIS Pro (v3.3.2), and it produced the same result (1000 fewer records than are present in the table in AGOL). I believe my IT person said they were only able to update the ArcGIS API for Python to 2.3.0. Is 2.4.0 released to the general public, or is it only available to some people?

I had another person tell me that it could be caused by values in the AGOL hosted table, but I have done a strenuous search and comparison through that table with no common denominators found, and different features are occasionally skipped on different runs.

Anywho, sorry for the novel, but I am a bit stumped on this. I am also in contact with Esri Premium Support, and they have not been able to replicate this either, though I don't know what system they are running it from.

theisenm12 commented 3 weeks ago

@nanaeaubry Also, what kind of print statement or debugging method did you use to see how many rows were in the dataframe?

nanaeaubry commented 3 weeks ago

@theisenm12 Hmm, that is a bit odd. It's always frustrating when others cannot reproduce the issue. I thought of a couple of things we can try to see what is happening.

  1. You can try return_count_only=True in the query. This will give back just the number of records matched by the query, and it is a pretty reliable parameter, so if you are missing 1000 records from this count, the issue might be with the data (see the snippet just after this list).
  2. You can similarly use return_ids_only=True, which will return all the ids related to your query. However, you will have to paginate through the results, because this only returns 2000 records at a time (2000 is the default, but it depends on the table settings).
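For (1), a minimal sketch of the count check:

```python
# return_count_only=True makes query() return just an integer count
table = gis.content.get("<your_table_item>").tables[0]
count = table.query(where="1=1", return_count_only=True)
print(count)
```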

Here is what (2) would look like:

```python
table = gis.content.get("<your_table_item>").tables[0]
total_count = table.estimates['count']  # total number of features in the table
ids = []
while True:
    result = table.query(where="1=1", return_ids_only=True, return_all_records=False, result_offset=len(ids))
    ids.extend(result['objectIds'])
    if len(ids) >= total_count:
        break
print(len(ids))
```

You can then query the features by those ids:

object_ids = ",".join(map(str,ids)) #needs to be a string of ids not list
result = table.query(where="1=1", object_ids=object_ids, return_all_records=True)
df = result.sdf

In the code above, `total_count` is the total number of features found in the Table. We use it to paginate through and collect all the ids in the table. Then we query with this list of ids so we get back all the results (hopefully).

I am using the Visual Studio Code debugger to see the data and the size of the dataframe, but if you want to see that in a notebook you can do `df.shape` and it will give you a tuple of (rows, columns).

theisenm12 commented 3 weeks ago

@nanaeaubry Good morning Nanae,

I implemented this debugging script running from ArcGIS Pro (v3.3.2) as follows:

```python
from arcgis.gis import GIS
from arcgis.features import FeatureLayerCollection
import pandas as pd
from datetime import datetime
import os

gis = GIS("home")

# Existing item ID
item_id = 'xyz'

existing_table = gis.content.get(item_id).tables[0]
total_count = existing_table.estimates['count']  # total number of features in the table
ids = []
while True:
    result = existing_table.query(where="1=1", return_ids_only=True, return_all_records=False, result_offset=len(ids))
    ids.extend(result['objectIds'])
    if len(ids) >= total_count:
        break
print(len(ids))

object_ids = ",".join(map(str, ids))  # needs to be a string of ids, not a list
result = existing_table.query(where="1=1", object_ids=object_ids, return_all_records=True)
df = result.sdf

row_count = df.shape[0]
print(row_count)
```

Printed outcomes: 3378 (from `print(len(ids))`) and 2378 (from `print(row_count)`).

The outcome of the first print statement was 3378 records, the correct number of records in the table! I think this is good news, as it might mean there is nothing wrong with the data.

However, the second piece of the code results in 1000 fewer records being generated into the dataframe. It seems as if the error is occurring when the dataframe is being created? Have you seen something like that before?

nanaeaubry commented 3 weeks ago

@theisenm12 Hmmm, ok, the last resort is to try the arcpy Copy Features tool and see if this will fix it. I would suggest this to the support person as well, because they would have a workflow for it.

Also, if you own the Table item, try exporting it out and see how many results you get.
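If it comes to that, a minimal sketch of the export route, assuming you own the item (the export title is a placeholder):

```python
# Export the hosted table to a new CSV item in your content, then compare its record count
item = gis.content.get(item_id)
exported_item = item.export("Existing_Table_Export", "CSV", wait=True)
```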

theisenm12 commented 3 weeks ago

@nanaeaubry So I actually ended up finding a solution that worked, by happenstance.

I guess for some reason I needed to implement a batched read in the code. I will paste the code below for your awareness.

```python
existing_table = gis.content.get(item_id).tables[0]
total_count = existing_table.estimates['count']
print("Estimated total count:", total_count)

# Retrieve all object IDs in batches
ids = []
while True:
    result = existing_table.query(where="1=1", return_ids_only=True, return_all_records=False, result_offset=len(ids))
    ids.extend(result['objectIds'])
    print("Current count of IDs:", len(ids))  # Debug statement
    if len(ids) >= total_count:
        break

print("Total object IDs retrieved:", len(ids))

# Query in batches to handle potential ArcGIS Online limits on large queries
batch_size = 500  # Adjust batch size if necessary
all_rows = []

for i in range(0, len(ids), batch_size):
    batch_ids = ids[i:i + batch_size]
    batch_ids_str = ",".join(map(str, batch_ids))
    result = existing_table.query(where="1=1", object_ids=batch_ids_str, return_all_records=True)
    batch_df = result.sdf
    all_rows.append(batch_df)
    print(f"Batch {i // batch_size + 1}: Retrieved {batch_df.shape[0]} rows")  # Debug each batch

# Combine all batches into one DataFrame
existing_table_df = pd.concat(all_rows, ignore_index=True)
print("Total row count after batching:", existing_table_df.shape[0])

# Final row count in DataFrame
row_count = existing_table_df.shape[0]
print("Final row count in DataFrame:", row_count)
```

This printed out final counts that match. Some of the code still confuses me a bit, but it fixes what I was struggling with. I had gone down a pagination path before but was unable to get it to function; this seems similar, but it works! Woohoo. Thank you for helping me get to this point. If you have any ideas on how I could improve this script, I am all ears, but I don't want to mess with it too much if it is working!

nanaeaubry commented 2 weeks ago

@theisenm12 Great, I will go ahead and close this issue then. The limit on these batches is the maxRecordCount property on a Feature Layer or Table. Glad you got it working, and that is a good path to go down. We do that internally for some cases, but not for object ids, so you have to do it on your own (just like you do in the code). Nice catch!
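For reference, a minimal snippet to read that limit off a table:

```python
# maxRecordCount is exposed on the layer/table properties
table = gis.content.get(item_id).tables[0]
print(table.properties.maxRecordCount)
```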

theisenm12 commented 2 weeks ago

@nanaeaubry So I went into the admin REST settings on the feature layer and found that maxRecordCount was set to 1000. I changed it to 5000 and it still did not work properly. I found a few threads where people said changing this number doesn't make a difference.

What does changing that maxRecordCount number do?