Index error during summarize step of staging

PermafrostDiscoveryGateway / viz-staging

PDG Visualization staging pipeline

Apache License 2.0

2 stars 1 forks source link

Index error during summarize step of staging #12

Closed robyngit closed 1 year ago

robyngit commented 1 year ago

Originally posted by @julietcohen in https://github.com/PermafrostDiscoveryGateway/viz-workflow/issues/10#issuecomment-1322823612

Reproducing staging error versus reproducing successful staging with different data inputs

Robyn found a way to fix to the CRS mismatch error mentioned in the first comment of this issue. See her solution (also linked in first comment) includes changing the TMS to WGS1984Quad instead of using WorldCRS84Quad.

However, I was able to run the staging step on a very small data subset (2 files that overlap spatially, provided by Robyn) without changing the TMS. Additionally, I reproduced the unique error reported above (IndexError: index 0 is out of bounds for axis 0 with size 0) with the exact same config and python environment, but with different input files. Both these workflows were run in parallel with parsl and process geopackages.

The error reported above that does not mention CRSs (IndexError: index 0 is out of bounds for axis 0 with size 0) can be found in this notebook. Staging summary for this file is error_staging_summary.csv.
The successful staging with the "old" TMS WorldCRS84Quad can be found in this notebook. Staging summary for this file is staging_summary.csv.
links were made with commit ID's so should persist even as documents are updated

The config and list of installed packages for these workflows is included within these notebooks just below the import chunk, as well as the following:

{
    "simplify_tolerance": 0.0001,
    "tms_id": "WorldCRS84Quad",
    "z_range": [0, 11],
    "statistics": [
        {
            "name": "polygon_count",
            "weight_by": "count",
            "property": "centroids_per_pixel",
            "aggregation_method": "sum",
            "resampling_method": "sum",
            "val_range": [0, null],
            "nodata_val": 0,
            "nodata_color": "#ffffff00",
            "palette": ["#d9c43f", "#d93fce"]
        },
        {
            "name": "coverage",
            "weight_by": "area",
            "property": "area_per_pixel_area",
            "aggregation_method": "sum",
            "resampling_method": "average",
            "val_range": [0, 1],
            "nodata_val": 0,
            "nodata_color": "#ffffff00",
            "palette": ["#d9c43f", "#d93fce"]
        }
    ],
    "deduplicate_at": ["staging"],
    "deduplicate_method": "neighbor",
    "deduplicate_keep_rules": [["staging_filename", "larger"]],
    "deduplicate_overlap_tolerance": 0.1,
    "deduplicate_overlap_both": false,
    "deduplicate_centroid_tolerance": null
  }

Package              Version
-------------------- -----------
addict               2.4.0
affine               2.3.1
asttokens            2.1.0
attrs                22.1.0
backcall             0.2.0
bcrypt               4.0.1
certifi              2022.9.24
cffi                 1.15.1
charset-normalizer   2.1.1
click                8.1.3
click-plugins        1.1.1
cligj                0.7.2
coloraide            0.18.1
colormaps            0.3
comm                 0.1.0
ConfigArgParse       1.5.3
contourpy            1.0.6
cryptography         38.0.3
cycler               0.11.0
Cython               0.29.32
dash                 2.7.0
dash-core-components 2.0.0
dash-html-components 2.0.0
dash-table           5.0.0
debugpy              1.6.3
decorator            5.1.1
dill                 0.3.6
entrypoints          0.4
executing            1.2.0
fastjsonschema       2.16.2
filelock             3.8.0
Fiona                1.8.22
Flask                2.2.2
fonttools            4.38.0
geopandas            0.11.1
globus-sdk           3.14.0
idna                 3.4
importlib-metadata   5.0.0
ipykernel            6.18.0
ipython              8.6.0
ipywidgets           8.0.2
itsdangerous         2.1.2
jedi                 0.18.1
Jinja2               3.1.2
joblib               1.2.0
jsonschema           4.17.0
jupyter_client       7.4.7
jupyter_core         5.0.0
jupyterlab-widgets   3.0.3
kiwisolver           1.4.4
laspy                2.3.0
llvmlite             0.39.1
lz4                  4.0.2
MarkupSafe           2.1.1
matplotlib           3.6.2
matplotlib-inline    0.1.6
morecantile          3.2.1
munch                2.5.0
nbformat             5.5.0
nest-asyncio         1.5.6
numba                0.56.4
numpy                1.22.4
open3d               0.16.0
packaging            21.3
pandas               1.5.1
paramiko             2.12.0
parsl                2022.11.14
parso                0.8.3
pdgpy3dtiles         0.0.1
pdgraster            0.1.0
pdgstaging           0.1.0
pexpect              4.8.0
pickleshare          0.7.5
Pillow               9.3.0
pip                  22.2.2
platformdirs         2.5.4
plotly               5.11.0
prompt-toolkit       3.0.33
psutil               5.9.4
psycopg2-binary      2.9.5
ptyprocess           0.7.0
pure-eval            0.2.2
pycparser            2.21
pydantic             1.10.2
pygeos               0.13
Pygments             2.13.0
PyJWT                2.6.0
PyNaCl               1.5.0
pyparsing            3.0.9
pyproj               3.4.0
pyquaternion         0.9.9
pyrsistent           0.19.2
python-dateutil      2.8.2
pytz                 2022.6
PyYAML               6.0
pyzmq                24.0.1
rasterio             1.3.4
requests             2.28.1
Rtree                0.9.7
scikit-learn         1.1.3
scipy                1.9.3
setproctitle         1.3.2
setuptools           65.5.0
Shapely              1.8.5.post1
six                  1.16.0
snuggs               1.4.7
stack-data           0.6.1
tblib                1.7.0
tenacity             8.1.0
threadpoolctl        3.1.0
tornado              6.2
tqdm                 4.64.1
traitlets            5.5.0
triangle             20220202
typeguard            2.13.3
typing_extensions    4.4.0
urllib3              1.26.12
viz-3dtiles          0.0.1
wcwidth              0.2.5
Werkzeug             2.2.2
wheel                0.37.1
widgetsnbextension   4.0.3
zipp                 3.10.0

robyngit commented 1 year ago

The error IndexError: index 0 is out of bounds for axis 0 with size 0 occurs during the summarize step when the stager attempts to log the event here. I was able to reproduce the error that Juliet found using her notebook. Here is a more minimal example (uses Ingmar's data on datateam):

Minimal example code

```python from pathlib import Path import pdgstaging stager = pdgstaging.TileStager({ "simplify_tolerance": 0.0001, "tms_id": "WorldCRS84Quad", "z_range": [0, 11], "deduplicate_at": ["staging"], "deduplicate_method": "neighbor", "deduplicate_keep_rules": [["staging_filename", "larger"]], "deduplicate_overlap_tolerance": 0.1, "deduplicate_overlap_both": False, "deduplicate_centroid_tolerance": None }) base_dir = Path('/home/pdg/data/nitze_lake_change/data_sample_2022-09-09') data_paths = [str(p) for p in base_dir.glob('**/lake_change.gpkg')] for path in data_paths: stager.stage(path) ```

The error seems to occur following some deduplication events, and it looks like the list of tiles used for the summary is empty for some reason.

julietcohen commented 1 year ago

Troubleshooting Update

Error persists:

with "new" TMS WGS1984Quad
with new environment & imported pdgstaging with #egg=pdgstaging extension at end of GitHub URL

Error does not occur:

when edited viz-staging/TileStager.py: commented out the logging event that the error points to:

# Log the total time to create the summary
    # logger.info(
    #     'Summarized Tile(z={}, x={}, y={}) in {}'
    #     .format(
    #         tile_props[0]['tile_z'],
    #         tile_props[0]['tile_x'],
    #         tile_props[0]['tile_y'],
    #         (datetime.now() - start_time)
    #     )
    # )

19088 staged tiles created
means the error is within the list of tile properties

when edited viz-staging/TileStager.py: commented out the logging event that the error points to (see above) as well as modified the following code to print the properties for each tile within the log, rather than just creating them:

def get_tile_properties(self, tile):
    """
        For a given TMS tile, return certain properties of that tile as a
        dict for summary purposes

        Parameters
        ----------
        tile : morecantile.commons.Tile
            The TMS tile to get properties for
    """
    bounds = self.tiles.get_bounding_box(tile)

    properties_dict = {
        'tile_x': tile.x,
        'tile_y': tile.y,
        'tile_z': tile.z,
        'tile_top': bounds['top'],
        'tile_right': bounds['right'],
        'tile_bottom': bounds['bottom'],
        'tile_left': bounds['left']
    }

    # print statements for troubleshooting indexing error:
    logger.info(f'Properties of tile {tile} to be recorded in staging_summary.csv: {properties_dict}')

    return properties_dict

within log.log, the print statement seemingly worked for all tiles:

means the error is in the step that takes these tile property dictionaries as inputs, and makes what I believe is one combined dictionary called self.get_all_tile_properties, so I added a logging event immediately following creating that dictionary :

self.get_all_tile_properties = np.vectorize(
            self.get_tile_properties, otypes=[dict]
        )

logger.info(f'Combined tile properties dictionary: {self.get_all_tile_properties}')

Which resulted in 4 new print statements in the log.log, 3 of which are in the following snippet:

I am not sure how to interpret the object self.get_all_tile_properties, which printed as 0x7f1b40d46610 and similar codes.

julietcohen commented 1 year ago

I added a step within viz-staging/TileStager.py (here) so that the length of the list of properties is printed in the log.log, as well as the first 10 properties, which should help us determine how many lists are empty. I also tried to assign NA to any empty property lists, so that the logging statement can print successfully for those tiles:

# create list object of all tile properties to be summarized
        tile_props = self.get_all_tile_properties(tiles)

        # print the first 10 tile properties 
        logging.info(f'First 10 summarized tile properties: {tile_props[:10]}')
        # print how many properties there are:
        logging.info(f'Length of summarized tile properties list : {len(tile_props)}')

        if (len(tile_props) == 0):
            tile_props[0]['tile_z'] = 'NA'
            tile_props[0]['tile_x'] = 'NA'
            tile_props[0]['tile_y'] = 'NA'

According to the log.log, this last step errored because I am assigning a string to a list index that doesn't exist, when really what I want to do is populate that list's first index with NAs.

I wonder if instead of trying to populate empty property lists, I should instead print which files have empty lists, and investigate why. The print statement Length of summarized tile properties list : 0 only occurred twice in the log.log!

robyngit commented 1 year ago

Very nice debugging, @julietcohen 🎉

I am not sure how to interpret the object self.get_all_tile_properties, which printed as 0x7f1b40d46610 and similar codes.

get_all_tile_properties is a method that is part of the TileStager class. In python, the word "self" is used to refer to an instance of a class. So in self.get_all_tile_properties, self is the specific TileStager you configured for this workflow. Here is another way you can see the get_all_tile_properties method:

my_tile_stager = pdgstaging.TileStager("my_config_file.json")
my_tile_stager.get_all_tile_properties

The code 0x7f1b40d46610 is the memory address of that method - the location where python is storing that method.

So get_all_tile_properties is a method, not a dict. The method is created and set on the class here. It is a vectorized version of another TileStager method: get_tile_properties. The difference between the two is that get_all_tile_properties takes a list of tiles and gives back a list of dicts, whereas get_tile_properties takes just one tile and gives back a single dict.

In short, this logger.info(f'Combined tile properties dictionary: {self.get_all_tile_properties}') is just logging the location of a method. If you want to log the output of that method, you will probably want to add a log statement just below where the method is called. logger.info(f'Combined tile properties dictionary: {tile_props}')

julietcohen commented 1 year ago

Thank you for the explanation, @robyngit! That makes sense.

I did end up logging the output of the method by printing tile_props along with what should be the tiles identifier, tiles:

tile_props = self.get_all_tile_properties(tiles)

# print how many properties there are:
logging.info(f'Length of summarized tile properties list: {len(tile_props)} for tiles {tiles}')
# print the first 10 tile properties 
logging.info(f'First 10 summarized tile properties: {tile_props[:10]} for tiles {tiles}')

But the log.log statements showed that all empty lists of properties were assigned to empty tiles objects, which is not surprising:

This means that I am still on the hunt for the z, x, and y identifiers for the tiles that have no properties.

julietcohen commented 1 year ago

Implemented the following checks into TileStager.py at the very start of summarize():

        if gdf is None:
            logging.info(f'WARNING: Geodataframe is None when fed into summarize().')
            return None

        # print gdf columns and their types
        logger.info(f'Geodataframe has columns {gdf.dtypes}')

        # check if gdf has only columns with 0 rows
        if (gdf.shape[0]==0):
            logging.info(f'WARNING: Geodataframe has 0 rows when fed into summarize().')

        # check if any NA's are present in the gdf
        if (gdf.isnull().sum().sum() != 0):
            logger.info(f'WARNING: Geodataframe has NaN values in {gdf.columns[gdf.isnull().any()]} when fed into summarize()')

This resulted in many warnings in the log.log that indicate there are NaN values in just the tile identifier column: [INFO] pdgstaging.TileStager: WARNING: Geodataframe has NaN values in Index(['staging_tile'], dtype='object') when fed into summarize()

julietcohen commented 1 year ago

Included checks for NaN within the tiles series both when it is originally saved from the properties here as well as after it is added "back" to the data here. The tiles series only has NaN values after it is added back to the data, before it is fed into summarize().

Example of this logged event:

2022-11-23 11:23:11,015 [INFO] pdgstaging.TileStager: Tiles that were saved to data are 65355    (506, 334, 11)
65433    (506, 334, 11)
65482    (506, 334, 11)
0                   NaN
1                   NaN
2                   NaN
Name: staging_tile, dtype: object. Does this series have NaN Values?:True

julietcohen commented 1 year ago

First Check

Inserted check after deduplication during staging for number of rows within the data geodataframe, and number of rows within the series object that was saved before deduplication was executed.

            if os.path.isfile(tile_path):
                # If the file exists and config is set to deduplicate, then
                # open the file, append the new data, and identify duplicates.
                # Remove the data if the config is set to remove duplicates
                # during staging. Overwrite existing file.

                # count how many times the file exists
                logger.info('File exists, so starting deduplication if config is set to deduplicate at `staging`')

                dedup_method = self.config.get_deduplication_method()
                if dedup_method is not None:
                    data = self.combine_and_deduplicate(data, tile_path)

                    # check if deduplicating step during staging changes the number of rows of `data`
                    len_data = len(data)
                    len_tiles = len(tiles)
                    are_equal = len(data) == len(tiles)
                    logger.info(f'After deduplication, # rows in `data` gdf is {len_data}, & NaN values in cols:{data.columns[data.isnull().any()]}, & # rows in tiles series are: {len_tiles}, do lengths match?:{are_equal}')

                # If the file exists and config is not set to deduplicate, then
                # just append the new data to the existing file.
                else:
                    mode = 'a'

Examples of logged events:

[INFO] pdgstaging.TileStager: After deduplication, # rows in `data` gdf is 5, & NaN values in cols:Index([], dtype='object'), & # rows in tiles series are: 1, do lengths match?:False

[INFO] pdgstaging.TileStager: After deduplication, # rows in `data` gdf is 23, & NaN values in cols:Index([], dtype='object'), & # rows in tiles series are: 12, do lengths match?:False

Results

run #	total # logged events for this check	# logged events where lengths did not match	# logged events where lengths did match
1	850	546	304
2	5783	497	5286
3	850	546	304

I believe the number of logged events (850 for 2 out of 3 runs) for this check means there were that many deduplication events with this sample data, because the config is set to deduplicate at staging and the file existed that many times.

How could there be such a drastically different amount of deduplication events with the same input data, config, etc.? My suggested answer: I cleared and shutdown my parsl engine after the second run cause it took so long and gave such different numbers compared to the first run, which likely was the right thing to do since run 3 gave the same stats as run 1!!! That definitely helped my sanity.

Second Check

Inserted check similar to the one I documented in the last comment after deduplication that checks the number of NaN values in the original tiles object and the "new" tiles series in the data geodataframe.

            try:
                # Ignore the FutureWarning raised from geopandas issue 2347
                with warnings.catch_warnings():
                    warnings.simplefilter('ignore', FutureWarning)
                    data.to_file(tile_path, mode=mode)

                # Record what was saved
                data[self.props['tile']] = tiles

                tiles_in_data = data[self.props['tile']]

                # log if the tiles series has NaN values after we add it to data that is about to be fed into summarize():
                logger.info(f'Original tiles series was {tiles}, do original tiles have NaNs?:{tiles.hasnans}. Tiles series that was saved to `data` is {tiles_in_data}. Does this new series have NaNs?:{tiles_in_data.hasnans}, did the absence/presence of NaNs remain consistent?:{tiles.hasnans==tiles_in_data.hasnans}')

                self.summarize(data)
            finally:
                # Track the end time, the total time, and the number of vectors
                logger.info(
                    f'Saved {tile_path} in {datetime.now() - start_time}'
                )
                self.__release_file(lock)

Examples of logged events:

[INFO] pdgstaging.TileStager: Original tiles series was 41432    (476, 261, 11)
41492    (476, 261, 11)
41615    (476, 261, 11)
Name: staging_tile, dtype: object, do original tiles have NaNs?:False. Tiles series that was saved to `data` is 41432    (476, 261, 11)
41492    (476, 261, 11)
41615    (476, 261, 11)
0                   NaN
4                   NaN
Name: staging_tile, dtype: object. Does this new series have NaNs?:True, did the absence/presence of NaNs remain consistent?:False

[INFO] pdgstaging.TileStager: Original tiles series was 45037    (476, 270, 11)
45055    (476, 270, 11)
45079    (476, 270, 11)
45090    (476, 270, 11)
45100    (476, 270, 11)
45114    (476, 270, 11)
Name: staging_tile, dtype: object, do original tiles have NaNs?:False. Tiles series that was saved to `data` is 45037    (476, 270, 11)
45055    (476, 270, 11)
45079    (476, 270, 11)
45090    (476, 270, 11)
45100    (476, 270, 11)
45114    (476, 270, 11)
0                   NaN
1                   NaN
2                   NaN
3                   NaN
4                   NaN
5                   NaN
6                   NaN
7                   NaN
8                   NaN
9                   NaN
12                  NaN
Name: staging_tile, dtype: object. Does this new series have NaNs?:True, did the absence/presence of NaNs remain consistent?:False

The number of NaNs present in the "new" tiles series that is saved to data does no seem to be consistent 😒

Results

run #	total # logged events for this check	# logged events where no `NaN`s exist in the original `tiles` series, but `NaN`s exist in the "new" tiles series	# logged events where no `NaN`'s exist in the original `tiles` series and no `NaN`'s exist in the "new" tiles series, either
1	19938	850	19088
2	19938	5436	14502
3	19938	850	19088

14502 + 5436 = 19938 !!! 19088 + 850 = 19938 !!!

Note that runs 1 & 3 took ~13 min while run 2 took ~40 min.

KastanDay commented 1 year ago

This is fascinating debugging. Over the holidays I realized detailed logging (as you've show here) will be key to ensuring every single tile is accounted for at every stage in the visualization pipeline.

As you're doing this analysis, maybe keep in mind if we should further improve logging detail to ensure 100% correct outputs.

julietcohen commented 1 year ago

Thanks Kastan, that's a good point. I agree that integrating more detailed logging statements into the deduplication step will be helpful if certain criteria are met (like presence of NaNs), and maybe uses the presence of that logged message to produce a more helpful error message that pinpoints the issue rather than a general indexing error.

Does the source of the error occur within `combine_and_deduplicate()` ?

Within the function combine_and_deduplicate(), we call polygon_prop(), which "Gets the configured property name for new properties created by the tiling process", which has just 1 parameter besides self: the property that we are interested in getting. In the documentation for that parameter, the listed options do not include 'duplicated', which is what we input for that parameter here when we drop duplicated polygons. Rather, the listed options include ‘centroid_x’ or ‘centroid_y’ or ‘area’ or ‘tile’ or ‘centroid_tile’ or ‘filename’ or ‘identifier’ or ‘centroid_within_tile’.

Potential Explanation 1: Perhaps this function's documentation simply omits the 'duplicated' option for this parameter, when in fact it is an option for this parameter. To determine if this is true, maybe we should first print the options for this parameter within log.log. The function uses the python dictionary get() method ~~which only has 1 required argument: keyname: The keyname of the item you want to return the value from, so we are returning the value from duplicated.~~ (update: get() is pulled from here.)

Potential Explanation 2: We cannot return the value for duplicated because it does not exist, and we did not include the second optional argument for get() which is value, which allows us to return a value besides 'None' if the key does not exist, so this was overlooked. (update: staging_duplicated is indeed a column present in gdf

julietcohen commented 1 year ago

Inserted log statement during deduplication here to print the contents of the column staging_duplicated, which seems to be the data that flags each polygon as a duplicate or not a duplicate.

Also inserted log statement to check for columns and number of rows present before and after deduplication (meaning before and after gdf = gdf[~gdf[prop_duplicated]])

        # drop duplicated polygons, if config is set to deduplicate here
        if self.config.deduplicate_at('staging'):

            prop_duplicated = self.config.polygon_prop('duplicated')

            # contents of column
            contents = gdf['staging_duplicated']

            logger.info(f'Dropping duplicated polygons based on the column {prop_duplicated}. Contents of this column are {contents}')

            # define cols before dedup
            bef_cols = gdf.columns
            # define len before dedup
            bef_len = len(gdf)

            if prop_duplicated in gdf.columns:
                gdf = gdf[~gdf[prop_duplicated]]

            # define cols after dedup
            af_cols = gdf.columns
            # define len after dedup
            af_len = len(gdf)

        logger.info(
            f'Finished deduplication in {datetime.now() - dedup_start_time}. '
            f'Before deduplication, columns present in gdf are {bef_cols}. After deduplication,columns present in gdf are {af_cols}. Do cols remain same?:{bef_cols==af_cols} '
            f'Before deduplication, length of gdf is {bef_len} After deduplication, length of gdf is {af_len}. Does length remain same?:{bef_len==af_len} '
        )

        return gdf

Examples of logged events for the contents of the column staging_duplicated:

[INFO] pdgstaging.TileStager: Dropping duplicated polygons based on the column staging_duplicated. Contents of this column are 2262    False
2305    False
2382    False
2469    False
2470    False
2474    False
2529    False
2546    False
2652    False
0       False
1       False
2       False
3       False
4       False
2326     True
2367     True
2371     True
2492     True
2551     True

[INFO] pdgstaging.TileStager: Dropping duplicated polygons based on the column staging_duplicated. Contents of this column are 0       False
1       False
2       False
4685     True
4695     True
4855     True

The columns remained the same for 850 files (all those that are deduplicated)
Number of rows remains the same for 15/850 files.
Number of rows changed for 835/850 files.

julietcohen commented 1 year ago

Checked out the contents of the staging_duplicated column after deduplication as well (here, within the function combine_and_deduplicate()). It seems that the removed rows are those that are TRUE which is what the code is supposed to do.

        # drop duplicated polygons, if config is set to deduplicate here
        if self.config.deduplicate_at('staging'):

            prop_duplicated = self.config.polygon_prop('duplicated')

            # contents of column before dedup
            bef_contents = gdf['staging_duplicated']
            # len before dedup
            bef_len = len(gdf)

            if prop_duplicated in gdf.columns:
                gdf = gdf[~gdf[prop_duplicated]]

            # contents of column after dedup
            af_contents = gdf['staging_duplicated']
            # define len after dedup
            af_len = len(gdf)

        logger.info(
            f'Finished deduplication in {datetime.now() - dedup_start_time}. '
            f'Before deduplication, length of gdf is {bef_len} After deduplication, length of gdf is {af_len}. Does length remain same?:{bef_len==af_len} '
            f'Before deduplication, contents of staging_duplicated column is {bef_contents}. After deduplication, contents of staging_duplicated column is {af_contents}.'
        )

        return gdf

Example of logged events:

[INFO] pdgstaging.TileStager: Finished deduplication in 0:00:00.090229. Before deduplication, length of gdf is 10 After deduplication, length of gdf is 5. Does length remain same?:False Before deduplication, contents of staging_duplicated column is 0        False
1        False
2        False
3        False
4        False
16724     True
16752     True
16753     True
16763     True
16764     True
Name: staging_duplicated, dtype: bool. After deduplication, contents of staging_duplicated column is 0    False
1    False
2    False
3    False
4    False
Name: staging_duplicated, dtype: bool.

That means that we can probably narrow down that the error is occurring:

still within the function save_tiles()
not within the function combine_and_deduplicate() (which is called within save_tiles())
before summarize()
perhaps it has to do with indexing, still seems to have to do with NA values

julietcohen commented 1 year ago

Removed the code the suppresses geopandas warning 2347 here.

try:
                # Ignore the FutureWarning raised from geopandas issue 2347
                # with warnings.catch_warnings():
                #     warnings.simplefilter('ignore', FutureWarning)
                #     data.to_file(tile_path, mode=mode)

                data.to_file(tile_path, mode=mode)

No printed warning within my notebook or log.log. Based on the documentation of the warning here, it seems like this has to do with saving tiles to .shp format, which does not happen within the staging step since we save to .gpkg.

julietcohen commented 1 year ago

Review

Going back to taking a closer look at the presence of NaN values in the column data[self.props['tile']] that is manipulated within save_tiles() starting here and then assigned back to the column of the data gdf here.

After saving the "original" tile series to the object tiles, the column data[self.props['tile']] is first manipulated by converting it to string type.
Then a new column data[self.props['centroid_tile']] is created from that, so they are identical for now.
Then deduplication is executed with combine_and_deduplicate(), which I think is done correctly based on my logging experimentation above.
Then the column data[self.props['tile']] is changed to be the original tiles series (which is not str) so it is no longer identical to the column data[self.props['centroid_tile']]. It needs to be not str because when we reference x, y, and z for the summary, those cannot be pulled from a string.

In my earlier logging experimentation, I found that NaN values were present not in the original tiles series but in the final version of that column data[self.props['tile']].

Inserted "intermediate" logging

Now I inserted an intermediate check for NaN values in case they appear in the column after it is converted to string but before it is replaced by the original tiles series. The logging statements show that there are no NaN values at that intermediate step, so they only appear starting here, as documented above.

Proposed solution: move assignment of tiles series to post-deduplication

When we remove deduplicated polygons, we remove all rows in the gdf with True for the duplicated attribute. Because we remove rows after we assign the tiles series, the tiles series has more rows because it represents the polygons before some were removed. So if we assign the tiles series after we remove the polygons, it should reflect only those that were not duplicates. Since the column data[self.props['centroid_tile']] is created from the tiles object, but it undergoes deduplication (unlike the tiles series object) move the assignment of tiles object to after deduplication, using the centroid_tile column.

I want to create the tiles column from the data[self.props['centroid_tile']] after deduplication, but it must be not a string. By printing data[self.props['tile']].dtypes before it is converted to a string, I just got "object" returned. The object format looks like 6873 Tile(x=475, y=238, z=11) while the string format looks like 6873 (475, 238, 11).

Moving forward, I'm working on creating a new property to the config object: tiles_obj (not a string) using self.set() that will be used to overwrite tiles series within combine_and_deduplicate() after deduplication. Then, when tiles is appended back during summarize(), the original tiles series is used if there was no deduplication, and the overwritten tiles series is used if there was deduplication.

julietcohen commented 1 year ago

Attempted solution to indexing error results in staging only 3 files with silent error / block

As described in the previous comment, a possible solution to the indexing error would be to move the assignment of the tiles series to after deduplication so it only reflects polygons that are not duplicates. Because the tiles and centroid_tile columns must be strings to save as attributes (rather than the custom morecantile object class), Robyn suggested that it might be possible to retain the conversion of these columns to strings here, then use the method self.tiles.tile_from_str() (documented here) to convert the tile column back to morecantile class in order to save with summarize().

Because tile_from_str() accepts a single tile string to convert to morecantile class, I use a loop to iterate over all strings within the tiles series after deduplication. I append each of the newly converted morecantile objects into a list, then append that list back into the data dataframe as the tile column to finally feed into summarize().

Modified save_tiles() in TileStager.py

``` def save_tiles(self, gdf=None): """ Given a processed GeoDataFrame, save vectors into tiled vector files. Parameters ---------- gdf : GeoDataFrame The GeoDataFrame to save """ if gdf is None: # TODO give warning return None for tile, data in gdf.groupby(self.props['tile']): # Get the tile path tile_path = self.tiles.path_from_tile(tile, base_dir='staged') self.tiles.create_dirs(tile_path) # Lock the tile so that multiple processes don't try to write to lock = self.__lock_file(tile_path) # Track the start time, the tile, and the number of vectors start_time = datetime.now() logger.info( f'Saving {len(data.index)} vectors to tile {tile_path}' ) # Save a copy of the tile column for the summary #tiles = data[self.props['tile']].copy() # Tile must be a string for saving as attribute data[self.props['tile']] = data[self.props['tile']].astype('str') tile_strings = data[self.props['tile']].astype('str') data[self.props['centroid_tile']] = tile_strings # Open the file in write mode by default mode = 'w' if os.path.isfile(tile_path): # If the file exists and config is set to deduplicate, then # open the file, append the new data, and identify duplicates. # Remove the data if the config is set to remove duplicates # during staging. Overwrite existing file. # count how many times the file exists logger.info('File exists, so starting deduplication if config is set to deduplicate at `staging`') dedup_method = self.config.get_deduplication_method() if dedup_method is not None: data = self.combine_and_deduplicate(data, tile_path) # If the file exists and config is not set to deduplicate, then # just append the new data to the existing file. else: mode = 'a' try: # Ignore the FutureWarning raised from geopandas issue 2347 with warnings.catch_warnings(): warnings.simplefilter('ignore', FutureWarning) data.to_file(tile_path, mode=mode) # convert each tile from string format to morecantile format # so it can be added to summary # first create series of tiles in str format tiles_str = data[self.props['tile']].copy() tiles_morecantile = [] for tile in tiles_str: tile_morecantile = self.tiles.tile_from_str(tile) tiles_morecantile.append(tile_morecantile) return tiles_morecantile # check that the above loop converted tile strings to morecantile objects the way I expect logger.info(f'tiles_str series is {tiles_str} and tiles_morecantile series is {tiles_morecantile}') # Record what was saved data[self.props['tile']] = tiles_morecantile self.summarize(data) finally: # Track the end time, the total time, and the number of vectors logger.info( f'Saved {tile_path} in {datetime.now() - start_time}' ) self.__release_file(lock) ```

While I acknowledge this code may need to be adjusted before it works the way I hope, some mysterious behavior resulted. Staging in parallel only resulted in 3 staged tiles, with no reported errors.

Deduplication did not occur at all, as indicated by an absence of the logging statement File exists, so starting deduplication if config is set to deduplicate at staging.

Additionally, no staging_summary.csv was written, meaning that summarize() did not run.

robyngit commented 1 year ago

Hey @julietcohen, you are getting closer!! The reason you're not seeing the File exists... logging statement is because of the return tiles_morecantile line in the loop. That statement works like a return for the entire save_tiles method, so nothing below that runs. You can just delete that line, the tiles_morecantile list gets updated with append, you don't need to return it from the loop.

julietcohen commented 1 year ago

Oh great! Thanks Robyn 👍🏼

julietcohen commented 1 year ago

Resolution

The following version of save_tiles() seems to do what we want:

save_tiles() in TileStager.py

``` def save_tiles(self, gdf=None): """ Given a processed GeoDataFrame, save vectors into tiled vector files. Parameters ---------- gdf : GeoDataFrame The GeoDataFrame to save """ if gdf is None: # TODO give warning return None for tile, data in gdf.groupby(self.props['tile']): # Get the tile path tile_path = self.tiles.path_from_tile(tile, base_dir='staged') self.tiles.create_dirs(tile_path) # Lock the tile so that multiple processes don't try to write to lock = self.__lock_file(tile_path) # Track the start time, the tile, and the number of vectors start_time = datetime.now() logger.info( f'Saving {len(data.index)} vectors to tile {tile_path}' ) # Tile must be a string for saving as attribute data[self.props['tile']] = data[self.props['tile']].astype('str') tile_strings = data[self.props['tile']].astype('str') # maybe change this to: tile_strings = data[self.props['centroid_tile']].astype('str') data[self.props['centroid_tile']] = tile_strings # check the class of tile column was converted to string logger.info(f"tile column is {data[self.props['tile']]} with class {data[self.props['tile']].dtypes}. " f"centroid_tile column is {data[self.props['centroid_tile']]} with class {data[self.props['centroid_tile']].dtypes}") # Open the file in write mode by default mode = 'w' if os.path.isfile(tile_path): # If the file exists and config is set to deduplicate, then # open the file, append the new data, and identify duplicates. # Remove the data if the config is set to remove duplicates # during staging. Overwrite existing file. # count how many times the file exists logger.info('File exists, so starting deduplication if config is set to deduplicate at `staging`') dedup_method = self.config.get_deduplication_method() if dedup_method is not None: data = self.combine_and_deduplicate(data, tile_path) # If the file exists and config is not set to deduplicate, then # just append the new data to the existing file. else: mode = 'a' try: # Ignore the FutureWarning raised from geopandas issue 2347 with warnings.catch_warnings(): warnings.simplefilter('ignore', FutureWarning) data.to_file(tile_path, mode=mode) # convert each tile from string format to morecantile format # so it can be added to summary # first create series of tiles in str format tiles_str = data[self.props['tile']].copy() tiles_morecantile = [] for tile in tiles_str: tile_morecantile = self.tiles.tile_from_str(tile) tiles_morecantile.append(tile_morecantile) logger.info(f'tiles_str series is {tiles_str} and tiles_morecantile series is {tiles_morecantile}') # Record what was saved data[self.props['tile']] = tiles_morecantile self.summarize(data) finally: # Track the end time, the total time, and the number of vectors logger.info( f'Saved {tile_path} in {datetime.now() - start_time}' ) self.__release_file(lock) ```

850 deduplication events were completed according to log.log ✅
in total 19088 .gpkg files were staged ✅
staging_summary.csv was written without error ✅

Note

The only unexpected thing I noticed was that while the tiles and centorid_tile column must be converted to strings to save as attributes, a check for the contents and class of the tiles column after we convert them to strings using this line: data[self.props['tile']] = data[self.props['tile']].astype('str') does not seem to show that these were converted to strings, but rather shows that it they are still morecantile objects:

tile column is 10076    Tile(x=475, y=241, z=11)
Name: staging_tile, dtype: object with class object. centroid_tile column is 10076    Tile(x=475, y=241, z=11)
Name: staging_centroid_tile, dtype: object with class object

Similarly, the check later on in save_tiles() for the contents and class of the series tiles_str when it is assigned after deduplication shows it was not string before my loop, but rather a morecantile object both before and after the loop:

tiles_str series is 6873    Tile(x=475, y=238, z=11)
Name: staging_tile, dtype: object and tiles_morecantile series is [Tile(x=475, y=238, z=11)]

So I tested whether these conversions were necessary by commenting out the initial conversion to string and commenting out the loop. The result was same class errors as before, so it seems these conversion steps are indeed necessary and the logged statements imply the code isn't doing its job when really it is. 🤷🏻‍♀️

PermafrostDiscoveryGateway / viz-staging

Index error during summarize step of staging #12

Reproducing staging error versus reproducing successful staging with different data inputs

Troubleshooting Update

Error persists:

Error does not occur:

First Check

Results

Second Check

Results

Does the source of the error occur within combine_and_deduplicate() ?

Review

Inserted "intermediate" logging

Proposed solution: move assignment of tiles series to post-deduplication

Attempted solution to indexing error results in staging only 3 files with silent error / block

Resolution

Note

Does the source of the error occur within `combine_and_deduplicate()` ?