GeoscienceAustralia / agdc

Repository for Australian Geoscience Data Cube (AGDC) code
BSD 3-Clause "New" or "Revised" License

Issues while updating a dataset (ingesting again updated data files) #80

Open didierearith opened 9 years ago

didierearith commented 9 years ago

Hi AGDC Team,

While testing WOfS ingestion, I found an issue.

I downloaded some WOfS files from http://dapds00.nci.org.au/thredds/catalog/fk4/wofs/current/extents into a directory on my machine, then ran the ingest command for the first time, e.g. agdc/ingest/wofs.py --source /home/adminprod/data1/rs0/tiles/wofs/

The data files were ingested successfully.

Next I wanted to test re-ingesting data that already exists in the Data Cube (the case where source files have been updated and the Data Cube needs to be updated to match).

To simulate this, I changed the modification time of the source files with the Linux 'touch' command.

The datetime of each data file is now later than the datetime of the corresponding dataset in the database.
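The same timestamp bump can be scripted, which may help when repeating the test. This is only a minimal sketch: the `touch` helper and the throwaway `.tif` file are illustrative and are not part of AGDC.

```python
import os
import tempfile
import time


def touch(path, when=None):
    """Set access and modification times of an existing file, like `touch`."""
    when = time.time() if when is None else when
    os.utime(path, (when, when))


# Work on a throwaway file rather than on real tiles.
with tempfile.NamedTemporaryFile(suffix=".tif", delete=False) as handle:
    demo_path = handle.name

# Pretend the file was written an hour ago (as at first ingestion).
touch(demo_path, time.time() - 3600)
before = os.path.getmtime(demo_path)

# Bump the modification time to "now", as the `touch` command would.
touch(demo_path)
after = os.path.getmtime(demo_path)

os.remove(demo_path)
```

Only the timestamps change here; the file contents are left untouched, which matches what 'touch' does to the source files.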

I ran agdc/ingest/wofs.py --source /home/adminprod/data1/rs0/tiles/wofs/ again and got the following exception:

    2015-08-04 11:56:02,123 agdc.ingest.tile_contents INFO Tile already in place: '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER115-035_2011-01-10T01-59-19.155557.tif'
    2015-08-04 11:56:02,217 agdc.ingest._core INFO Ingestion complete for dataset '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER115-035_2011-01-10T01-59-19.155557.tif' in 0:00:00.197192.
    Traceback (most recent call last):
      File "/home/adminprod/agdc-develop/agdc/ingest/wofs.py", line 97, in <module>
        agdc.ingest.run_ingest(WofsIngester)
      File "/home/adminprod/agdc-develop/agdc/ingest/_core.py", line 586, in run_ingest
        ingester.ingest(ingester.args.source_dir)
      File "/home/adminprod/agdc-develop/agdc/ingest/_core.py", line 186, in ingest
        self.ingest_individual_dataset(dataset_path)
      File "/home/adminprod/agdc-develop/agdc/ingest/_core.py", line 207, in ingest_individual_dataset
        self.tile(dataset_record, dataset)
      File "/home/adminprod/agdc-develop/agdc/ingest/pretiled.py", line 312, in tile
        dataset_record.store_tiles([tile_contents])
      File "/home/adminprod/agdc-develop/agdc/ingest/dataset_record.py", line 238, in store_tiles
        return [self.create_tile_record(tile_contents) for tile_contents in tile_list]
      File "/home/adminprod/agdc-develop/agdc/ingest/dataset_record.py", line 320, in create_tile_record
        size_mb=tile_contents.get_output_size_mb(),
      File "/home/adminprod/agdc-develop/agdc/ingest/tile_contents.py", line 174, in get_output_size_mb
        return get_file_size_mb(path)
      File "/home/adminprod/agdc-develop/agdc/cube_util.py", line 109, in get_file_size_mb
        return os.path.getsize(path) // (1024 * 1024)
      File "/usr/lib/python2.7/genericpath.py", line 49, in getsize
        return os.stat(filename).st_size
    OSError: [Errno 2] No such file or directory: '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER115-035_2011-02-27T01-59-34.560472.tif'
    2015-08-04 11:56:02,352 agdc.ingest._core ERROR Unexpected error during path '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER115-035_2011-02-27T01-59-34.560472.tif'

After some investigation, I think the cause is that the data file is removed in the '__commit' function of the 'collection.py' module, i.e.:

    # Remove tile files just after the commit, to avoid removing
    # tile files when the deletion of a tile record has been rolled
    # back. Again, tile files without records are possible if there
    # is an exception or crash just after the commit.
    #
    # The tile remove list is filtered against the tile create list
    # to avoid removing a file that has just been re-created. It is
    # a bad idea to overwrite a tile file in this way (in a single
    # transaction), because it will be overwritten just before the
    # commit (above) and the wrong file will be in place if the
    # transaction is rolled back.

    tile_create_set = {t.get_output_path()
                       for t in self.tile_create_list}
    for tile_pathname in self.tile_remove_list:
        if tile_pathname not in tile_create_set:
            if os.path.isfile(tile_pathname):
                os.remove(tile_pathname)

To be able to re-ingest the updated source files, I commented out the 'os.remove' call above.
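To illustrate the suspected failure mode outside AGDC, here is a minimal sketch of the same filtering pattern. It is not the AGDC code: `cleanup_after_commit` and the file names are hypothetical stand-ins. It assumes, as the "Tile already in place" log line suggests, that for a pre-tiled ingest the tile path and the source file are the same file.

```python
import errno
import os
import tempfile


def cleanup_after_commit(tile_remove_list, tile_create_paths):
    """Hypothetical stand-in for the post-commit cleanup quoted above."""
    create_set = set(tile_create_paths)
    removed = []
    for tile_pathname in tile_remove_list:
        # Skip files that were re-created in the same transaction.
        if tile_pathname not in create_set and os.path.isfile(tile_pathname):
            os.remove(tile_pathname)
            removed.append(tile_pathname)
    return removed


workdir = tempfile.mkdtemp()
tile_a = os.path.join(workdir, "tile_a.tif")
tile_b = os.path.join(workdir, "tile_b.tif")
for path in (tile_a, tile_b):
    with open(path, "wb") as handle:
        handle.write(b"\0" * 1024)

# tile_a is on both lists, so its file survives.
# tile_b is only on the remove list, so its file is deleted.
removed = cleanup_after_commit([tile_a, tile_b], [tile_a])

# A later size lookup on the deleted file fails like the traceback above.
try:
    os.path.getsize(tile_b)
    missing_errno = None
except OSError as error:
    missing_errno = error.errno

os.remove(tile_a)
os.rmdir(workdir)
```

If the deleted path is also an input that a later step still expects to read, the `OSError: [Errno 2]` raised by `os.path.getsize` is the expected result.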

Note: if the source files have not been updated (i.e. the date of the source file equals the date of the dataset in the database), the issue does not occur.

Note: if I run the ingestion again, the issue does not always occur on the same file: sometimes on the first file, sometimes on the nth file.

jeremyh commented 9 years ago

Thanks Didier.

We hit this bug last week ourselves in the development code – the overlap cleaner identified the second tile as redundant, which for other ingesters implies tile removal, and this was incorrectly running during WOfS ingestion. The WOfS ingester should be runnable with read-only access to its inputs (which is how we're running it), so any file modification is a serious bug.

Try updating to the latest version of the develop branch and retesting.
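When retesting, one way to confirm that ingestion leaves its inputs alone is to snapshot the source directory before and after the run and compare. The sketch below is only an illustration: `snapshot` and `diff_snapshots` are hypothetical helpers, not part of AGDC, and the deletion in the middle simulates the bug in place of a real ingest run.

```python
import os
import tempfile


def snapshot(directory):
    """Map each file under `directory` to its (size, mtime)."""
    state = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            info = os.stat(path)
            state[path] = (info.st_size, info.st_mtime)
    return state


def diff_snapshots(before, after):
    """Return sorted lists of (missing, changed, added) paths."""
    missing = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(
        path for path in before
        if path in after and before[path] != after[path]
    )
    return missing, changed, added


source_dir = tempfile.mkdtemp()
kept = os.path.join(source_dir, "kept.tif")
lost = os.path.join(source_dir, "lost.tif")
for path in (kept, lost):
    with open(path, "wb") as handle:
        handle.write(b"\0" * 512)

before = snapshot(source_dir)
os.remove(lost)  # stands in for an ingest run that wrongly deletes an input
after = snapshot(source_dir)

missing, changed, added = diff_snapshots(before, after)

os.remove(kept)
os.rmdir(source_dir)
```

An ingest that only reads its inputs should leave all three lists empty.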