datajoint / datajoint-python

Relational data pipelines for the science lab
https://datajoint.com/docs
GNU Lesser General Public License v2.1

ExternalTable.unused() inconsistent with ExternalTable.used() #1000

Open simon-ball opened 2 years ago

simon-ball commented 2 years ago

Bug Report

In an established pipeline, the ExternalTables cannot be cleaned up because .unused() returns zero rows, even in cases where I know categorically that at least one row (and in practice, more than 10 million rows) should be deleted.

Description

I have a group of schemas making use of a single S3-type store (object storage on a private OpenStack instance).

The store contains approximately 40 million objects (1.5TB), of which a considerable number (my rough estimate is about 10 million) are unused. Some of those objects are still recorded in the ExternalTables for the schemas. I also know that some completely orphaned objects exist, which are not recorded in the ExternalTable of any schema, but they fall outside the scope of this bug.

The DataJoint docs point to the following command to clean up the ExternalTable and the external file store:

schema.external['external_raw'].delete(delete_external_files=False)

This, based on my reading of the source code, itself makes use of the call:

schema.external['external_raw'].unused()

However, this command always returns exactly zero rows, which is definitely incorrect, as I have spent the past several days deleting a non-trivial amount of data from two of these schemas:

for module in (tracking, ephys, analysis):
    print(len(module.schema.external["ephys_store"].unused()))
>>> 0
>>> 0
>>> 0

If, instead, I compare the opposite command, .used(), to the total length of the table, it tells a very different story. Here I would expect total to equal the sum of used and unused; diff expresses how far the actual result diverges from that expectation. It should always be zero, but in practice it is not.

for module in (tracking, ephys, analysis):
    a = module.schema.external["ephys_store"]
    used = len(a.used())
    unused = len(a.unused())
    total = len(a)
    diff = (total-used) - unused
    print("schema {} | total {} | used {} | 'unused' {}  | diff {}".format(module.schema.database, total, used, unused, diff))
>>>schema group_shared_tracking | total 17304 | used 15681 | 'unused' 0  | diff 1623
>>>schema group_shared_ephys | total 8348702 | used 6567742 | 'unused' 0  | diff 1780960
>>>schema group_shared_analysis | total 27542256 | used 16040135 | 'unused' 0  | diff 11502121

Note that unused remains zero, even though total is not equal to used.

I can't offer any suggestion as to why: the source code appears superficially correct, i.e. used and unused respectively include and exclude the same restriction (see the rough sketch below). It just doesn't work.
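
For reference, here is roughly what the two methods look like in the source as I read it. This is a paraphrase from memory, not an exact excerpt; the attribute and column names (references, referencing_table, column_name) may differ between versions. delete() then operates on the result of unused(), which is why it removes nothing here.

from datajoint.table import FreeTable

def used(self):
    # entries whose hash is referenced by at least one downstream table/column
    return self & [
        FreeTable(self.connection, ref["referencing_table"]).proj(hash=ref["column_name"])
        for ref in self.references
    ]

def unused(self):
    # the same restriction, applied as an exclusion instead of an inclusion
    return self - [
        FreeTable(self.connection, ref["referencing_table"]).proj(hash=ref["column_name"])
        for ref in self.references
    ]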

Reproducibility

This gets tricky. The problem is perfectly reproducible against our production database server, across multiple operating systems and multiple versions of DataJoint.

However, I cannot concoct a toy example that shows the same behaviour. The code below works exactly as it should work, with .unused() returning the expected number, and .delete() removing the correct number.

So this may be a historical anomaly (although running the example below against 0.12.4 hasn't reproduced it), or it may be an issue of scale (I don't have enough spare storage space to reproduce 1TB of random test data), neither of which may be amenable to a fix.

import datajoint as dj
import numpy as np

schema = dj.schema("user_simoba_test_deletion")

@schema
class BlobTable(dj.Manual):
    definition = """
    id : int
    ---
    blob_1 : blob@ephys_store
    blob_2 : blob@ephys_store
    blob_3 : blob@ephys_store
    """

def add_data(n, seed=1):
    rng = np.random.default_rng(seed)
    BlobTable.insert(
        [
            dict(
                id=i,
                blob_1=rng.random(50),
                blob_2=rng.random(60),
                blob_3=rng.random(70),
            ) for i in range(n)
        ]
    )
    return

N = 10
ext = schema.external["ephys_store"]

# Check that everything is as expected after initialising the schema and adding N rows: there should be 3N blobs, and all of them should be in use
add_data(N, 1)
assert len(ext) == N*3
assert len(ext.used()) == N*3
assert len(ext.unused()) == 0

# Now we delete Z of those entries
# Then, there should still be 3N blobs, but 3Z of them should be unused
Z = 4
with dj.config(safemode=False):
    (BlobTable & [{"id": i} for i in range(Z)]).delete()
assert len(ext) == N*3
assert len(ext.used()) == 3*N - 3*Z
assert len(ext.unused()) == 3*Z # in production, this raises an error. In this toy example, it does _not_ raise an error, which is the correct behaviour
ext.delete(delete_external_files=True)
assert len(ext) == 3*N - 3*Z

# Cleanup
with dj.config(safemode=False):
    BlobTable.delete()
ext.delete(delete_external_files=True)
schema.drop()


horsto commented 1 year ago

I am running into the same issue on my side: .unused() is empty, and more than 200,000 unreferenced objects are not being deleted / cleared.

Using @simon-ball's code above I am getting this:

[screenshot]

schema.external["octo_store"].delete(delete_external_files=True) returns nothing and finishes in <3 seconds.

horsto commented 1 year ago

With help from @ttngu207, I have stitched together this script to get rid of unused files both in S3 and in the external table entries: https://gist.github.com/horsto/4826a58837ef37a7f8c4da94981973b8
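
For anyone who cannot use the gist directly, the overall shape of such a manual cleanup is roughly as follows. This is only a sketch of the idea, not the gist itself: the schema and store names, the S3 endpoint and bucket, and the unused_hashes query and object_key_for() helper are placeholders that have to be filled in for your own setup.

import datajoint as dj
import boto3

schema = dj.schema("group_shared_ephys")    # placeholder schema name
ext = schema.external["ephys_store"]        # placeholder store name

# Placeholder: hashes present in ext that no downstream table references,
# computed with whatever query you trust (since unused() returns nothing here).
unused_hashes = []

# Placeholder S3 endpoint and bucket backing the external store.
s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
bucket = "my-bucket"

def object_key_for(h):
    # Placeholder: map an external-table hash to the object key used by the store.
    raise NotImplementedError

for h in unused_hashes:
    s3.delete_object(Bucket=bucket, Key=object_key_for(h))    # remove the object from S3
    (ext & {"hash": h}).delete_quick()                        # remove the tracking entry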