WIPACrepo / file_catalog

Store file metadata information in a file catalog

Add Support for Transient Files #79

Open · ric-evans opened this issue 3 years ago

ric-evans commented 3 years ago

Off the bat, this might mean eliminating the "logical_name" field.

Current Relevant Scenarios:

  1. When an actual file is moved, the corresponding FC record's "locations" object is manually updated. However, the "logical_name" retains the original path.
  2. When an actual file is deleted, the FC record is not deleted. Should it be?

Proposal:

  1. Remove the "logical_name" field. This is at best redundant, and at worst a red herring.
  2. Add an "active" field/flag to each "locations" object-entry: "active": True indicates the filepath is still valid.
  3. Add a service to regularly check up on FC records
    • This could either be an active service on a server;
    • or a passive service tied into the FC REST server that updates FC records only when a filepath is requested (requires access to Lustre); see the sketch below
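
A minimal sketch of the passive variant, assuming a pymongo `files` collection, a `uuid` primary key, and a host that can stat the filesystem (all names illustrative, not the FC's actual API):

import os
import time

def verify_locations(files, record):
    # Re-check each locally visible location and record the result.
    for i, loc in enumerate(record["locations"]):
        if loc["site"] != "WIPAC":  # can only stat paths mounted on this host
            continue
        files.update_one(
            {"uuid": record["uuid"]},
            {"$set": {
                "locations.%d.active" % i: os.path.exists(loc["path"]),
                "locations.%d.verified-timestamp" % i: time.time(),
            }},
        )
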
dsschult commented 3 years ago

We've gone back and forth about whether to keep deleted file records. I think there are three options:

  1. Keep everything forever
  2. Delete locations, but keep the metadata
  3. Delete the record if there are no remaining locations

I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

ric-evans commented 3 years ago
> 3. Delete the record if there are no remaining locations
>
> I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

The issue I see with outright deleting the record is if the file was deleted by mistake, or lost. There should be a way to get the metadata back without having to re-index the file, or alternatively be able to use the FC record to verify things. Adding a "verified timestamp" (along with an "active" flag) to each location could serve as a sparse audit log. Then we can have a service that deletes records that are "active": False and have a "verified timestamp" that is "old enough".
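
For illustration, the deletion service's core query might look like this (a sketch assuming pymongo, epoch-seconds timestamps, and the field names proposed above):

import time

GRACE_PERIOD = 180 * 24 * 3600  # "old enough", e.g. 180 days; tunable

def purge_stale_records(files):
    # Delete records in which no location is active and none was
    # verified more recently than the grace period.
    cutoff = time.time() - GRACE_PERIOD
    return files.delete_many({
        "locations": {"$not": {"$elemMatch": {"$or": [
            {"active": True},
            {"verified-timestamp": {"$gt": cutoff}},
        ]}}},
    })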

ric-evans commented 3 years ago

> We've gone back and forth about whether to keep deleted file records. I think there are three options:
>
> 1. Keep everything forever
> 2. Delete locations, but keep the metadata
> 3. Delete the record if there are no remaining locations
>
> I am currently in favor of 2 or 3. I'd lean towards 3 right now, as it's hard to imagine why you'd want an entry with no locations. What we'd really be interested in is an audit log, if we wanted to undo things (or see who did things).

TL;DR from my comment above:

  1. Quarantine deleted-filepath entries w/ a timestamp (then potentially remedy at a later date)

jnbellinger commented 3 years ago

I'm seeing a situation in which files were deleted deliberately, but the record remains and is causing problems.

If these had been actual data files, it might be good to retain a little information on what was lost. I have seen files be accidentally renamed/moved(*). That's rare, but being able to say "The checksum matches file X that we thought was lost" is a possible reason to keep the record information around. But that's not the same thing as a file record anymore, so if we save the information at all we shouldn't call them file records. And I don't want to pick them up as some of the expected contents of a directory.

So I think I'm arguing for something like (2), but with the requirement that a query for directory information must exclude the missing files unless I specifically ask for them.

I agree that an audit is important.

(*) e.g., with a Lustre crash

dsschult commented 3 years ago

One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.
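
A sketch of that move, assuming pymongo and an `archive` collection alongside `files` (names illustrative):

def quarantine_record(db, uuid):
    # Copy to the archive first, then delete from the live collection,
    # so a crash in between leaves a duplicate rather than a lost record.
    record = db.files.find_one({"uuid": uuid})
    if record is not None:
        db.archive.insert_one(record)
        db.files.delete_one({"uuid": uuid})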

ric-evans commented 3 years ago

Responding to both of you:

@jnbellinger:

> So I think I'm arguing for something like (2), but with the requirement that a query for directory information must exclude the missing files unless I specifically ask for them.

@dsschult:

> One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.

I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.
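
In that scheme, a "normal" search just adds one condition, e.g. (pymongo sketch, field names as proposed above):

def find_active(files, query):
    # Restrict any user query to records with at least one
    # non-quarantined location.
    return files.find({"$and": [query, {"locations.active": True}]})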

dsschult commented 3 years ago

> > One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.
>
> I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.

Right now, with the way things get indexed, it would pay off in speed / not having to create even more indexes :smile:

ric-evans commented 3 years ago

> > > One option for the "quarantine and delete later" is an archive collection, which will remove the entry from normal searches but allow us to still see it.
> >
> > I'm not sure the overhead of an independent collection will pay off. We could keep the quarantined records/locations in the files collection, add a flag to each location, then filter these out for normal searches.
>
> Right now, with the way things get indexed, it would pay off in speed / not having to create even more indexes

True, I'm assuming we fix the scaling-speed problem.

I'm thinking about files that are sent to NERSC and are later deleted off lfs. How would we handle this scenario, where some locations are active and others are not?

jnbellinger commented 3 years ago

What do you mean "When not all the locations are active?" If we delete the PFRaw from lfs but have NERSC and/or the Pole disks, we still want to keep the file information active, right? We just remove one of the "pointers" associated with it. If we accidentally delete the PFRaw, NERSC melts down, and we have a fire in Chamberlin that roasts the Pole copies, we might as well change the file states--manually.

dsschult commented 3 years ago

Files that get sent to NERSC get another location entry for that. So even if we delete the file from UW, there's still a location entry for NERSC.

ric-evans commented 3 years ago

> What do you mean "When not all the locations are active?" If we delete the PFRaw from lfs but have NERSC and/or the Pole disks, we still want to keep the file information active, right? We just remove one of the "pointers" associated with it. If we accidentally delete the PFRaw, NERSC melts down, and we have a fire in Chamberlin that roasts the Pole copies, we might as well change the file states--manually.

I'm using "locations" to mean "pointers", potato potato. Here's what I'm proposing:

Current Record Schema:

{ 
    <other-fields>,
    "locations": [
        {
            "site": "WIPAC",
            "path": "/data/exp/IceCube/path/to/file",
        },
        {
            "site": "NERSC",
            "path": "nersc:path/to/file",
        },
    ],
    <other-fields>
}

Proposed Record Schema:

{
    <other-fields>,
    "locations": [
        {
            "site": "WIPAC",
            "path": "/data/exp/IceCube/path/to/file",
            "active": False, // AKA file was deleted from lfs
            "verified-timestamp": <timestamp>, // some time after the path was noticed to be invalid/deleted/etc.
        },
        {
            "site": "NERSC",
            "path": "nersc:path/to/file",
            "active": True,
            "verified-timestamp": <timestamp>,
        },
    ],
    <other-fields>
}

jnbellinger commented 3 years ago

If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

ric-evans commented 3 years ago

> If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

Currently that's not possible, but it will be very soon: https://github.com/WIPACrepo/file_catalog/issues/77

In my proposal, the query results would be filtered to only include records where "active" is True, unless otherwise indicated.
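
For example, a directory listing under the proposed schema might look like this (pymongo sketch; an anchored regex keeps an index on locations.path usable):

import re

def list_directory(files, dirpath):
    # Records with an active location under `dirpath`.
    prefix = "^" + re.escape(dirpath.rstrip("/")) + "/"
    return files.find({"locations": {"$elemMatch": {
        "path": {"$regex": prefix},
        "active": True,
    }}})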

ric-evans commented 3 years ago

> If I ask "What is in the directory WIPAC:/data/exp/IceCube/2025/unbiased/Gen2Raw/0231", how will the query know what files to return?

Technically speaking, the FC would have a MongoDB index over the "locations"-entry filepaths. This is a result of eliminating the "logical_name" field.

jnbellinger commented 3 years ago

Each "filepath" or the whole site+filepath?

ric-evans commented 3 years ago

Each "filepath"

A.K.A. whatever is in "locations" -> "path"

jnbellinger commented 3 years ago

I can easily imagine another site using the same sort of /data/exp/ path name that we do. OTOH, any index at all would expedite the search, even if we needed another clause to get rid of the MALMO files.

ric-evans commented 3 years ago

Fair point. The requester would need to include the "site" field in their request query, or do client-side filtering like you said.
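
A compound index and query along those lines (pymongo sketch; MongoDB permits a compound multikey index here because both fields live in the same array):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")  # connection details assumed
files = client["file_catalog"]["files"]

files.create_index([
    ("locations.site", pymongo.ASCENDING),
    ("locations.path", pymongo.ASCENDING),
])

# A directory query that includes the site, avoiding MALMO look-alikes:
files.find({"locations": {"$elemMatch": {
    "site": "WIPAC",
    "path": {"$regex": "^/data/exp/IceCube/"},
    "active": True,
}}})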

blinkdog commented 3 years ago

I think the File Catalog has had a bit of an identity crisis since it was created.

As David said, we've gone back and forth about what to do with deleted records. And I think this is less about what we should do, and rather more about not understanding what the File Catalog is (or is intended to be).

Here are two alternate visions for the File Catalog:

  1. Extended Filesystem Metadata (a canonical record of IceCube's data as it currently exists)

If a file has a record in the File Catalog we know that it is a file that we intend to spend resources to keep. We know the canonical identity of the file (checksum), we know the canonical place the file would appear in a Data Warehouse file system (logical_name), and we know where we have copies of the file (locations), either directly or as part of an archive.

Files that are not part of IceCube's data as it currently exists are not eligible for the File Catalog. Anything deleted or lost should be removed from the File Catalog. "You don't have to go home, but you can't stay here."; the record can live somewhere else (if desired) but the File Catalog proper is always and foremost about the present state of the (Extended) Data Warehouse.

  2. Oracle of File Metadata (the record of all we know about IceCube's data)

If a file has a record in the File Catalog we know that it was a file that we processed at one time.

Some data may be missing, including metadata about the file and even the file itself (if there are no known locations of the file). The File Catalog record reflects our best knowledge of the file, and serves as both a means to find active files and a means to remember the deleted and lost.

Of this File Catalog we can ask questions like "Has a file with any other checksum lived in the Data Warehouse at /path/to/file?" Or "Which version(s) of $logical_name are archived and where?" Or "Which files on retro media were lost when the container ship caught on fire three years ago?"

I think both of these visions are valid and useful for their intended purposes. However, there is a fundamental incompatibility between the two. We need to choose one. After that, I think questions about how to structure things and what to do will fall out almost naturally.

To be fair, I think the current use cases of the File Catalog fall into vision 1. LTA is concerned with what files need to go into archives, and updating those same files when the archives are known-good at the destination.

Consider this use case:

  1. We archive some data; let's call it Level Z
  2. Later, somebody realizes an obscure error had a rare effect on some data. Level Z gets the nod.
  3. We recall the Level Z, fix it, re-index it for bundling.
    3A. Locations says 'You can't have two files at /data/warehouse/path'; meaning the broken Level Z file (gone from the warehouse, but the record is still intact) and the fixed Level Z file (in the warehouse, but we can't make a record for it, because the broken one has that location, and duplicates are forbidden)
    3B. We remove the record for the broken Level Z because why would we want that. Fixed Level Z gets a new record that contains the new path.
  4. We bundle the fixed Level Z files up and send the archive to NERSC.
  5. Some time later, somebody queries the File Catalog for a specific time frame and finds a Level Z file (not the one we fixed, but maybe a directory and archive sibling) and sees the file lives in two archives at NERSC (one containing the broken sibling, one containing the fixed sibling), and recalls both.
  6. When the bundles come back, the checksum for the broken sibling fails. (File data checksum does not match the checksum from the Catalog record.)

The question on our minds: What is this file and why is the checksum not matching?

Is the File Catalog the service that should be answering that question? Does it tell us, "That's an older version of file X that was superseded by file Y?" Or does it tell us, "Well, it doesn't match what I have on file. Look it up in the oracle service maybe?"

dsschult commented 3 years ago

@blinkdog That's an excellent point. My vision is 1 (Extended Filesystem Metadata), except that I expand the data warehouse to "anything I can get the file back from." So I consider NERSC and DESY to be perfectly acceptable locations for files I care about.

In the overwrite example, I would first delete the local location from the old record, then add a new record with that location. It would be up to some cleanup operation to delete the NERSC archive, and finally delete the old record. Of course, this comes with a problem if you wanted unique logical names, as you would have two of them for a short time. But if this example is how we want things to appear, there are software solutions to make that happen.
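
A sketch of that sequence with pymongo (record layout as proposed above; the later cleanup steps are left out):

def replace_overwritten_file(files, old_uuid, new_record, site, path):
    # Step 1: pull the warehouse location off the superseded record.
    files.update_one(
        {"uuid": old_uuid},
        {"$pull": {"locations": {"site": site, "path": path}}},
    )
    # Step 2: insert the new record, which now claims that location.
    files.insert_one(new_record)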

jnbellinger commented 3 years ago

As a user who has been granted umpteen hours on a cluster in Istanbul, I want to know from where I can pull the data, and don't care particularly whether Madison has it or not. A Madison-centric system is too limited. So: Case 2 or David's extended Case 1

You're right, this demands some kind of version control in the FileCatalog and the retrieval procedures. It seems perfectly possible for some site to have an older level2 version, or a mix (replacement is still in progress). An analysis working on a 10-year study at some site may want to stay bug-compatible for the lifetime of the analysis--and not mix in newer data file versions.

jnbellinger commented 3 years ago

WRT Analysis Reproducibility: An analysis record should refer to what version of the data files it used.

ric-evans commented 3 years ago

@blinkdog That really does put this in perspective. I hadn't thought about files that are modified but remain at the same path.

I like @dsschult's Extended Case 1, let's call it Global Filesystem Metadata (FWIW I don't think @blinkdog restricted his original case 1 to Madison-only files).

WRT overwritten files, we could move the original record to a "graveyard" collection (where we don't care about duplicate paths). @dsschult proposed something similar earlier

TL;DR, 2 collections: (1) a collection for data files globally accessible as of today, and (2) a collection for data files no longer accessible, AKA the "graveyard".

ric-evans commented 3 years ago

In a radically different approach, we could only require unique checksums (the sole primary key), and keep everything forever.

jnbellinger commented 3 years ago

Empty files have the same checksum :-)

dsschult commented 3 years ago

Yeah, checksums aren't unique because of that issue (and other small files that would be the "same"). While technically the contents are identical, the metadata could be different.

ric-evans commented 3 years ago

too radical I suppose :laughing:

ric-evans commented 3 years ago

Another issue with this automation is that a file's metadata changes when a gaps file (or other auxiliary file in the same directory) is added/modified. The individual file's checksum remains the same, but it would still need to be re-indexed.

ric-evans commented 3 years ago

Further discussion, including new and relevant use cases, by @jnbellinger: https://docs.google.com/document/d/1DkzX5VDNTxmQUOofkdGdbfykZE6Dvu8VFShyCZ19_1I

ric-evans commented 3 years ago

This issue is spinning off into https://github.com/WIPACrepo/file_catalog/issues/109, which will create an interim solution.

Updates to follow.