esmero / ami

Archipelago Multi Importer. A module of mass ingest made for the masses
GNU Affero General Public License v3.0

ISSUE-8: Add CSV reporting + Skip on missing file(s) #123

Closed. DiegoPino closed this issue 2 years ago

DiegoPino commented 2 years ago

See #8

This involves adding CSV reporting and skipping on missing file(s).

@karomabiles @alliomeria keeping you in the loop. This is my first pass, will finish ~tomorrow~ someday. Should have done this sooner!

ps: awwwwwwwww

DiegoPino commented 2 years ago

Still working on this. A few things I need to figure out. For AMI sets coming from s3://, in some cases (can't reproduce yet) the S3 file is correctly moved from one prefix into the final destination (desired, since we can't keep files e.g. in s3://uploads), but it seems that some type of (enqueued?) processing leaves some files as "permanent" (because we need them around) and fails when connecting them from S3 into a File entity to make them "temporary", which is the current trigger to actually execute the move at the SBF module level (the File Persister Service). Need to research more, but to fix this I will also add a VBO action that allows File movement/fixing and even renaming post-mortem.
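
For context, a minimal sketch of the status flip that drives this (illustrative only, not the actual AMI queue code; $fid is a placeholder):

use Drupal\file\Entity\File;

// Illustrative sketch: the File Persister Service keys off the
// temporary/permanent status of a File entity. A file left stuck as
// "permanent" never re-enters the persist/move pipeline.
$file = File::load($fid);
if ($file) {
  // Flag it temporary so the persister is allowed to move it...
  $file->setTemporary();
  $file->save();
  // ...and flip it back to permanent once it sits in its final
  // destination, so cron's temporary-file cleanup never deletes it.
  $file->setPermanent();
  $file->save();
}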

DiegoPino commented 2 years ago

Will use your language or something very close. Thx!

DiegoPino commented 2 years ago

@patdunlavey could you give this a look and let me know if you see anything breaking your current needs? I have to totally refactor the way the AMI Ingest Queue worker does the file fetching to allow for a safer cache. I want to be sure this does not break your functionality.

One of the many reasons why I will stick with the idea of Archipelago managing files is this: if you keep a source file in place in s3:// and then delete the ingest via AMI, the file entity will automatically become "temporary" and Drupal will eventually delete the File and the original path ... basically making your file disappear ...
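
For context, the cleanup half of this is plain Drupal core behavior: on cron, any temporary file older than system.file's temporary_maximum_age is removed, both the entity and the bytes it points to. A rough paraphrase of that collection step (illustrative, not a verbatim core excerpt):

// Rough paraphrase of core's temporary-file garbage collection on cron.
// Once the File entity is "temporary", both the entity and the actual
// file at its URI (e.g. in s3://) are gone after the cutoff.
$age = \Drupal::config('system.file')->get('temporary_maximum_age');
$fids = \Drupal::entityQuery('file')
  ->accessCheck(FALSE)
  ->condition('status', 0)  // 0 = temporary, 1 = permanent.
  ->condition('changed', \Drupal::time()->getRequestTime() - $age, '<')
  ->execute();
foreach (\Drupal\file\Entity\File::loadMultiple($fids) as $file) {
  $file->delete();
}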

Thanks

DiegoPino commented 2 years ago

@alliomeria I still believe this is very good language:

"If enabled all files referenced in this AMI set will be copied into an Archipelago managed location and sanitized as defined in this repository’s configured Storage Scheme for Persisting Files. If disabled those files will maintain their original location as defined in your file source column(s) and it will be up to your administrator to ensure they are not removed from there."

Maybe we can remove the "if enabled" part now that I made it a super hidden feature, and make it part of the docs?

aryalsujay commented 2 years ago

Sharing our perspective here. Mostly we will go with the Archipelago file persistence strategy. Just for some parts (let me explain the "some parts"): for original files that are already managed and renamed locally, we want them to stay in the place they already are (this will be for internal use only, and only concerns the location part). For files that will be made public, where users can interact, edit, and add their own edited files, which will cover more than 60-70% of the content, we will go with the Archipelago file persistence strategy.

Many things have already been clarified in Slack. Thanks to @DiegoPino & @alliomeria.

patdunlavey commented 2 years ago

@DiegoPino I'm not 100% sure what this means: "One of the many reasons why I will stick with the idea of Archipelago managing files is because if you keep a Source File in place in S3://, then delete the ingest via AMI, the file entity will become automatically "temporary" and Drupal will eventually delete the File and the original path...". I think you're describing why Archipelago will handle file management, and not use Drupal's managed file system. Is that correct?

DiegoPino commented 2 years ago

@patdunlavey good morning:

This is what I mean:

Ways of avoiding this are: to tap into/intercept the File entity's preDelete function and add somewhere a very complex logic saying "hey, don't touch this". But then we need the inverse logic to actually clean up... so how do we define what can be deleted and what cannot?

// Drupal core's \Drupal\file\Entity\File::preDelete(): deleting a File
// entity also removes its usage records and the actual file at its URI.
public static function preDelete(EntityStorageInterface $storage, array $entities) {
  parent::preDelete($storage, $entities);

  foreach ($entities as $entity) {
    // Delete all remaining references to this file.
    $file_usage = \Drupal::service('file.usage')->listUsage($entity);
    if (!empty($file_usage)) {
      foreach ($file_usage as $module => $usage) {
        \Drupal::service('file.usage')->delete($entity, $module);
      }
    }
    // Delete the actual file. Failures due to invalid files and files that
    // were already deleted are logged to watchdog but ignored, the
    // corresponding file entity will be deleted.
    try {
      \Drupal::service('file_system')->delete($entity->getFileUri());
    }
    catch (FileException $e) {
      // Ignore and continue.
    }
  }
}

A second option would be to never let Drupal delete files at all, and have our own cleaning mechanisms.

By letting Archipelago manage the storage, files do get deleted when you delete an ADO (of course we need to clean up a bit), but you will never delete a file from its source.

Hope this makes sense. This is not a "use case", more a reality, so I'm happy to hear about use cases. Basically the idea is: copying/downloading a source file is safe. Using it from the same place might lead to deletion of the source file, which might not be "clear" to a user.

patdunlavey commented 2 years ago

Thanks @DiegoPino. I'm still not sure if you are proposing (or suggesting the possibility of) not-yet-developed changes to Archipelago: moving from the Drupal managed-file system, where files associated with deleted objects get marked for deletion and are cleaned out on cron, to an "Archipelago-managed file system", where we apply special logic to determine whether deleting an object should also cause its associated files to be deleted.

I am probably missing something, or misunderstanding (first time ever!!), so ignore this if it doesn't make sense... The problem is not that we're using Drupal's managed file system. It's that deleting an ADO necessarily causes associated managed files to be marked for deletion. Can't we include, as part of an as:[file type] array, some bit of metadata that lets us determine whether deleting the ADO should cause the file to be marked for deletion? Or perhaps the as:generator section could list the originating file paths for the files, and during the ADO delete process we could check to see if they're the same; if they are, don't mark for deletion?

Just some ideas!
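
To make that second idea concrete, a hypothetical sketch (the "ami:original_path" key and the helper name below are invented for illustration, not existing Strawberryfield metadata):

use Drupal\file\FileInterface;

/**
 * Hypothetical: keep files permanent when they still live at the path
 * recorded for them at ingest time, so deleting the ADO does not also
 * mark the source bytes for deletion.
 */
function _mymodule_protect_in_place_files(array $files, array $metadata): void {
  /** @var FileInterface $file */
  foreach ($files as $file) {
    $original = $metadata['ami:original_path'][$file->uuid()] ?? NULL;
    if ($original !== NULL && $original === $file->getFileUri()) {
      // The File entity still points at its ingest source: leave it
      // permanent so cron's temporary-file cleanup never touches it.
      $file->setPermanent();
      $file->save();
    }
  }
}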

patdunlavey commented 2 years ago

Maybe a crazy idea... I recall an idea that we discussed eons ago, which is to protect files from being deleted by creating a "pseudo-file-usage" when the associated object is deleted:

/** @var \Drupal\file\FileUsage\DatabaseFileUsageBackend $file_usage */
$file_usage = \Drupal::service('file.usage');
// Register a placeholder usage so the file never becomes "temporary".
$file_usage->add($file, 'strawberryfield', '', 0);

The 3rd and 4th parameters of the add() method are type and ID, and I believe that they can be any string, i.e. not an actual entity.

So this would run in the node delete hook, presumably, utilizing something like the logic I described in the previous comment to decide when it needs to create the file_usage entry. On the ingest side, we would probably need a thing that removes the fake file_usage when the real file_usage is (re)created.
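
The removal side might look something like this (a sketch only; the '' and 0 just need to match whatever was passed to add()):

/** @var \Drupal\file\FileUsage\DatabaseFileUsageBackend $file_usage */
$file_usage = \Drupal::service('file.usage');
// Sketch: once a real usage exists again for $file, drop the
// placeholder "strawberryfield" usage so normal cleanup applies.
$usages = $file_usage->listUsage($file);
unset($usages['strawberryfield']);
if (!empty($usages)) {
  $file_usage->delete($file, 'strawberryfield', '', 0);
}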

DiegoPino commented 2 years ago

@patdunlavey the issue with protecting files is how to clean them up. What are the criteria for protecting a file, and when can it be released?

alliomeria commented 2 years ago

@DiegoPino, this is all so great and very helpful to have on hand! The Reports tab for the different sets/tests I've gone through in my local is very useful to refer to (all updating/syncing nicely), breadcrumbs for the AMI sets are awesome (yay!), the dropdown in the AMI sets page to go right to the Report 😎, the Status updates in the AMI sets page--so many good things! 🌟🙏 Will come back for deeper testing and share comments if I encounter anything I'm unsure about. So far, the essentials and new goodies are so good! 😄

DiegoPino commented 2 years ago

Thanks @alliomeria happy to provide more tiny tools and some potted flowers for the metadata garden