DiegoPino commented 7 months ago

What?

EADs are complex, fixed realities (XML), sometimes with thousands of containers. So far (@alliomeria, is this true?) our approach of making each Container an ADO connected to its top-level one, plus a connection to its parent, has shown potential and good separation of concerns. But that also implies that a batch set might involve hundreds of thousands of rows. So, to avoid user errors, and since the data modeling opportunities here could overwhelm our community (and you all already have a generic CSV importer), the task at hand is:

Create an opinionated (like me) but kind (like you) plugin that takes either A) a CSV generated by another well-formed script (which you run locally) or, in pass 2 of this issue, B) a ZIP file with XML files (yes, a ZIP; not remote, not anything else).
For the former, you can have nested CSVs. Each row of the main CSV can contain a column referencing another CSV, which needs to be provided as a file either in S3://, inside a ZIP://, or at a URL:// (the same as you would with, e.g., an image attached to an ADO).
A mapping to select which template to run for the main CSV rows and which template to run for the per-row CSV rows (so 2 templates max).
A mapping to select which bundle to use for the main rows and for the per-row CSV rows. By default, if our custom Drupal content type (which will ship with a coming release) is loaded/created/present, no selection will be available (same bundle).
Types are fixed (so much control!).
A new queue worker whose only purpose is to read a CSV and push future items into the queue. This new queue worker can/will also be available to the other plugins, allowing (new feature) users to avoid waiting while the "form" processes each CSV row, basically delegating that very action to a queue item. For the EAD itself, this will push into the queue, after every EAD ADO queue item, one item for the child CSV, generating this pseudo structure:
```
The queue = ['Process Main CSV'] -> runs
The queue = ['Ingest first row of Main CSV', "Process first row's CSV", 'Ingest second row of Main CSV', ...] -> runs
The queue = ['Ingest first row of Main CSV', "Ingest first child from first row's CSV", "Ingest second child from first row's CSV", 'Ingest second row of Main CSV', ...] -> runs
```
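The expansion above can be simulated with a small sketch. This is not the AMI queue worker; it is an in-memory illustration that assumes hypothetical item kinds (`process_main_csv`, `process_child_csv`, `ingest`) and sample data, and it uses a double-ended queue so that expanded items land directly after the item that produced them. How the real Drupal queue would guarantee that ordering is not stated in this issue.

```python
from collections import deque

# Hypothetical stand-ins for the main CSV and the per-row child CSVs.
MAIN_CSV = [
    {"label": "EAD 1", "child_csv": "ead1_containers.csv"},
    {"label": "EAD 2", "child_csv": None},
]
CHILD_CSVS = {
    "ead1_containers.csv": [{"label": "Box 1"}, {"label": "Box 2"}],
}


def run(queue):
    """Drain the queue and return the order in which rows were ingested."""
    ingested = []
    while queue:
        kind, payload = queue.popleft()
        if kind == "process_main_csv":
            # Each main row becomes an ingest item, followed directly by an
            # item that will later expand that row's child CSV (if any).
            items = []
            for row in payload:
                items.append(("ingest", row["label"]))
                if row["child_csv"]:
                    items.append(("process_child_csv", row["child_csv"]))
            queue.extendleft(reversed(items))
        elif kind == "process_child_csv":
            # Expand the child CSV in place, ahead of the next main row.
            children = [("ingest", c["label"]) for c in CHILD_CSVS[payload]]
            queue.extendleft(reversed(children))
        else:
            ingested.append(payload)
    return ingested


order = run(deque([("process_main_csv", MAIN_CSV)]))
print(order)
```

The point of the design is that reading a CSV is itself a queue item, so nothing waits on a form submission while rows are being read.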

DiegoPino commented 7 months ago

A few notes for @alliomeria

Because of the nesting, we can't allow nested CSVs to lack UUIDs. The UUIDs need to be pre-set; if not, this becomes a pre-processing nightmare. The only other option I can see here (as a setting) is to use a "seed", so that each ADO coming from a nested CSV gets an automatic UUID that is always consistent with, for example, the parent's UUID plus the row number (the UUID V5 thing I mentioned before).
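As a rough illustration of the "seed" idea, the sketch below derives a child UUID from the parent's UUID and the row number using Python's standard `uuid.uuid5`. The helper name `child_uuid` and the sample parent UUID are made up for this example; the actual plugin may seed differently.

```python
import uuid


def child_uuid(parent_uuid: str, row_number: int) -> str:
    """Derive a repeatable UUID for a nested CSV row (hypothetical helper)."""
    namespace = uuid.UUID(parent_uuid)  # the parent's UUID acts as the seed
    return str(uuid.uuid5(namespace, str(row_number)))


# Arbitrary example value, not a real ADO.
parent = "2f1c9a3e-5b7d-4c8e-9a1f-3d6b8e0c4a72"

first_run = child_uuid(parent, 1)
second_run = child_uuid(parent, 1)
print(first_run == second_run)              # same parent + same row: same UUID
print(first_run == child_uuid(parent, 2))   # a different row: a different UUID
```

One consequence of seeding by row number is that reordering rows in the nested CSV changes the generated UUIDs, so a stable identifier column would be a safer seed where one exists.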

This is a lot of work. But I think we can't skip this functionality.

alliomeria commented 7 months ago

Hello @DiegoPino !

Understood about the need for pre-setting the UUIDs and the potential pre-processing nightmare.

Going to put out a few thoughts here; I hope some of this is helpful for your consideration.

As we've been discussing this, I've been thinking of it as a process that could/would only unfold in a particular order of operations, such as:

1.) Wait until all Parents (EADs) are fully ingested.

2.) Check to see if the Parent was ingested.

2a.) Provide an intermediate AMI Set Output Report: all Parents processed successfully, children enqueued --> separate Report tab for children?

From your above Issue comment:

> For the former, You can have nested CSVs. The main CSV can contain in each ROW a column referencing another CSV (which needs to be provided as a file either in S3:// or inside a ZIP:// or a URL:// (same as you would with e.g an Image attached to an ADO).

3.) If the Parent was ingested and has a value for a corresponding CSV, use the main CSV to map the UUIDs and (now available) node IDs of every Parent into the corresponding ispartof / containedby value for each row in the referenced CSV. Also auto-assign UUIDs for every child row at this time.

4.) Enqueue + background Process all the child objects.
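Steps 2 through 4 could be sketched roughly as follows. This is only an illustration of the proposed order of operations: the function name `prepare_children`, the `node_uuid` column name, and the use of a random UUID V4 for auto-assignment are assumptions made for the example, while `ispartof` comes from the step above.

```python
import uuid


def prepare_children(parent_row, child_rows, ingested_parents):
    """Link child rows to an ingested parent; return [] if it is missing.

    ingested_parents maps parent UUID -> node ID (hypothetical structure).
    """
    parent_uuid = parent_row["node_uuid"]
    if parent_uuid not in ingested_parents:
        # Step 2: never enqueue children for a parent that failed to ingest.
        return []
    prepared = []
    for row in child_rows:
        linked = dict(row)  # copy, so the source rows stay untouched
        # Step 3: point each child at its parent and assign a UUID if absent.
        linked["ispartof"] = parent_uuid
        if not linked.get("node_uuid"):
            linked["node_uuid"] = str(uuid.uuid4())
        prepared.append(linked)
    return prepared


parent = {"node_uuid": "2f1c9a3e-5b7d-4c8e-9a1f-3d6b8e0c4a72", "label": "EAD 1"}
rows = [{"label": "Box 1"}, {"label": "Box 2", "node_uuid": "preset-value"}]

# Step 4 would enqueue these prepared rows for background processing.
children = prepare_children(parent, rows, {parent["node_uuid"]: 12})
print([c["ispartof"] for c in children])
```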

For reversing/undoing: for EAD Parent + Child (Container) objects, would it be possible to have a secondary option to 'Delete all Processed Parents + Children' combined?

Again, just writing this out here for reference's sake; I understand that you've already worked through much of the needed functionality and the different process checks. Looking forward to discussing with you live just a bit later today! :)

DiegoPino commented 7 months ago

@alliomeria thanks so much for your comments and workflow ideas.

So far I think I am aligned with your needs and use cases. There are a few things code-wise I need to figure out, like 4), the Delete all children part. I totally agree that leaving containers behind without parents makes no sense. Maybe it is time to start planning, for 1.5.0 (so later this year), a move to fully UUID-driven relationships without affecting current code/workflows. That is a larger issue.

For the UUID generation of the children CSVs, I was thinking of requiring those to be set initially by the AMI owner, but I can see how that might limit the possibilities. It is tricky to edit those CSVs since they might be either remote or contained in a ZIP, but I think it can be done. I will propose here to use UUID V5 (I will test manually first, though). UUID V5 requires another UUID as a namespace (so the Parent's UUID) and a string (a unique ID, like an ArchivesSpace ID?) to generate. The results are consistent and compatible: the fact that they are UUID V5 is encoded in the output (the string itself), so they will never "clash" with a UUID V4, but they will validate correctly.
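The claim that the version is encoded in the string itself can be checked with Python's standard `uuid` module. In this small sketch the parent UUID and the identifier `aspace_ref_0001` are invented placeholders, not real values.

```python
import uuid

# Placeholder parent UUID (namespace) and a hypothetical unique identifier.
parent = uuid.UUID("2f1c9a3e-5b7d-4c8e-9a1f-3d6b8e0c4a72")
derived = uuid.uuid5(parent, "aspace_ref_0001")
random_id = uuid.uuid4()


def version_char(value: uuid.UUID) -> str:
    """Return the version digit, the first character of the third group."""
    return str(value)[14]


print(version_char(derived))    # version digit of the derived UUID
print(version_char(random_id))  # version digit of the random UUID

# The derived value still parses as a regular UUID and round-trips unchanged.
round_trip = uuid.UUID(str(derived))
print(round_trip == derived)
```

Because the version digit differs between the two kinds, a V5 value can never equal a V4 value, yet both pass the same format validation.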

More soon. I have tons of little tests to do before sharing. Appreciate your time and wisdom

esmero / ami

The EAD AMI Ingest Plugin #195
