Clarify toolchain inputs/outputs or make content modules [META ISSUE]

whikloj commented 7 years ago

My difficulty with MIK is that while it says you can configure and add/remove parts, in fact much is required by the code, and so there is a lot of duplication of code.

If you want to make this easier to extend, I think one of two things would be best.

1. Make each step in the toolchain more clearly defined, with more rigid requirements (i.e., no longer allow fetchers to return either an object or an array).
   - For example, make fetchers return an array with "ID", "location", etc. as required fields, with the option of additional optional information, then make fetcher manipulators expect the output from fetchers as their input. That way they are not tied together and can be re-used. (The CdmSingleFileByExtension.php and CsvSingleFileByExtension.php fetcher manipulators both do the same thing, but because they use different fetchers, separate classes had to be created.)
2. Move to a module-based approach, where the input source (CSV or Cdm, etc.) is the key and much of the code is tied together more tightly around that defining factor. Other things (like metadataparsers) could still be passed in, and the input source code could interact with them using a narrower interaction model (i.e., parse a defined input to a defined output).
   - Perhaps fetchers and filegetters should be merged into a single class. Even metadataparsers rely on one of these to get the initial metadata, so perhaps they too should be combined.
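
To make the first option concrete, here is a minimal sketch of the idea. It is written in Python rather than MIK's PHP, and everything in it is an assumption for illustration: the `FetchedRecord` shape, the two fetcher functions, and the input keys (`Identifier`, `File`, `pointer`, `find`) are hypothetical and are not MIK's actual API. The point is only that once every fetcher returns the same record shape, a single "single file by extension" manipulator can serve both sources.

```python
import os
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class FetchedRecord:
    """Hypothetical fixed shape every fetcher would have to return."""
    id: str                      # required
    location: str                # required: path or URL of the object's file
    extra: Dict[str, Any] = field(default_factory=dict)  # optional information


def csv_fetcher(rows: Iterable[Dict[str, str]]) -> List[FetchedRecord]:
    """Illustrative CSV fetcher; the column names are assumptions."""
    return [
        FetchedRecord(id=row["Identifier"], location=row["File"])
        for row in rows
    ]


def cdm_fetcher(items: Iterable[Dict[str, Any]]) -> List[FetchedRecord]:
    """Illustrative CONTENTdm fetcher; the item keys are assumptions."""
    return [
        FetchedRecord(id=str(item["pointer"]), location=item["find"])
        for item in items
    ]


def single_file_by_extension(
    records: Iterable[FetchedRecord], extensions: Iterable[str]
) -> List[FetchedRecord]:
    """One manipulator for any fetcher: keep records whose file has an allowed extension."""
    allowed = {ext.lower().lstrip(".") for ext in extensions}
    return [
        record
        for record in records
        if os.path.splitext(record.location)[1].lstrip(".").lower() in allowed
    ]
```

Because the manipulator depends only on `FetchedRecord`, it can be called as `single_file_by_extension(csv_fetcher(rows), ["pdf"])` or `single_file_by_extension(cdm_fetcher(items), ["pdf"])` without a source-specific subclass.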

One of the things I found frustrating is that the Toolchain image appears to indicate that data flows down the chain. But in reality much of the chain is knotted (as in, you can't remove parts of it), looped, and called even if you don't configure it.

mjordan commented 7 years ago

@whikloj my vision for Move to Islandora Kit focuses on "kit", where use of the out-of-the-box components is accomplished via configuration, and extending the kit for local, specialized use cases is possible via standard OOP techniques (and other mechanisms such as post-write and shutdown hook scripts, which we don't consider part of the core MIK codebase).

Your experience coming in as a (new to the application) developer wanting to understand how MIK works and wanting to leverage its intended extensibility has been extremely eye-opening to us. We totally want to make MIK easier to use for both repository/metadata specialists and developers, and your contributions so far have already taken us a long way.

You're totally correct that the image we (specifically I) developed to visualize the relationship between MIK's components is misleading, or at least inaccurate. I did that up early on in MIK's history, and it's due for an update, particularly in light of your feedback.

Thanks for suggesting the two strategies for making MIK easier to extend. I think I am favoring the first one, implemented via stricter conformance to best-practice OOP. At the same time, I'd like to review how we implement manipulators. I'm also wondering about the CSV and Cdm toolchains' use of field mappings, and whether using Twig templates throughout is a better way to go. However, we really benefited from fetcher and metadata manipulators during our migration, so I'd want to make sure that it's easy for non-developers to achieve similar ends using configuration, not code.

I'm not really clear on what you mean by a "module-based approach" in your second suggestion. I'm reluctant to combine fetchers and filegetters at this point, but if we can't achieve the dream of making them interchangeable, we should consider combining them. But I want to take the combinable "toolchain" idea to its pragmatic ends using better OOP before we decide we need to explicitly combine components.

Let's get PR #424 merged into master (and any other outstanding issues/PRs we need to) and then regroup on the questions you raise here. How does that sound?

MarcusBarnes commented 7 years ago

It is appropriate for us to make an official release (even if it's v0.9) before addressing this meta-issue, as major architectural changes could impact existing users of MIK, even if they are ultimately for the best all-round. Thanks!

mjordan commented 7 years ago

Also, @MarcusBarnes and I have been mulling over when we should create a 1.0 tagged release. Maybe we should consider doing that after we merge #424 and decide that it had no major side effects. We could then start thinking about what an MIK 2.0 would look like.

mjordan commented 7 years ago

Sorry @MarcusBarnes, didn't see your comment until after I posted my last one. I'm happy either way!

whikloj commented 7 years ago

You know, I think that is what I was missing from the start. This is a toolkit for those that don't want to (or can't) write their own. I think there is room to adjust to allow some more code re-use, but I think I was trying to use MIK for something it is not. I'm happy to help out where I can; I think some more test coverage is a good idea. But for the stuff I am migrating I'm just going to write a Python script, as it is a custom system, so the migration is a one-off anyway.

mjordan commented 7 years ago

> This is a toolkit for those that don't want to (or can't) write their own.

Primarily, yes. We wrote it to support SFU's migration from CONTENTdm to Islandora, but since we had over 120 Cdm collections to migrate, we wanted to reduce the amount of work specific to each collection's migration down to a single configuration file, so we wouldn't have to write 120 custom scripts! (In practice each collection took a lot more work to migrate, but not because of MIK; it was because CONTENTdm lets you shoot yourself in the foot 120 times if you want to.) However, we wrote MIK knowing that 1) other sites might want to use the tool for similar migrations, and 2) post-migration, SFU would be relying on CSV data as its primary "input" to Islandora.

That said, since we wrote MIK while we were migrating, we didn't always have time to plan out the best OOP approach, and we were often reacting to surprises we only discovered after it was too late. Now that we're in post-migration production, we're still reacting... but now is also a good time to step back and ask how we can make it a better piece of software.

If your migration is a one-off, a custom script is probably a much easier and faster solution. But if you can predict with some certainty that you'll need to migrate (or prepare for ingestion) the same type of content again, then your task may make a good candidate for an MIK "extension", whether it's a new fetcher, filegetter, or manipulator. Since we've only been outputting ingest packages that can be ingested using the standard Islandora batch modules plus compound batch, we haven't had much need for new writers, but the sample CSVToJSON toolchain is an example of how you might generate another type of "ingest package" for some other repository (maybe Samvera, maybe CLAW?). I'm looking forward to when we can start working on an Islandora 7.x to CLAW toolchain (incidental mention of @dannylamb).

MarcusBarnes / mik

Clarify toolchain inputs/outputs or make content modules [META ISSUE] #425