MarcusBarnes / mik

The Move to Islandora Kit is an extensible PHP command-line tool for converting source content and metadata into packages suitable for importing into Islandora (or other digital repository and preservations systems).
GNU General Public License v3.0
34 stars 11 forks source link

What are the required functions of a "FileGetter/FileManipulator/InputValidator/etc" #400

Closed whikloj closed 7 years ago

whikloj commented 7 years ago

This is a semi-documentation, semi-codebase question.

I am looking to to use MiK to move our Manitobia TEI stuff into Islandora. I went in to look at how to extend the base classes and they don't seem to have any abstract required functions that would be called?

I was expecting something more like an Interface which defines the required methods which each implementation would then implement. Say (for example) all Getters must implement a getFiles method which takes a... string for location and returns an array of files? Or maybe more generally it must implement a doGet function which can take the configuration and return the file information.

So what methods do I need to implement in my new Getter if I want it to work within the MiK framework? Or is it rather that my custom Getter can completely change my workflow?

Sorry I didn't get to attend the workshop at IslandoraCon 😞

MarcusBarnes commented 7 years ago

@whikloj You are correct in that the fileGetter abstract class is vary bare: https://github.com/MarcusBarnes/mik/blob/master/src/filegetters/FileGetter.php This is in part to allow maximum flexibility for different workflows. Maybe you can write me about where your TEI data is stored and which solution pack you'll be targeting for use in Islandora? I can then suggest a toolchain that you can use as a basis for your TEI project.

whikloj commented 7 years ago

It is not stored in a custom system, so it is fairly easy to access from the filesystem.

I just need to grab the TEI files and Tiffs. I have some that will go to Large Image (maps, photos), some for compound (postcards, reports) and some for books (diaries).

mjordan commented 7 years ago

@whikloj developer docs are one of MIK's most glaring gaps. We've known this since #14, way back in May 2015. and it's been on our 1.0 release roadmap since the beginning. :disappointed: That said, if @MarcusBarnes 's advice doesn't yield a useful existing toolchain, we can work through your task with you.

whikloj commented 7 years ago

@MarcusBarnes @mjordan Thanks for the responses and I'm certainly not complaining, it looks to be a really useful tool.

But as the barrier to access (none) and the number of items is relatively small (couple 1000) in my migration. (We are already using the standard uofm_newspaper_batch module to pull in the newspapers).

So I'm on the border of it being faster to just write a python script to reformat these things, but at the same time MiK seems very interesting to me. I'm just getting into this so I was unclear how the tool would know to run function(X) in my new TEIGetter class.

As a developer, I would suggest that an abstract class without any abstract methods should really be removed. It gives the impression that the classes are all extending a single class, when in reality they can all be totally unrelated.

whikloj commented 7 years ago

Looking through the FileGetters it appears that getChildren() is in all of them and getFilePath() is in most (but not all) of them. In the cases where they don't have getFilePath() how do they know what gets executed?

This is just something I noticed and you can ignore it, but you have several ContentDM classes that have the same __construct. That seems like it is a perfect place for a sub-class. Then you could just call the parent constructor.

mjordan commented 7 years ago

As a developer, I would suggest that an abstract class without any abstract methods should really be removed. It gives the impression that the classes are all extending a single class, when in reality they can all be totally unrelated.

@whikloj That's a totally reasonable impression on your part.

@MarcusBarnes 's point about allowing the most diverse implementations is correct, but we probably overloaded that ability early on and didn't pay attention to it as the number of FileGetter classes grew. For example, all child classes have a getChildren() method:

CdmBooks.php:    public function getChildren($pointer)
CdmCompound.php:    public function getChildren($pointer)
CdmNewspapers.php:    public function getChildren($pointer)
CdmPdfDocuments.php:    public function getChildren($pointer)
CdmSingleFile.php:    public function getChildren($pointer)
CsvBooks.php:    public function getChildren($record_key)
CsvCompound.php:    public function getChildren($record_key)
CsvNewspapers.php:    public function getChildren($record_key)
CsvSingleFile.php:    public function getChildren($pointer)
OaipmhIslandoraObj.php:    public function getChildren($record_key)
OaipmhIslandoraObj.php.orig:    public function getChildren($record_key)
OaipmhOjsPdf.php:    public function getChildren($record_key)
OaipmhVital_2.php:    public function getChildren($record_key)
OaipmhXpath.php:    public function getChildren($record_key)

All FileGetter classes that deal with single file objects (CsvSingleFile, etc.) have a getFilePath() method

CsvSingleFile.php:    public function getFilePath($record_key)
OaipmhIslandoraObj.php:    public function getFilePath($record_key)
OaipmhIslandoraObj.php.orig:    public function getFilePath($record_key)
OaipmhOjsPdf.php:    public function getFilePath($identifier)
OaipmhVital_2.php:    public function getFilePath($record_key)
OaipmhXpath.php:    public function getFilePath($record_key)

and then it kind of breaks down from there to accommodate the specific needs of the toolchain:

CsvCompound.php has getCpdSourcePath() CdmBooks.php has getPageOBJfileContent() CsvBooks.php has getBookSourcePath()

etc.

mjordan commented 7 years ago

@whikloj sorry I posted my last comment before seeing your latest one. Yes, would make sense. I'm not sure we have the appetite to refactor the Cdm classes at this point since we don't have unit tests for those (we never got around to mocking a CONTENTdm server....) and they were certainly written on a fast timeline, so they are not perfect OOP code.

mjordan commented 7 years ago

@whikloj to answer your question about what fires the correct method in the FileGetters, it's whatever is within the toolchain's Writer, specifically, within the Writer's writePackages() method. We started developer docs to explain all this in the "Writing fetchers, file getters, metadata parsers, and writers" section of https://github.com/MarcusBarnes/mik/wiki/Information-for-developers, but never got very far.

MarcusBarnes commented 7 years ago

@mjordan I'm expecting to have an opportunity to use the CONTENTdm classes again. If things come together properly, I'll aim to do the simple refactoring of the CONTENTdm classes as suggested in the second part of https://github.com/MarcusBarnes/mik/issues/400#issuecomment-309773381 as I'll have access to a CONTENTdm instance to test against.

whikloj commented 7 years ago

Oh okay, so the Writer is what calls stuff in the Getter and the Writer.php is an abstract class with required methods so that would be the entry point to the workflow. Your writer uses your getter to get files to write out. Am I close?

MarcusBarnes commented 7 years ago

@whikloj That's the idea. For certain toolchains (particularly CONTENTdm toochains), a great deal of compilation needs to take place before everything is in place to write out packages.

mjordan commented 7 years ago

@whikloj I've just added a section to https://github.com/MarcusBarnes/mik/wiki/Information-for-developers, directly at the top of the "Extending MIK" section, that explains MIK's overall workflow and the importance of the Writer's writePackages() method. I know you are asking for class API documentation, but does the new section fill in the context a bit?