Islandora / documentation

Contains Islandora's documentation and main issue queue.

Storing fixity auditing outcomes #847

Open mjordan opened 6 years ago

mjordan commented 6 years ago

I would like to start work on fixity auditing (checksum verification) in CLAW. In 7.x, we have the Checksum and Checksum Checker modules, plus the PREMIS module, which serializes the results of Checksum Checker into PREMIS XML and HTML. Now is a good time to start thinking about how we will carry this functionality over to CLAW so that on migration, we can move PREMIS event data from the source 7.x to CLAW.

In 7.x, we rely on FCREPO 3.x's ability to verify a checksum. In a Drupal or server-side cron job, Checksum Checker queries each datastream's REST endpoint with checksum validation enabled (/objects/{pid}/datastreams/{dsID}?[asOfDateTime][format][validateChecksum]) and stores the fixity event outcome in the object's AUDIT datastream. The Fedora API Specification, on the other hand, does not require implementations to validate a binary resource's fixity; instead, it requires them to return a binary resource's checksum to the requesting client, allowing the checksum value to "be used to infer persistence fixity by comparing it to previously-computed values."
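
To make the difference concrete, here's a minimal sketch of asking a spec-compliant implementation for a checksum via the Want-Digest request header the spec describes; the URL is a placeholder and error handling is omitted:

```php
<?php
// Minimal sketch: ask a Fedora API implementation for a binary's checksum via
// the Want-Digest header (RFC 3230). The URL below is a placeholder.
$ch = curl_init('http://localhost:8080/fcrepo/rest/foo/binary');
curl_setopt_array($ch, [
    CURLOPT_NOBODY         => true,                 // HEAD: headers only, no body
    CURLOPT_HEADER         => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => ['Want-Digest: sha'],
]);
$headers = curl_exec($ch);
curl_close($ch);

// The response should carry the value in a Digest header, e.g. "Digest: sha=...",
// which we would compare against a previously stored value ourselves.
if (preg_match('/^Digest:\s*(.+)$/mi', $headers, $m)) {
    echo "Reported digest: " . trim($m[1]) . "\n";
}
```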

Therefore, in CLAW, to perform fixity validation, we need to store the previously-computed value ourselves. In order to ensure long-term durability and portability of the data, we should avoid managing it using implementation-specific features. Two general options for storing fixity auditing event data that should apply to all implementations of the Fedora spec are (1) within the repository as RDF, and (2) external to the repository, in the triplestore or a database.

Fixity event data can accumulate rapidly. The 7.x Checksum Checker module's README documents the effects of adding event outcomes to an object's AUDIT datastream, but in general, each fixity verification event on a binary resource will generate one outcome, which includes the timestamp of the event and a binary value (passed/failed). For example, in a repository that contains 100,000 binary resources, each verification cycle will generate 100,000 new outcomes that need to be persisted somewhere. In our largest Islandora instance, which contains over 600,000 newspaper page objects, we have completed 14 full fixity verification cycles, resulting in approximately 8,400,000 outcome entries.

I would like to know what people think are the pros and cons of storing this data both within the repository as RDF and external to the repository using the triplestore or a database. My initial take on this question is:

One possible mitigation against the loss of an RDBMS is to periodically dump the data as a text file and persist it into the repository; that way, if the database is lost, it can be recovered easily. The same strategy could be applied to data stored in the triplestore.

If we can come to consensus on where we should store this data, we can then move on to migration of fixity event data, implementing periodic validation ("checking"), serialization, etc.

ajs6f commented 6 years ago

Couple of simple thoughts:

In order to ensure long-term durability and portability of the data, we should avoid managing it using implementation-specific features.

+1000.

As for where to put it, if this checksum info is stored in support of durability, it should be treated at least as well as other durable information: stored in multiple places, for each operation (as opposed to occasional bulk updates from one location to another). Which location is used as an authoritative source seems to me mostly to depend on pragmatic considerations (i.e. how to keep the architecture simple and performant).

mjordan commented 6 years ago

Thinking this through a bit more, if we store event outcome data as RDF in either FCREPO or the triplestore, we'd need not just one triple but two for each event: one for the timestamp and one for the outcome (passed/failed), assuming the trusted checksum value need only be stored once, not per verification event. So a verification cycle of 100,000 resources would result in 200,000 new triples.

If that's the case, maybe storing both pieces of info in one row in a database table is more efficient, replicating the info by periodically persisting the db table(s) into Fedora as a binary resource. Doing that at least ensures there are two copies, although not necessarily robustly distributed copies. But databases are pretty easy to replicate, so if we want distributed replication, that's also an option.
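
To illustrate the one-row-per-event idea, a minimal sketch using PDO with SQLite (the table and column names are made up for this example):

```php
<?php
// Minimal sketch of "one row per verification event", using PDO with SQLite
// purely for illustration; the table and column names are made up.
$db = new PDO('sqlite:fixity.db');
$db->exec('CREATE TABLE IF NOT EXISTS fixity_event (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    resource_id TEXT NOT NULL,   -- URI or UUID of the binary resource
    checked_at  TEXT NOT NULL,   -- ISO 8601 timestamp of the verification
    outcome     TEXT NOT NULL    -- "passed" or "failed"
)');

// One verification cycle writes one row per resource checked.
$stmt = $db->prepare('INSERT INTO fixity_event (resource_id, checked_at, outcome)
                      VALUES (:resource, :time, :outcome)');
$stmt->execute([
    ':resource' => 'http://localhost:8080/fcrepo/rest/foo/binary',
    ':time'     => gmdate('c'),
    ':outcome'  => 'passed',
]);
```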

ajs6f commented 6 years ago

The problem seems to me to be that a fixity check is transactional information. But the pattern suggested here persists it long after the transaction is done. Why store the events at all? Why not publish them like any other events in the system, and if a given site wants to store them or do something else with them, cool, they can figure out how to do that together with other sites that share that interest. But why make everyone pay the cost of persisting fixity check events? Speaking only for SI, we certainly don't need or want to do that.

Does anything other than a single checksum per non-RDF resource actually need to be stored?

mjordan commented 6 years ago

Currently, in 7.x, enabling the Checksum and Checksum Checker modules is optional, and I'm not suggesting that similar functionality in CLAW would be any different. Sorry I didn't state that explicitly. Any functionality in CLAW to generate, manage, and report on fixity auditing would be implemented as Drupal contrib modules.

We would want to store events so we can express the history of a resource in PREMIS (for example). In our use case, we want to be able to document that uninterrupted history across migrations between repositories, from 7.x to CLAW.

dannylamb commented 6 years ago

I think what's getting wrapped up in here is the auditing functionality. If we just need to check fixity, stick it on as a Drupal field. It'll wind up in the Drupal db, the triplestore, and Fedora. If you want to persist audit events, I'd model that as a content type in Drupal and it'll get persisted to all three as well by default. Of course, you could filter it with Context and make it populate only what you want (e.g. just the triplestore and not Fedora).

mjordan commented 6 years ago

@dannylamb I hadn't thought about modelling fixity events as a Drupal content type. One downside to doing that is adding all those nodes to Drupal. I'm concerned that over time, the number of events will grow very large, with dire effects on performance.

dannylamb commented 6 years ago

@mjordan And after thinking about this some more, if you're worried about performance, your best bet is usually something like a finely tuned Postgres. Just putting it in Drupal, and not Fedora or the triplestore, may be the way to go. I'd just be sure to snapshot the db. That's a perfectly acceptable form of durability if you ask me.

dannylamb commented 6 years ago

Ha, needed to refresh.

dannylamb commented 6 years ago

@mjordan Yes, that's certainly a concern. That threshold of "how much is too much for Drupal" is looming out there. It'd be nice to find out where that really is.

mjordan commented 6 years ago

I agree with @ajs6f's characterization of fixity verification as transactional, which is why I'm resisting modelling the events as Drupal nodes.

We should do some thorough scalability testing, for sure. Maybe we should open an issue for that now, as a placeholder?

dannylamb commented 6 years ago

I see what you're saying. It's not like you're going to be scouring that history all the time, so there's no point in having it bog down everything else. If it's too ham-fisted to model them as nodes, then having a Drupal module just to emit them onto a queue is super sensible. And sandboxing it to its own storage is even more so. As for what that is/should be?

I guess that depends on what you're going to do with it and how you want to access it. I presume you'd want to be able to query it? That at least narrows down the choice to either SQL or the triplestore if you wanna stay with the systems already in the stack.
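
For example, a minimal sketch of emitting an event onto a queue, assuming the PECL stomp extension and a local STOMP broker; the queue name and message shape are made up:

```php
<?php
// Sketch of emitting a fixity event onto a message queue rather than storing
// it as a node. Assumes the PECL stomp extension and a local STOMP broker;
// the queue name and message fields are invented for illustration.
$stomp = new Stomp('tcp://localhost:61613');
$event = json_encode([
    'resource'   => 'http://localhost:8080/fcrepo/rest/foo/binary',
    'checked_at' => gmdate('c'),
    'outcome'    => 'passed',
]);
$stomp->send('/queue/islandora-fixity-events', $event, ['content-type' => 'application/json']);
```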

dannylamb commented 6 years ago

...or Solr.

mjordan commented 6 years ago

Yeah, we're going to want to query it. If we store the SHA1 checksum as a field on the binary node (which sounds like a great idea), we'll want to query the events to serialize them as PREMIS, for example ("give me all the fixity verification events for resource X, sorted by timestamp" would be nice).
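
Something like this query sketch, assuming the illustrative fixity_event table from the earlier sketch:

```php
<?php
// Query sketch: all fixity verification events for one resource, oldest first,
// assuming the illustrative fixity_event table from the earlier sketch.
$db = new PDO('sqlite:fixity.db');
$stmt = $db->prepare('SELECT checked_at, outcome FROM fixity_event
                      WHERE resource_id = :resource
                      ORDER BY checked_at ASC');
$stmt->execute([':resource' => 'http://localhost:8080/fcrepo/rest/foo/binary']);
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $event) {
    // Each row would become one PREMIS event in whatever serializer is used.
    printf("%s: %s\n", $event['checked_at'], $event['outcome']);
}
```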

seth-shaw-unlv commented 6 years ago

We weren't necessarily planning on using CLAW to manage fixity. I'm actually interested in what UMD proposed, which includes using a separate graph in the triplestore specifically for audit data. Even if you were using the same triplestore for both, placing them in separate graphs should preserve performance on the CLAW one.

ajs6f commented 6 years ago

I guess that depends on what you're going to do with it and how you want to access it.

Can't agree enough!

@seth-shaw-unlv Did you mean separate datasets? Because in most triplestores (depends a bit, but Jena is a good example) putting them in separate named graphs in one dataset isn't going to do anything for performance. (Putting one in the default graph and one in a named graph would do a little, but not anything much compared to putting them in separate datasets.)

Generally, my experience has been that in non-SQL stores (be they denormalized like BigTable descendants or "hypernormalized" like RDF stores) query construction makes the biggest difference in performance, and should dictate data layout.

@mjordan Sorry about the misunderstanding-- I thought you were talking about workflow to which every install would have to subscribe. Add-on/optional stuff, no problem!

seth-shaw-unlv commented 6 years ago

@ajs6f, yes, you are right. I was, admittedly, speaking based on an assumption that separate graphs would improve performance due to a degree of separation. I don't have experience scaling them yet.

ajs6f commented 6 years ago

@seth-shaw-unlv I think we're all going to learn a bunch in the next few years about managing huge piles of RDF!

mjordan commented 6 years ago

@seth-shaw-unlv the UMD strategy looks good, but it's specific to fcrepo. I think it's important that Islandora not rely on features of a specific Fedora API implementation. Also, I'm hoping that we can implement fixity auditing in a single Drupal module, without any additional setup (which is what we have in Islandora 7.x).

@ajs6f no problem, we're all so focussed on getting core functionality right that I should have made it clear I was shifting to optional functionality.

whikloj commented 6 years ago

I think the UMD plan could be simplified to:

  1. Do fixity (as defined in the Fedora API) on some scheduled process.
  2. Store fixity check result somewhere.*

I'd like to keep the processing of fixity off of the Drupal server if possible, as this is a process that, for large repositories, could be running constantly.

mjordan commented 6 years ago

@whikloj yes, I was starting to think about abstracting the storage out so individual sites could store it where they want. About keeping the processing off the Drupal server, you're right, the process would be running constantly. But I don't see how issuing a bunch of requests for checksums, then comparing them to the previous value, then persisting the results somewhere would put a huge load on the Drupal server. It's the Fedora server that I think will take the biggest hit since, if my understanding is correct, it needs to read the resource into memory to respond to the request for the checksum. A while back I did some tests on a Fedora 3.x server to see how long it took to verify a checksum and found that "the length of time it takes to validate a checksum is proportionate to the size of the datastream"; I assume this is also true to a certain extent with regard to RAM usage, although I didn't test for that.

mjordan commented 6 years ago

Following up on @whikloj's suggestion of moving the fixity checking service off the Drupal server, would implementing it as an external microservice be an option? That way, non-Islandora sites might be able to use it as well. It kind of complicates where the data is stored (maybe that could be abstracted such that Islandora stores it in the Drupal db, Samvera stores it somewhere else, etc.). Such a microservice could be containerized if desired.

dannylamb commented 6 years ago

:+1: Doing it as a microservice will indeed abstract away all those details. The web interface you design for it will allow individual implementors to use whatever internal storage they want.
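
A hypothetical sketch of what that web interface could look like as a Symfony controller; the route, class names, and storage lookup are invented for illustration:

```php
<?php
// Hypothetical sketch of the microservice's web interface in Symfony 4; the
// route, controller name, and storage lookup are made up for illustration.
namespace App\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
use Symfony\Component\HttpFoundation\JsonResponse;
use Symfony\Component\Routing\Annotation\Route;

class FixityController extends AbstractController
{
    /**
     * @Route("/fixity/{resource}", methods={"GET"}, requirements={"resource"=".+"})
     */
    public function events(string $resource): JsonResponse
    {
        // Look up previously recorded events for this resource in whatever
        // backend the site chose (SQL, triplestore, etc.).
        $events = []; // placeholder for a storage lookup
        return new JsonResponse(['resource' => $resource, 'events' => $events]);
    }
}
```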

mjordan commented 6 years ago

Sounds like a plan - anyone object to moving forward on this basis? The "Islandora" version of this would be a module that consumed data generated by the microservice to provide reports, checksum mismatch warnings, etc.

dannylamb commented 6 years ago

The "Islandora" version of this would be a module that consumed data generated by the microservice to provide reports, checksum mismatch warnings, etc.

Reading this, my gut is telling me the microservice should stuff everything into its own SQL db and we point Views at it in Drupal to generate reports/dashboards.

jonathangreen commented 6 years ago

I totally agree with the microservice idea for doing fixity checks.

Not sure if we should handle it in this issue, or in another issue, but one thing we are missing (and missing completely in 7.x) is the ability to provide a checksum on ingest, and have it verified once the object is in storage, failing the upload if the fixity check fails.

This is the most common fixity feature I'm asked for in Islandora 7.x, and it covers the statistically most likely case of the file getting mangled in transit, rather than while sitting on disk.
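
A minimal sketch of that transmission-fixity check, assuming the uploader supplies a SHA-1 value with the file; the field name and path are placeholders:

```php
<?php
// Minimal sketch of transmission fixity on ingest: the client supplies a
// checksum with the upload, and the server recomputes it after the file is
// written. The field name and path below are placeholders.
$claimed = $_POST['sha1'] ?? '';          // checksum supplied by the uploader
$stored  = '/tmp/uploaded-file';          // wherever the upload was written
$actual  = hash_file('sha1', $stored);

if (!hash_equals($claimed, $actual)) {
    // Fail the ingest: the file was mangled in transit (or the claimed value is wrong).
    http_response_code(409);
    exit("Checksum mismatch: expected {$claimed}, got {$actual}\n");
}
```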

dannylamb commented 6 years ago

@jonathangreen We're halfway there on transmission fixity. We cover it on the way into Fedora from Drupal, but not from upload into Drupal. We can add that as a separate issue to add it to the REST API and wherever else (text field on upload form?).

jonathangreen commented 6 years ago

@dannylamb sounds good to me.

ajs6f commented 6 years ago

Just to be clear, this would be a service that produces checksums for the frontend via its own path to persistence, not a service to which the binaries are transmitted over HTTP for a checksum, right?

dannylamb commented 6 years ago

@jonathangreen https://github.com/Islandora-CLAW/CLAW/issues/867

mjordan commented 6 years ago

All sounds good. @jonathangreen can you open a JIRA ticket for the 7.x feature? I'm happy to work on it.

dannylamb commented 6 years ago

@ajs6f Shoulda cleared my cache. Yes, this is just to provide a checksum when uploading a file, not to produce them.

ajs6f commented 6 years ago

@dannylamb Cool, what I'm trying to rule out is:

  1. 2TB genomics file gets uploaded.
  2. 2TBGF then gets retransmitted to this service, for checksumming on demand.

dannylamb commented 6 years ago

@ajs6f Word. That's the sort of thing that should only happen if you want it to happen.

ajs6f commented 6 years ago

That's the way

mjordan commented 6 years ago

Some lunchtime thoughts... I have a friend who may be interested in starting to hack such a microservice, just to get his feet wet... He prefers to remain anonymous for now, but let's call him Marvin Jordanski. Anyway, he's not sure whether to start his microservice using Silex, like the other CLAW microservices, or to use Symfony as suggested in #828. Sounds like it's Symfony from here on in, but anyone got any additional advice?

DiegoPino commented 6 years ago

My advice to Mr. Jordanski: Silex is dead, so better to let it rest in peace. Symfony 4 would allow your (oops, I mean "his") microservice to live way longer.

mjordan commented 6 years ago

Thanks @DiegoPino, I'll pass that advice on to him when I see him.

DiegoPino commented 6 years ago

@jonathangreen for the 7.x ticket, I feel it would be good to note somewhere in that ticket, for whoever ends up writing it, that some chunked-transmission implementations like plupload could have issues with a user-provided hash via a form (e.g., where to put it and how or when to trigger it, since assembly of the final upload happens somewhere else...).

jonathangreen commented 6 years ago

@DiegoPino here is the ticket for 7.x if you want to add some notes: https://jira.duraspace.org/projects/ISLANDORA/issues/ISLANDORA-2261

mjordan commented 6 years ago

During the 2018-07-11 CLAW tech call, @rosiel asked about checksums on CLAW binaries stored external to Fedora, e.g. in S3, Dropbox, etc. Getting Fedora to provide a checksum on these binaries could be quite expensive since it pulls down the content to run a fixity check on it. One idea that came up in the discussion was that if we are using an external microservice to manage/store fixity checking, we could set up rules to verify checksums on those remote binaries. The microservice would need to pull a binary down to do its check, but if the storage service provided an API to get a checksum on a binary, our microservice could query that API instead.
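
For example, a sketch of querying S3 for an object's checksum via its API instead of downloading the content, assuming the AWS SDK for PHP; bucket and key are placeholders, and note the ETag is only a plain MD5 for non-multipart uploads:

```php
<?php
// Sketch: ask S3 for an object's checksum via its API instead of downloading
// the content. Assumes the AWS SDK for PHP; bucket, key, and region are
// placeholders. The ETag is only a straight MD5 for objects that were not
// uploaded in multiple parts, so a site may prefer to store its own digest
// as object metadata at upload time.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3 = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);
$result = $s3->headObject(['Bucket' => 'my-repository-bucket', 'Key' => 'binary-resource']);
$etag = trim($result['ETag'], '"');
echo "S3-reported ETag: {$etag}\n";
```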

DiegoPino commented 6 years ago

Or maybe use those services' native fixity checks... like on S3, where you're paying for it and getting that hash as tech metadata via the API?

mjordan commented 6 years ago

@DiegoPino yes, that's what I meant. Sorry if that was not clear.

DiegoPino commented 6 years ago

Sorry, my fault! Re-reading and totally agree, sorry again @mjordan

jonathangreen commented 6 years ago

I'd like to see a way here to have a trust-but-verify approach, where you can pull checksums from an external API, but maybe at some lower frequency you still want to pay for the bandwidth to do some verification of the checksums yourself. Could just be some configuration options.

DiegoPino commented 6 years ago

@jonathangreen I agree it could be useful for providing a preservation platform more compliant with what is expected, but in terms of implementation, how would you propose we avoid false positives of corruption caused by timed-out, stalled, or outright failed downloads? Not something that keeps me awake at night right now, but HTTP(S), which is what most APIs provide for downloading assets, tends to be hit and miss in that respect. As said, I agree this is needed; I just don't know how to deal with it at an implementation level in a safe and reliable way.

mjordan commented 6 years ago

One approach to handling false mismatches is to retry the request if it fails, and see what the results are. A one-off failure can be discarded, but if they all fail, the problem is probably legit.
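
A sketch of that retry idea; the digest-fetching callable is a hypothetical stand-in for however the checksum gets pulled over HTTP:

```php
<?php
// Sketch of the retry idea: only treat a mismatch as real if repeated attempts
// agree. $fetchDigest is any callable that pulls a checksum over HTTP and
// returns null on a failed or stalled download (hypothetical, for illustration).
function verify_with_retries(callable $fetchDigest, string $expected, int $attempts = 3): bool
{
    for ($i = 0; $i < $attempts; $i++) {
        $digest = $fetchDigest();          // may return null on a bad download
        if ($digest !== null && hash_equals($expected, $digest)) {
            return true;                   // any clean match means the copy is intact
        }
        sleep(5);                          // back off a little before retrying
    }
    return false;                          // every attempt failed or mismatched: likely a real problem
}
```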

mjordan commented 6 years ago

Not to increase scope here too much, but keep the use and edge cases coming. I'll be working on this (and related preservation stuff in CLAW) pretty much full time this fall.

ajs6f commented 6 years ago

This is starting to sound like an application on top of something like iRODS. I'm not seriously suggesting that; I'm wondering whether, for the MVP, it would be enough to have a simple µservice that just retrieves a checksum from the backend in use, on the assumption that such a checksum is available?

I'm not at all trying to discourage people from recording use cases, and I think it's awesome that @mjordan is thinking through this; it's just that when you're facing problems like reliable transport of mass data across networks of unknown character... that's a pretty big scope.

mjordan commented 6 years ago

@ajs6f MVP is a good way to frame it. I don't have the cycles to propose one this week but need to prepare a poster for iPres so will need to do that soon (next couple weeks?). We can build it in a way that can expand on the MVP.

rosiel commented 6 years ago

Maybe this could be one consideration in a repo manager's choice of storage solution. Now that we have all these options, we're going to need to make educated decisions on which one to use. Just because something's cool doesn't mean it's the right tool for your job, and if you need reliable, regular, automated, locally-performed checksumming (and maybe that's a preservation best practice?) then S3 might not be the ideal storage location for you?