EMCECS / ecs-sync

ecs-sync is a bulk copy utility that can move data between various systems in parallel.
Apache License 2.0

Source MD5 field - blank for CAS to CAS migration? #61

Closed: evergreek closed this issue 4 years ago

evergreek commented 4 years ago

Hello,

I just tried to move 600 or so clips from Centera to ECS with the "Verify" option selected. In the report generated for the job, the "Source MD5" column is blank, yet the "Verify Start" and "Verify Complete" columns are populated. This is on 3.3.0, using the GUI to run the job.

Any ideas?

On a side note, does anyone have experience copying 1.5 million or so clips (65 TB)? Is it better to use a clip list or to let it query the DB?

twincitiesguy commented 4 years ago

The source_md5 field will not be populated for CAS objects. This is because the MD5 used to compare CAS data is actually a block of checksums (1 for the clip CDF and 1 for each blob), so it is too long to fit in the column. However, be assured that if the status in the DB says “Verified”, then the MD5 in source and target match.

To answer your last question, for CAS, a clip list is generally preferred. This is mainly because the act of generating the clip list can often reveal issues (if any) in the source pool, and it also gives you something to reference against an application database or expected count. If there are any discrepancies, you can reconcile them early (e.g. orphaned clips that were not cleaned up properly, test data, etc.).
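
As a rough illustration of that reconciliation step, here is a minimal Python sketch that compares a clip list against the clip IDs an application database expects. The function name, the clip IDs, and the idea that both sides are available as plain lists of strings are assumptions made for this example; this is not part of ecs-sync.

```python
def reconcile(clip_list_ids, app_db_ids):
    """Compare clip IDs found in the source pool with those the application expects."""
    source = set(clip_list_ids)
    expected = set(app_db_ids)
    return {
        # In the source pool but unknown to the application (orphans, test data)
        "orphaned_in_source": sorted(source - expected),
        # Known to the application but absent from the source pool
        "missing_from_source": sorted(expected - source),
    }


# Hypothetical clip IDs, for illustration only
clip_list = ["CLIP-A", "CLIP-B", "CLIP-TEST"]
app_db = ["CLIP-A", "CLIP-B", "CLIP-C"]

report = reconcile(clip_list, app_db)
print(report)
```

Any ID that shows up in either bucket is worth investigating before the migration starts rather than after.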

evergreek commented 4 years ago

Is it possible to see an example on the DB? My customer is asking for "proof" that the checksum is indeed happening.

twincitiesguy commented 4 years ago

Unfortunately, no. As I mentioned in the comment above, the checksum blob is actually a list of MD5s. This list is variable length, depending on the number of tags and blobs attached to the clip. You will only see the blob if there is a mismatch, in which case, you can compare the two and see which individual tags/blobs are different and perhaps manually correct them, or fix the clip in the source and re-copy.

I don’t want to reduce the checksum list to a single MD5 because that would not allow you to see which blob is different. But I can’t store the entire list in the DB because it could be too big to fit and cause DB errors that would end up failing the clip. Therefore, the MD5 for CAS clips is not stored anywhere.
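
To make the trade-off concrete, here is a minimal Python sketch of the idea, not the actual ecs-sync code. It assumes a clip can be modeled as one CDF byte string plus a list of blob byte strings; the function names and sample data are hypothetical. Comparing the per-part list shows which part differs, while a single combined digest only shows that something differs.

```python
import hashlib


def checksum_list(cdf, blobs):
    """One MD5 for the clip CDF followed by one MD5 per blob."""
    return [hashlib.md5(cdf).hexdigest()] + [hashlib.md5(b).hexdigest() for b in blobs]


def differing_parts(source_sums, target_sums):
    """Return labels of the parts whose checksums differ between source and target."""
    length = max(len(source_sums), len(target_sums))
    labels = ["cdf"] + [f"blob[{i}]" for i in range(length - 1)]
    diffs = []
    for i, label in enumerate(labels):
        s = source_sums[i] if i < len(source_sums) else None
        t = target_sums[i] if i < len(target_sums) else None
        if s != t:
            diffs.append(label)
    return diffs


def combined_digest(sums):
    """Collapse the list into a single MD5; this loses the per-part detail."""
    return hashlib.md5("".join(sums).encode("ascii")).hexdigest()


# Hypothetical clip contents: the second blob was altered on the target
source_sums = checksum_list(b"cdf-xml", [b"blob one", b"blob two"])
target_sums = checksum_list(b"cdf-xml", [b"blob one", b"blob TWO"])

mismatched = differing_parts(source_sums, target_sums)
print(mismatched)  # the per-part list pinpoints the altered blob
print(combined_digest(source_sums) == combined_digest(target_sums))
```

The list also grows with the number of blobs (32 hex characters per part in this sketch), which is why a fixed-width database column cannot reliably hold it.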

If you increase the log level to debug, all of the clip MD5s will be printed in the log, but that will generate very large logs and probably fill up the disk, causing several other problems. If none of these options is good enough, I guess you could find a way to tailor the log4j configuration to turn on debug for just the Md5Verifier class. That might take some research, though.