EMCECS / ecs-sync

ecs-sync is a bulk copy utility that can move data between various systems in parallel
Apache License 2.0

ECS-SYNC pulls metadata and CDF's, but not blobs #55

Closed CAS-mover closed 4 years ago

CAS-mover commented 5 years ago

I have the ECS-SYNC OVA running version 3.3.0, which I upgraded to. I also ran a "sudo yum update" before trying anything. When running a UI migration from my Centera to an NFS path, I get JSON/CDF files named with the CAS checksum, but the system is not pulling the blob files with them. As a result, I have a migration of 2,500 objects that is 4 MB but should be around 290 GB. I am able to retrieve the metadata as well, but don't need it for my purpose.

To debug, I have tried a single thread with 1GB of buffer, all the "experimental" settings, and even pulling to the local filesystem. Thread count, buffer size, experimental settings and target do not seem to matter. Verification does seem to fail when it's on. I am not getting permission errors on the targets. I installed another OVA of ECS-SYNC 3.2.7, in case I somehow messed up the upgrade, but that seems to make no difference. The debug logging shows that the object sizes are being detected correctly--they are just not fetched or saved (not exactly sure which).

As another test, I loaded the JCASScript on a Windows machine, and I am able to use the "clipToFile" function to save a blob file using the original filename from the CDF. I did a "query" to produce a clip list, and that runs in 5 minutes instead of 5 hours, but the result is the same--no blobs.

Does anyone have any suggestions for next steps? I am using the UI, and don't really want to mess with the CLI options. I don't see anything relevant in them that might help, but it's possible the UI doesn't have all features. I can post logs, but have to sanitize them first per company policy, which is why they aren't posted now.

twincitiesguy commented 5 years ago

CAS is a multi-blob protocol, so you cannot simply migrate it to a single-blob protocol like NFS or S3. If your source clips have a single blob, you can use the CasSingleBlobExtractor filter to extract that blob as the object data. Otherwise, you can only migrate CAS data to another CAS system. If you try to migrate to NFS or a filesystem without a blob extractor, you will end up with the CDF as the data, because ecs-sync doesn't make any assumptions about the blob-to-file mapping.

CAS-mover commented 5 years ago

Thank you for the explanation! That does explain what I am seeing during the migrations. The CDF data is stored with the CAS address as the filename, and the blob is not extracted. The CDF includes the original blob filename, and the metadata isn't really necessary.

The data I am extracting has a single blob per CAS object. We want the blobs back off the Centera for use in a filesystem or S3 bucket. Does the ECS-SYNC UI allow the CasSingleBlobExtractor filter you mentioned to be utilized to do that?

I will try setting up a Minio VM to try exporting to S3 directly. Does that extract the blobs? For what I am attempting, that is an extra step just to get the blobs back, but I'll give it a go if it will help.

twincitiesguy commented 5 years ago

In the UI, you can add a filter and select the CAS Single Blob Extractor. Then fill out any options and it should be able to write to any single-blob storage system, like NFS or S3.

CAS-mover commented 5 years ago

I did have some success with the filter (totally glossed over the option before) and the results are exactly what I expected. I have 11 files of the correct size out of 2400+ objects. In my example object, there was a single blob file. After trying a few more CDFs, it appears the CenteraUtils part of the Vendor's original software created multiple blobs for a single file to split the object size. This likely means the CasSingleBlobExtractor won't touch them.

Here is an example CDF from ECS-SYNC: <?xml version='1.0' encoding='UTF-8' standalone='no'?>

A single "StoreContentObject" contains multiple "eclipblob" entries, probably to keep each one from being too large for the CAS API. If I write the CAS address to a file and use the JCASScript "clipToFile" command, the file is pieced back together as expected and is the correct size. It appears that ECS-SYNC doesn't have a way to export the data the same way? It's an amazing piece of software, and I understand if it can't do everything. I was mainly clarifying the situation to ask "should it work in this case?" I feel like I'm so close to having the data back, but it's not working.
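For anyone who lands here: the reassembly that "clipToFile" performs is just concatenating the blob segments in the order the CDF lists them. Here is a minimal Python sketch of that idea. This is illustrative only, not ecs-sync or Centera SDK code; the "refid" attribute and the segment-file layout are assumptions, so adapt the lookup to however your blobs were exported.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def reassemble_clip(cdf_path: str, segment_dir: str, out_path: str) -> int:
    """Concatenate blob segments in the order they appear in the CDF.

    Assumes each <eclipblob> carries a hypothetical 'refid' attribute
    naming a segment file in segment_dir -- real CDFs address blobs
    differently, so adapt this lookup to your export layout.
    Returns the total number of bytes written.
    """
    root = ET.parse(cdf_path).getroot()
    total = 0
    with open(out_path, "wb") as out:
        # Document order of <eclipblob> entries defines segment order.
        for blob in root.iter("eclipblob"):
            seg = Path(segment_dir) / blob.get("refid")
            data = seg.read_bytes()
            out.write(data)
            total += len(data)
    return total
```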

I'm not a Java programmer, so it seems like the only way to retrieve the data is through the software which wrote it or through JCASSCRIPT manually? So close...

CAS-mover commented 5 years ago


I have stopped trying to filter the "filename" parameter from the CDF, and am using the ClipID options in the "CasSingleBlobExtractor" filter. The blobs are coming out, and a spot-check shows they are the correct size! I will need to figure out how to rename them, but at least the data is moving into NFS! Renaming is a much easier problem to solve. I wish the filter above could use the "filename" parameter, but that is a matter for another day.
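Since the blobs land on NFS named by ClipID, the rename step can be scripted against the CDFs. A minimal sketch, assuming the blob file shares the CDF's stem (the clip ID) and that the original name sits in a "filename" attribute somewhere in the CDF; both are assumptions about this particular export, so verify the tag and attribute names against your own CDFs first.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def rename_blob(blob_dir: str, cdf_path: str) -> str:
    """Rename an extracted blob (named by its clip ID) to the original
    filename recorded in the matching CDF.

    Assumes the clip ID is the CDF file's stem and that some element in
    the CDF carries a 'filename' attribute -- both are assumptions about
    the export layout, not documented ecs-sync behavior.
    Returns the new path as a string.
    """
    cdf = Path(cdf_path)
    clip_id = cdf.stem                      # blob file shares the clip ID
    root = ET.parse(cdf_path).getroot()
    # Take the first element anywhere in the tree with a filename attribute.
    name = next(el.get("filename") for el in root.iter() if el.get("filename"))
    src = Path(blob_dir) / clip_id
    dst = Path(blob_dir) / name
    src.rename(dst)
    return str(dst)
```

Running this over each CDF/blob pair after the sync completes would restore the original filenames in place.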