EMCECS / ecs-sync

ecs-sync is a bulk copy utility that can move data between various systems in parallel
Apache License 2.0

Centera -> ECS CAS migration - failure #68

Closed evergreek closed 3 years ago

evergreek commented 4 years ago

job_config.txt

Hello,

I kicked off a job with 3.4 million clips via a CLIPLIST for a migration from Centera to ECS CAS. After about 18 hours (roughly 1.8 million clips transferred), the job seems to have crashed, and the GUI shows the following:

Service Not Running The ecs-sync service does not appear to be running. Please check that the service is installed correctly and successfully started.

An interesting note from the ecs-sync log:

    ======= Backtrace: =========
    /lib64/libc.so.6(+0x7c619)[0x7f42cd479619]
    /usr/local/Centera_SDK/lib/64/libFPStreams64.so(_ZN20FPBasicGenericStream8completeEv+0x1a2)[0x7f429e093ad2]
    /usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN23HPPWriteBlobTransaction3runEv+0x4f6)[0x7f429d837776]
    /usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN7Cluster9writeBlobEP20FPBasicGenericStreamP19FPBasicStringBufferS3ilR12FPObjectGuidlR13FPClipContextR12FPTagContextPK19MigrateTagInterfacePK9FPHashMapSF+0x3f6)[0x7f429d7d62b6]
    /usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN12ClusterCloud9writeBlobEP20FPBasicGenericStreamP19FPBasicStringBufferS3ilR12FPObjectGuidlR13FPClipContextR12FPTagContextPK9FPHashMapSC+0x57c)[0x7f429d7e38ac]
    /usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN5FPTag11restoreBlobER20FPBasicGenericStreaml+0x6ef)[0x7f429d82590f]
    /usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN5FPTag9BlobWriteER20FPBasicGenericStreamlli+0xc1d)[0x7f429d82b61d]
    /usr/local/Centera_SDK/lib/64/libFPCore64.so(_Z16_FPTag_BlobWriteP5FPTagP20FPBasicGenericStreaml+0x2e)[0x7f429d80012e]
    /usr/local/Centera_SDK/lib/64/libFPLibrary64.so.3.4.757(FPTag_BlobWrite+0x5e)[0x7f429e2edade]
    /usr/local/Centera_SDK/lib/64/libFPLibrary64.so.3.4.757(Java_com_filepool_natives_FPLibraryNative_FPTag_1BlobWrite+0x72)[0x7f429e3179b2]
    [0x7f42b5aca34e]

I have attached the last few lines from ecs_sync_log and cas-sdk (I removed the CLIPID references) logs along with my config.

It was a pretty straightforward job with "Estimation" turned off for performance reasons.

I noticed that it started to hit clips of about 100 MB in size. For example:

    Size:                101020862
    Number of Tags:      67
    Number of Blobs:     1

Do I need to enable any special settings due to the amount of tags/blobs?

Is there a way to resume the job from where it failed? How can I identify the resume point?

Will I be able to see a report of the clips that did transfer, along with a verification report?

Attachments: cas-sdk.log, sdk config.txt, ecs_sync_log.txt

holgerjakob commented 4 years ago

Hi, restart the VM or the sync service to make sure both the UI and the service are running.

As the SDK log shows, you are getting -10021 errors. This is an FP_CLIP_NOT_FOUND_ERR error: the cliplist apparently has entries that cannot be found on the source cluster, so at first sight I would not assume this has anything to do with ecs-sync, but rather with your cliplist. Try (re)running the job to completion and make sure to specify the DB table name so that it does not get deleted. Then select the source_id from the table where the status is error and create a new job to read these clips from the secondary Centera.
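A minimal sketch of that follow-up query (the database name `ecs_sync`, the table name `my_migration`, and the exact status string are assumptions; use the table name you configured on the job and confirm the status values in your database first):

```shell
#!/bin/sh
# Hypothetical job table name -- substitute the "db table name" set on the job
TABLE="my_migration"

# Query that pulls the source clip IDs of failed transfers
SQL="SELECT source_id FROM ${TABLE} WHERE status = 'Error';"
echo "$SQL"

# Once the table and status values are confirmed, run it against the
# ecs-sync database and use the result as the cliplist for a retry job:
# mysql -N ecs_sync -e "$SQL" > retry-cliplist.txt
```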

Without understanding how you created your cliplist, it's difficult to say much more than this at the moment. If you want, reach out to holger.jakob@informatio.ch

Best regards, Holger

evergreek commented 4 years ago

Holger,

The cliplist was generated straight from the application database. I checked the clips with JCASSCRIPT and they do exist on the source.

I wonder if these are false errors raised when it checks whether the clip already exists on the "target" ECS side, before sending it over?

I did a dump of the ecs-sync DB and found a couple of clips that were in verification/in-progress status.

I then created a second job with only these clips (11 or so), and they transferred successfully.

The problem I have is that I still have another 1.5 million clips to go.

What is interesting is that the job errored out when it started to hit clips around 100 MB in size, which is still very small IMO. I'm not sure if I'm hitting some sort of network problem or a CAS limitation.

holgerjakob commented 4 years ago

Hi, you can re-run a job; this will retry the errored-out clips. If the service crashes and objects remain in transfer or verification, update their status to error or to null, and these will also be retried in a re-run.
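As a sketch, resetting those stuck rows could look like the following (the table name and the exact status strings for in-flight rows are assumptions; check `SELECT DISTINCT status FROM <table>` against your database before running anything):

```shell
#!/bin/sh
# Hypothetical table name and status strings -- verify against your database
TABLE="my_migration"

# Mark rows stuck in-flight as errored so a re-run will retry them
SQL="UPDATE ${TABLE} SET status = 'Error' WHERE status IN ('In Transfer', 'In Verification');"
echo "$SQL"

# When confirmed, apply it and then re-run the job:
# mysql ecs_sync -e "$SQL"
```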

Try the large blob count and no drain blobs on error flags to figure out if these work better.

We've migrated all kinds of objects, from several KB to GB-sized. Pay attention to the number of connections on the access nodes as well as the CPU load. Slowly increase the thread count on ecs-sync; each thread roughly results in three Centera connections, when I last checked.

I normally start with the VM: change the IP if required, and increase memory, disk size, and CPU. Then ecs-sync generally works well. If you have Advanced Retention Management in use, you must use the native transformation.

Best regards, Holger

evergreek commented 4 years ago

It looks like it opens the clip fine; the error is only raised when the FPClip_Open completes:

    1591630355627 2020-06-08 15:32:35.627 [log] 5083.-1033476352 [API] Start FPClip_Open(-,XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX,2)
    1591630355631 2020-06-08 15:32:35.631 [log] 5083.-1032423680 [API] Start FPStream_CreateGenericStream(-,-,-,-,-,-)
    1591630355631 2020-06-08 15:32:35.631 [log] 5083.-1032423680 [API] End FPStream_CreateGenericStream(-,-,-,-,-,-)
    1591630355631 2020-06-08 15:32:35.631 [log] 5083.-1032423680 [API] Start FPStream_GetInfo(-)
    1591630355631 2020-06-08 15:32:35.631 [log] 5083.-1032423680 [API] End FPStream_GetInfo(-)
    1591630355631 2020-06-08 15:32:35.631 [log] 5083.-1032423680 [API] Start FPClip_RawRead(-,-)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPClip_RawRead(-,-)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPStream_CloseWithoutClearError(-)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPStream_CloseWithoutClearError(-)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPPool_GetLastError()
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPPool_GetLastError() --> [0]
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPClip_GetTotalSize(140671588388272)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPClip_GetTotalSize(140671588388272) -> 100192410
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPPool_GetLastError()
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPPool_GetLastError() --> [0]
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPClip_GetCreationDate(-,-,256)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPClip_GetCreationDate(-,-,24)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPPool_GetLastError()
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPPool_GetLastError() --> [0]
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPClip_FetchNext(-)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPClip_FetchNext(-)
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPPool_GetLastError()
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] End FPPool_GetLastError() --> [0]
    1591630355632 2020-06-08 15:32:35.632 [log] 5083.-1032423680 [API] Start FPTag_BlobExists(-)
    1591630355633 2020-06-08 15:32:35.633 [error] 5083.-1033476352 [API] End FPClip_Open(-,XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX,2): Error -10021

Do you specify all your access node IPs in the connection string, or do you rely on one IP and let the SDK probe for the additional ones?

holgerjakob commented 4 years ago

I typically use two so that even if one access node does not respond the job can start and connect. Afterwards the API will do load balancing and failover across all access nodes anyway.
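For illustration only, a source connection string listing two access nodes might look like the following (the `hpp://` form, the addresses, and the .pea path are all assumptions; check the connection string syntax documented for your Centera SDK / ecs-sync version):

```
hpp://10.1.1.21,10.1.1.22?/path/to/profile.pea
```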

twincitiesguy commented 4 years ago

Note that CAS migrations can be very memory intensive, depending on the thread count, clip size, and (believe it or not) the Centera cluster size. We have seen instances where migrations need upward of 32GB. Also, we have seen that keeping the clip list under about 4 million is a good idea, although that doesn't seem to be the problem in your case. And lowering thread count can also help. This seems consistent with @holgerjakob's suggestions.

If you increase the memory on the VM, you will also have to edit the /etc/init.d/ecs-sync script and increase the -Xmx argument passed to java (this should be 4GB less than the physical RAM). Then restart the ecs-sync service (sudo systemctl restart ecs-sync).
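A sketch of that heap bump (the sample line, the `-Xmx` pattern, and the 28g value for a hypothetical 32GB VM are illustrative; verify what the actual java invocation in /etc/init.d/ecs-sync looks like before editing it):

```shell
#!/bin/sh
# Demonstrate the substitution on a sample line first
SCRIPT_LINE='java -Xmx12g -jar ecs-sync.jar'
echo "$SCRIPT_LINE" | sed 's/-Xmx[0-9]*[gGmM]/-Xmx28g/'

# Then apply the same substitution in place and restart the service:
# sudo sed -i 's/-Xmx[0-9]*[gGmM]/-Xmx28g/' /etc/init.d/ecs-sync
# sudo systemctl restart ecs-sync
```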

evergreek commented 4 years ago

I have 16GB on this VM - it looks like it is barely utilizing any memory.

I lowered the job from 16 to 8 threads. Should I adjust the drain blob size/count settings if the job fails again? How low can I go on the streams?

This is connecting to a "small" Centera with only 4 access nodes, and to an ECS with 8 nodes.

(screenshot: cas migration)

twincitiesguy commented 3 years ago

closing due to inactivity