EMCECS / ecs-sync

ecs-sync is a bulk copy utility that can move data between various systems in parallel
Apache License 2.0

ECSSYNC keeps crashing #93

Closed Faizel2 closed 3 months ago

Faizel2 commented 1 year ago

ECSSync keeps crashing with the error below during a CAS migration. We upgraded to the latest version. Any thread count above one seems to crash the session. Sufficient memory and resources are available.

Please advise

Error in `java': free(): invalid next size (normal): 0x00007f8518004490
======= Backtrace: =========
/lib64/libc.so.6(+0x7c619)[0x7f864e7a3619]
/usr/local/Centera_SDK/lib/64/libFPStreams64.so(_ZN20FPBasicGenericStream13prepareBufferEv+0x1c2)[0x7f85e4f1cd42]
/usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN22HPPReadBlobTransaction3runEv+0x4a1)[0x7f85e46beab1]
/usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN7Cluster8readBlobEP20FPBasicGenericStreamR10FPClipGuidR10FPBlobGuidllllR13FPClipContextR12FPTagContext+0x1f7)[0x7f85e465ebb7]
/usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN12ClusterCloud8readBlobEP20FPBasicGenericStreamR10FPClipGuidR10FPBlobGuidR13FPClipContextR12FPTagContextllll+0x1de)[0x7f85e466a19e]
/usr/local/Centera_SDK/lib/64/libFPCore64.so(_ZN5FPTag8BlobReadER20FPBasicGenericStreamlli+0x9f0)[0x7f85e46ab810]
/usr/local/Centera_SDK/lib/64/libFPCore64.so(_Z22_FPTag_BlobReadPartialP5FPTagP20FPBasicGenericStreamlll+0x15)[0x7f85e4688ec5]
/usr/local/Centera_SDK/lib/64/libFPLibrary64.so.3.4.757(FPTag_BlobReadPartial+0xb0)[0x7f85e5176640]
/usr/local/Centera_SDK/lib/64/libFPLibrary64.so.3.4.757(Java_com_filepool_natives_FPLibraryNative_FPTag_1BlobReadPartial+0x94)[0x7f85e51a06a4]
[0x7f8639f2bdd9]
======= Memory map: ========
00400000-00401000 r-xp 00000000 fd:00 269292636 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre/bin/java
00600000-00601000 r--p 00000000 fd:00 269292636 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre/bin/java
00601000-00602000 rw-p 00001000 fd:00 269292636 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre/bin/java
00f61000-00f82000 rw-p 00000000 00:00 0 [heap]

xiaoxin-ren commented 1 year ago

Hi @Faizel2, the crash above is inside the CAS library, and there is not much to indicate whether it is an ecs-sync issue, a CAS library issue, or a CAS platform issue. Did it work before you upgraded ecs-sync? There is no CAS-related code change in recent releases, so it is not straightforward to tell whether it could be a regression. If you look at the errors in /var/log/ecs-sync/ecs-sync.log, you should see which objects failed. Does it work if you use the CAS tool alone? You'll need to identify whether it is a read issue on the source storage or a write issue on the target storage. If you are using Professional Services (PS) for the migration, please ask PS to follow up on the troubleshooting. Otherwise, please open a service ticket against the migration platform so that they can help narrow down the issue and route it to the right person for further help.
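
Editor's sketch: the log check suggested above could be scripted roughly as follows. This is a minimal sketch, not part of ecs-sync. The function name `find_error_lines` is hypothetical, and the assumption that failure lines contain the literal text `ERROR` is a guess about the log format, so adjust the marker to whatever the log actually shows.

```python
from pathlib import Path


def find_error_lines(log_path, marker="ERROR"):
    """Return (line_number, text) pairs for log lines that contain `marker`.

    Assumption: failures are logged on lines containing the marker text.
    The real ecs-sync log layout may differ.
    """
    hits = []
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, "r", encoding="utf-8", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if marker in line:
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Log path as quoted in the comment above.
    log = Path("/var/log/ecs-sync/ecs-sync.log")
    if log.exists():
        for number, text in find_error_lines(log):
            print(f"{number}: {text}")
```

Comparing the reported lines across runs would show whether the same objects fail every time or whether the failures move around.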

Faizel2 commented 1 year ago

Hi Ren, we had the same issue before the upgrade from 3.2.7 to 3.5.2. How can we use the CAS tool alone to verify? Professional Services is involved, and we opened a service ticket with ECS Support, but they cannot assist. There is no option for the migration platform; please advise how to do this.

xiaoxin-ren commented 1 year ago

So this is not a regression in ecs-sync. A few options for how to proceed:

  1. PS should know how to use the CAS tool. What is the result of their investigation? Is it a CAS library deployment issue or a storage server issue (Centera, ECS)?
  2. The CAS library is provided by Centera, so we recommend opening a service ticket against Centera if the issue is related to the CAS library or to CAS reads from the source storage.
  3. ECS Support can help you identify any access issue related to the CAS bucket in ECS. Please note that ecs-sync is an open source project and is outside the scope of ECS customer support.
  4. We'll help here if the issue is narrowed down to ecs-sync.

xiaoxin-ren commented 1 year ago

@Faizel2, I see PS is currently investigating the issue. The provided ecs-sync log shows that the CAS SDK and ECSSync work fine. It's a random crash in the CAS SDK, not related to a specific object. Please wait for a further update from PS.

holgerjakob commented 1 year ago

Is there an update on this? We have some large multi-billion CAS migrations coming up from Gen3 to EXF900. We are asking so that we can prepare an ECSSync VM and hopefully avoid this issue.

xiaoxin-ren commented 1 year ago

@holgerjakob, can you please reach out to PS or customer support for an update? An engineer investigated the issue and found that the crash was caused by syncing huge blobs (4-5 GB) while running with the default 16 threads. The workaround is to sync again with increased memory and fewer threads. You can restore the thread setting after the huge blobs are successfully copied. The investigation was done on Nov 10, 2022. I'm curious what caused the communication gap?
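
Editor's sketch: the workaround described above (copy the huge blobs with fewer threads, then restore the normal thread count) can be illustrated with a generic two-pass copy. This is not how ecs-sync is implemented or configured. The names `split_by_size`, `copy_in_two_passes`, and `copy_fn` are hypothetical, the 4 GiB threshold is an assumption taken from the 4-5 GB figure in the comment, and the single thread for the huge pass is an assumption based on the original report that anything above one thread crashed.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: treat anything of 4 GiB or more as a "huge blob".
HUGE_BLOB_BYTES = 4 * 1024**3


def split_by_size(objects, threshold=HUGE_BLOB_BYTES):
    """Split (name, size_in_bytes) pairs into (regular, huge) lists."""
    regular = [obj for obj in objects if obj[1] < threshold]
    huge = [obj for obj in objects if obj[1] >= threshold]
    return regular, huge


def copy_in_two_passes(objects, copy_fn, default_threads=16, huge_threads=1):
    """Copy huge blobs first with few threads, then the rest with the default.

    `copy_fn` is a caller-supplied placeholder that copies one object.
    Results are returned huge-first, each pass in input order.
    """
    regular, huge = split_by_size(objects)
    results = []
    for batch, threads in ((huge, huge_threads), (regular, default_threads)):
        if not batch:
            continue
        # A fresh pool per pass caps how many large buffers are live at once.
        with ThreadPoolExecutor(max_workers=threads) as pool:
            results.extend(pool.map(copy_fn, batch))
    return results
```

Limiting concurrency only for the large objects keeps peak buffer memory down without slowing the bulk of the migration.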

holgerjakob commented 1 year ago

Hi Ren, that's good to hear. We configure VMs with at least 32 GB of memory, more often 64 GB, and set the Xmx memory parameter accordingly. It was not me who opened the ticket. I am just sorting things out prior to using ECSSync on some upcoming large migrations to EXF900.

Thanks for responding Holger

holgerjakob commented 1 year ago

Dear all

We invested some time in a new install guide. In case you are interested: https://www.backup.ch/wp-content/uploads/sites/5/ECS-Sync-Installation-V1.0.pdf

Adjusting the memory parameter is covered in it. If any of the files are not accessible, we can provide links to them.

Take care, Holger

dunedodo commented 3 months ago

As it has been a long time since the last response, I'll close the ticket. Please feel free to reopen it if you hit the same issue.