dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
276 stars 133 forks source link

[Tape REST API] Stage call failed: HTTP 500 #7508

Open cfgamboa opened 5 months ago

cfgamboa commented 5 months ago

Hello,

It was reported to me the following transfer failure involving the TAPE REST API

https://fts3-atlas.cern.ch:8449/fts3/ftsmon/#/job/7c956118-c58c-11ee-9d7e-fa163e5a92fb

Error reason: STAGING [22] [Tape REST API] Stage call failed: HTTP 500 : Unexpected server error: 500 : {"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}

The file at the storage element is available during this transfer, other transfers successfully retrieve the file

02.07 04:11:36 [door:WebDAV2-dcdoor31-internalipv6@webdav2-dcdoor31_httpsDomain:request] ["usatlas4":6439:69907:2620:0:210:9800:0:0:0:37] [0000D0A4CA2616BA43618DEAD6188496E3C3,2718524222] [/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc12_8TeV/NTUP_JETMET/e1126_s1469_s1470_r3542_r3549_p1344/mc12_8TeV.147913.Pythia8_AU2CT10_jetjet_JZ3W.merge.NTUP_JETMET.e1126_s1469_s1470_r3542_r3549_p1344_tid01115537_00/NTUP_JETMET.01115537._000467.root.1] MCTAPE:MC@osm 1590 0 {0:""}

All the best, Carlos

lemora commented 5 months ago

Hi Carlos, could you please try to find and post the entire stack trace of the thrown NPE? Also, just to make sure: which dCache version are you running? Thanks. Lea

cfgamboa commented 5 months ago

Hi Lea,

9.2.6 dCache java 11

I will post NPE related traces if produced.

All the best, Carlos

vingar commented 4 months ago

With dcache 9.2.11:

#curl -L  --capath /etc/grid-security/certificates --cert  /tmp/x509up_u0 --cacert  /tmp/x509up_u0 --key /tmp/x509up_u0 -X POST https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage -H "Content-Type: application/json" -d '{"files": [ {"diskLifetime":"P1D","path":"/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3"}]}'
{"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}

gives:

==> /var/log/dcache/dcfrontend01FrontGplazDomain.log <==
12 Mar 2024 10:34:56 (Frontend-dcfrontend01) [] Uncaught exception in thread jetty-458571
java.lang.NullPointerException: null
    at org.dcache.restful.resources.tape.StageResources.submit(StageResources.java:296)
    at jdk.internal.reflect.GeneratedMethodAccessor377.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:469)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:391)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:80)
    at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:253)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:232)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:679)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:392)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:365)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:318)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:554)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.dcache.http.AuthenticationHandler.access$001(AuthenticationHandler.java:55)
    at org.dcache.http.AuthenticationHandler.lambda$handle$0(AuthenticationHandler.java:157)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:361)
    at org.dcache.http.AuthenticationHandler.handle(AuthenticationHandler.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.dcache.http.AbstractLoggingHandler.handle(AbstractLoggingHandler.java:110)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.server.Server.handle(Server.java:516)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:555)
    at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:410)
    at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:164)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
    at java.base/java.lang.Thread.run(Thread.java:829)

and

==> /var/log/dcache/dcfrontend02Domain.log <==
12 Mar 2024 10:34:56 (bulk) [] 044134c0-48a3-44a4-9b8c-f543d067917f - fetchAttributes, callback failure for TARGET [15393521, INITIAL, null][null][CREATED: (C 2024-03-12 10:34:56.862)(S null)(U 2024-03-12 10:34:56.862)(ret 0)][null] null : /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3 (err null null).
vingar commented 4 months ago

One more observation:


# curl --capath /etc/grid-security/certificates --cert /tmp/x509up_u0 --cacert /tmp/x509up_u0 --key /tmp/x509up_u0 https://dcfrontend.usatlas.bnl.gov:3880/api/v1/tape/stage/7cf32cfb-7c15-42f7-9e9b-0cbb1b08a37c 
{
  "id" : "7cf32cfb-7c15-42f7-9e9b-0cbb1b08a37c",
  "createdAt" : 1710253683266,
  "startedAt" : 1710253683278,
  "completedAt" : 1710253683300,
  "files" : [ {
    "path" : "/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3",
    "finishedAt" : 1710253683269,
    "startedAt" : 1710253683269,
    "error" : "CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29",
    "state" : "FAILED"
  } ]```

please not the truncated path. 
DmitryLitvintsev commented 4 months ago

I am pretty sure the file earth is truncated due to db text width limit set on DB end.

I will investigate the NPE.

DmitryLitvintsev commented 4 months ago

Carlos,

Question. Attempts to call stage on any other files in this directory also causes NPE or not?

Dmitry

vingar commented 4 months ago

The path appears also truncated in the path column in the request_target table on the bulk db.

DmitryLitvintsev commented 4 months ago

yes, It is limited, but it should not fail because of that. That's why I asked if it fails on some particular files on any files

vingar commented 4 months ago

Carlos,

Question. Attempts to call stage on any other files in this directory also causes NPE or not?

Dmitry

yes, there is a same error for another file in this directory:

# curl -L  --capath /etc/grid-security/certificates --cert  /tmp/x509up_u0 --cacert  /tmp/x509up_u0 --key /tmp/x509up_u0 -X POST https://dcrontend.usatlas.bnl.gov:3880/api/v1/tape/stage -H "Content-Type: application/json" -d '{"files": [ {"diskLifetime":"P1D","path":"/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000019.root.2"}]}'
{"detail":"Internal error: java.lang.NullPointerException","title":"Internal Server Error","status":"500"}
DmitryLitvintsev commented 4 months ago

OK,

Lets try low hanging fruit. Of course Al has tested the patch where the truncation is used on long paths. AND it was not supposed to matter. But just in case. I can revert this and update DB to accept longer path name.

Will you be able to test the RPM I will make for you? Do you have test / integration system where you can test it independently from production?

Dmitry

P.S.: rationale for path truncation was to keep DB smaller.

vingar commented 4 months ago

Will you be able to test the RPM I will make for you? Do you have test / integration system where you can test it independently from production?

yes, sure

DmitryLitvintsev commented 4 months ago

OK, hers is he rpm

https://drive.google.com/file/d/1VoYKz1MCrI5_TCgZ8BXpVh5bmhqZSoVk/view?usp=sharing

This is top of 9.2. It need to be deployed only where you run bulk

DmitryLitvintsev commented 4 months ago

Having looked the code I see that :

"error" : "CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29"

the error string is simply truncated to 256 characters :

>>> p="CacheException(rc=10001;msg=No such file or directory /fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r29"
>>> len(p)
256
>>> 

So the truncated path _isnot a problem.

BUT , could you check if :

/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3

indeed exists.

I suspect the path should start with /pnfs.

What is absolute path of this file? I think issue has to do with prefix handling. Prefix handling has been fixed some time ago and is contained in RPM above. So test the RPM please.

What will help is this

path

/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3

what is its PNSID

and then, with that PNFSID

\sn pathfinder <PNFSID>

This will help to understand what is happening.

vingar commented 4 months ago

BUT , could you check if :

/fs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3

indeed exists.

No, it doesn't.

I suspect the path should start with /pnfs. yes

> \sn pathfinder 00003BE34500448543C9A3E7CC23149D846C
/pnfs/usatlas.bnl.gov/MCTAPE/atlasgrouptape/mc11_7TeV/NTUP_TOP/e2275_a186_s1571_a145_r2993_p937/mc11_7TeV.110198.McAtNloHerwigpp_UEEE4_MRSTMCal_CT10ME_ttbar_lnulnu.merge.NTUP_TOP.e2275_a186_s1571_a145_r2993_p937_tid01350830_00/NTUP_TOP.01350830._000087.root.3
DmitryLitvintsev commented 4 months ago

OK, give this RPM a try please.

DmitryLitvintsev commented 4 months ago

Any progress on this?

samuambroj commented 4 months ago

Hi,

We experienced the same issue yesterday while testing the upgrade to dCache 9.2.15 in our pre-production instance. We started an RT ticket and @lemora pointed out to this ticket. So, that is the reason why I jump in ;-) We have tested Dmitry's rpm with two configurations:

With just one bulk domain, the error has disappeared, but with the second configuration, the error persists.

If we can be of any further help, please let us know.

Cheers, Samuel

samuambroj commented 4 months ago

As suggested by @DmitryLitvintsev, we have also tested one single bulk domain, with the standard 9.2.15 rpm and in that case the issue has disappeared.

vingar commented 3 months ago

Yes, we moved to a single bulk domain as well, and this addressed the issue. Now, we're waiting to have support for multiple bulk domains ;)