helrond closed this issue 6 years ago
Thanks for noting this. Just to clarify -- in the process of setting up an environment to test -- is the first loop iteration successful, e.g. `i == 0`? Or is even the first object not created?
Yes the first test works fine!
Yep, I'm able to recreate. I'm seeing the following errors coming from Fedora after the timeout is exceeded, between iterations:
```
SEVERE: An I/O error has occurred while writing a response message entity to the container output stream.
org.glassfish.jersey.server.internal.process.MappableException: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30004/30000 ms
	at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:92)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162)
	at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1130)
	at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:711)
	at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:444)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:434)
	at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:329)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
	at org.eclipse.jetty.server.Server.handle(Server.java:497)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
	at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30004/30000 ms
	at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234)
	at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:133)
	at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:347)
	at org.glassfish.jersey.servlet.internal.ResponseWriter$NonCloseableOutputStreamWrapper.write(ResponseWriter.java:325)
	at org.glassfish.jersey.message.internal.CommittingOutputStream.write(CommittingOutputStream.java:229)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$UnCloseableOutputStream.write(WriterInterceptorExecutor.java:299)
	at org.glassfish.jersey.message.internal.ReaderWriter.writeTo(ReaderWriter.java:116)
	at org.glassfish.jersey.message.internal.AbstractMessageReaderWriterProvider.writeTo(AbstractMessageReaderWriterProvider.java:79)
	at org.glassfish.jersey.message.internal.InputStreamProvider.writeTo(InputStreamProvider.java:105)
	at org.glassfish.jersey.message.internal.InputStreamProvider.writeTo(InputStreamProvider.java:60)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:265)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:250)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162)
	at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:106)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162)
	at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:86)
	... 39 more
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 30004/30000 ms
	at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:156)
	at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
```
Edited: removed errors about colliding filenames; that was my incorrect testing, as the container is created in the loop.
Not immediately clear what's causing the timeout, will do some digging! Wondering if the file upload for the `Binary` model is somehow left open, resulting in this `java.util.concurrent.TimeoutException: Idle timeout expired`?
Remembering now, this is about where I left off with the PCDM plugin work, as these commented out tests suggest.
So, it might not be an obvious or quick fix, but I appreciate the opportunity to return to this.
Confirmed it's related to file I/O, as a simple loop with throwaway text and mimetype `text/plain` works just fine.
FYI, in case it helps, replacing the problem line with:

```python
new_binary = requests.post("{}/files/".format(uri), files={"file": file_data}, headers={"Content-Type": "application/pdf"})
```

resolves the issue as well. Note the `files` argument.
That's very helpful to know! My hunch is that the requests library treats `files` and `data` differently, such that `data`, when pulling from a file object, does not close the transfer correctly with FC and you get the timeout. Might look into detecting a file object and using `files` instead? My guess is that it does work for a single upload, if there is not another immediately following, as is the case with that loop.
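That detection could be sketched roughly like this. This is a hypothetical helper, not pyfc4 API -- the `binary_kwargs` name and the `io.IOBase` check are my assumptions -- deciding which requests keyword to use based on the payload type:

```python
import io

def binary_kwargs(payload, mimetype):
    """Return requests keyword arguments suited to the payload type.

    Hypothetical helper: file-like objects go through `files` (multipart),
    while raw bytes/str go through `data` with an explicit Content-Type.
    """
    if isinstance(payload, io.IOBase):
        return {"files": {"file": payload}}
    return {"data": payload, "headers": {"Content-Type": mimetype}}
```

A caller could then write `requests.post(uri, **binary_kwargs(f, mimetype))` without caring whether `f` is an open file or a bytes blob.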
Well @helrond, this might have been a comedy of errors and terrible memory on my part, apologies. Poking around a bit last night, I remembered there had been considerable revisiting of the PCDM plugin last September in the `pcdmrework` branch, but it was never merged.
I believe the problem you rightly pointed out is addressed with these changes. Namely, the `create_file()` method is gone, in favor of an entirely new class to handle binary files for PCDM objects, `PCDMFile`, which is a wrapper around `Binary` and `NonRDFResource`.
I've included a test to attempt a binary file creation loop, much like your iteration, using a file object:
```python
def test_create_file_fileobject_loop(self):
    for x in range(10):
        # open README.md as file object
        with open('pyfc4/plugins/pcdm/README.md', 'rb') as f:
            # create readme binary as file for green in /files
            readme = pcdm.models.PCDMFile(repo, '%s/green/files/readme_%s' % (testing_container_uri, x), binary_data=f, binary_mimetype='text/plain')
            readme.create(specify_uri=True)
            assert readme.exists
            # retrieve
            readme = repo.get_resource(readme.uri)
            assert type(readme) == pcdm.models.PCDMFile
            # assert in green's files
            green = repo.get_resource('%s/green' % testing_container_uri)
            assert readme.uri in green.files
```
And it appears to work, without stalling. The primary difference I see is that your loop was creating a new `PCDMObject` each time, where this is focusing on adding a bunch of `PCDMFile`s to the same object.
Also, there is documentation already written with conventions for using the PCDM plugin, which might be helpful.
Going to merge these into `dev` and `master`, and bump the version to `v0.3`. Will update here when that's done.
Alrighty, version bumped with the PCDM updates merged.
And here is the README for the PCDM plugin: https://github.com/ghukill/pyfc4/tree/master/pyfc4/plugins/pcdm
Hopefully that will address the issue, but if not, can continue digging in.
Closing issue, but can reopen if passing file objects to PCDMObjects still presents an issue.
Unfortunately, I'm still experiencing this issue. It seems to be somewhat related to file size, as files in the 200 KB range seem to work fine. I'm experiencing the hanging behavior with files above 400 KB (I haven't found a clear demarcating line yet, those are approximate ranges).
Here's the code I'm running:
```python
#!/usr/bin/env python
import mimetypes
import requests
import time
from uuid import uuid4
from os.path import basename
from pyfc4 import models as fcrepo
from pyfc4.plugins.pcdm import models as pcdm

root = 'http://fedora:8080/fedora/rest/'
client = fcrepo.Repository(root=root, username='user', password='pass')

for i in range(500):
    start_time = time.time()
    container_uri = "{}{}".format(root, uuid4())
    container = pcdm.PCDMObject(client, uri=container_uri)
    container.create(specify_uri=True)
    print("container created: {}".format(container.uri_as_string()))
    mimetype = mimetypes.guess_type("file.pdf")[0]
    print("mimetype determined: {}".format(time.time() - start_time))
    with open("file.pdf", 'rb') as f:
        new_binary = pcdm.PCDMFile(client, "{}/files/{}".format(container_uri, 'file.pdf'), binary_data=f, binary_mimetype=mimetype)
        new_binary.create(specify_uri=True)
        print(new_binary.uri_as_string())
    print("*** Time to execute: {} seconds".format(time.time() - start_time))
```
And I've attached the file I'm attempting to POST below: file.pdf
Gah! Thanks for the update. It was probably wishful thinking that the PCDM reworking might have addressed this, without really digging in.
But saw your PR - thank you! - so will keep this issue closed, and potentially open more appropriate ones as mentioned in your PR notes.
@ghukill I think I may have introduced an even worse error with #96 - apparently Fedora has trouble handling multipart form-data, which makes sense since you wouldn't want to POST multiple files as binary data to the same NonRDFSource. This means that files which get posted up are corrupted, because they have the wrong content-length. I'm going to do a bit more investigating and see if I can figure out what's going on.
Thanks for the update! I'm ashamed to say I didn't test or notice myself. Thanks for looking into it.
OK, I think I have a line on this. Looking more closely at the `http_request` method of the `API` class, I realized it's using a Session object. Digging into the requests docs, I found this:

> Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set stream to False or read the content property of the Response object.

I separately tried both of these approaches (adding `print(response.content)` at line 537 and explicitly setting `stream=False` on line 535) and both worked.
I also tried replacing that entire block with a single line, which also worked:

```python
response = getattr(requests, verb.lower())(uri, auth=(self.repo.username, self.repo.password), data=data, headers=headers, files=files)
```
Do you have a preference for a PR? It doesn't seem like the Session object is doing anything special, since it's getting created each time this method is called (right?), and there doesn't seem to be a performance hit (although more testing would probably be good) so I guess I'd lean towards the latter approach?
Ah, interesting. Just out of curiosity, does addressing the sessions -- either by avoiding them altogether, or closing them by reading the content -- help with the performance and hanging issues? I could imagine, if a bunch of sessions are left open, they would pile up.
As for establishing a `Session` object in the first place, I really wish I could remember why it seemed important at the time, but I can't. Though, as I type, I'm inclined to think it was the ability to pass the HTTP verb as a string. But I love the `getattr` and `lower()` to handle that, which is even more elegant than using a hash. I'm all for dropping the `Session` object altogether, for all the reasons you point out.
Thanks for tracking this down, and happy to adopt whatever makes sense to you! It's nice: your comments and bug tracking have been homing in on areas that needed revisiting.
Yes, sorry, by "it works" I meant "it addresses the hanging issue" -- based on the requests docs I think the Session object was keeping the connection alive when Fedora would prefer a new connection. PR coming up!
Awesome, thanks for the clarification. I wait with bated breath....
OK, I think I realized why you used the session object... it's because Fedora accepts `MOVE` and `COPY` HTTP verbs, which the requests library doesn't allow... Back to the drawing board.
Ahhh.... thanks for all your poking around on this.
Maybe a hash could still work? Something like:

```python
verb_hash = {
    'POST': requests.post,
    'GET': requests.get,
    'MOVE': requests.post,  # notice the translation of the "MOVE" verb to requests.post
    ...
}
```

And then it could be called, a la a slight modification of your one-liner?

```python
response = verb_hash[verb.upper()](uri, auth=(self.repo.username, self.repo.password), data=data, headers=headers, files=files)
```
OK, I think this whole thing turned out to be a case of me not reading the documentation carefully. The thing that's causing the hang is part of the `refresh()` routine you've built into the `create()` and `update()` methods: a GET request is made to update the object, and when there is binary content, the `stream` argument is set to `True` (which makes sense). The problem is that this request doesn't close out. (The reason my earlier code worked was that I omitted the `stream` argument from the request by mistake...)
I think there are a couple ways of addressing this:

- setting `auto_refresh` false by default
- adding a `refresh_binary` flag in `create()` like you've done for `update()`
Maybe there are other ways too? In the meantime I've submitted a PR that gets the PCDM plugin to respect the `auto_refresh` argument in `_post_create()` hooks.
Finally had a few moments to fire up my environment and see if I can follow along. And I think I do...
Merged the PR that propagated the `auto_refresh` flag to the PCDM plugin, which was a great idea, thank you. And in doing so, was able to sidestep the stalling issue by setting `auto_refresh=False` for the `create` methods in the following slight modification of one of your examples from above:
```python
for i in range(10):
    start_time = time.time()
    container_uri = "{}/TEMPORARY2/{}".format(root, uuid4())
    container = pcdm.PCDMObject(client, uri=container_uri)
    container.create(specify_uri=True, auto_refresh=False, ignore_tombstone=True)
    print("container created: {}".format(container.uri_as_string()))
    mimetype = mimetypes.guess_type("file.pdf")[0]
    print("mimetype determined: {}".format(time.time() - start_time))
    with open("file.pdf", 'rb') as f:
        new_binary = pcdm.PCDMFile(client, "{}/files/{}".format(container_uri, 'file.pdf'), binary_data=f, binary_mimetype=mimetype)
        new_binary.create(specify_uri=True, auto_refresh=False, ignore_tombstone=True)
        print(new_binary.uri_as_string())
    print("*** Time to execute: {} seconds".format(time.time() - start_time))
```
This is because, as I think you rightly pointed out but re-stating for my own clarity, this `auto_refresh` flag is passed along to `_handle_create`, and thus avoids the `update()` that would be fired. And, thanks to your PR, this trickles down to the `_post_create` hook for the PCDM objects, avoiding that potential `update()` as well.
Okay, phew, good. Using that PDF, was able to loop through and create n-number with sub-second times, no waiting, no stalling.
I confirmed this by setting the `stream` argument under the `parse_binary` method to `False`, and sure enough, even with all `auto_refresh` set to `True`, it works like a champ. Good catch again, that's the cause of the stalling.
But, to your suggestions. Good question. Setting `auto_refresh` to `False` by default is tempting. Though the open `requests.response` object at `new_binary.binary.data` is most noticeable and problematic in a loop, it's nonetheless always present. It's prohibitively slow to have `stream` set to `False` and load potentially very large data, but this open `requests.response` object is not great either.
I was curious if the streamed connection could be manually closed, as a test, but am not having much luck. Wondering if the connection is released to the requests pool for the client, but FC4 still sees an open connection.
Though I like the idea of setting a `refresh_binary` flag for `create()`, it would almost have to default to `False`, which feels as though it would be confusing.
Need to step away at the moment, but ruminating on this, I would propose just using `auto_refresh=False` when doing work with binary objects in bulk, which is faster anyhow.
Will revisit in the morning, and at the very least, bump version with this PR. Big thanks again for all your investigative work and substantive PRs.
Alrighty @helrond, did a little work normalizing some of the `auto_refresh` flags, in how they are set and for unit tests. Bumped version to `v0.6`: https://pypi.org/project/pyfc4/0.6/.
Still feels a bit unfinished, but the flags and parameters are there to get what's needed done. Specifically, setting `auto_refresh` to `False` when performing `create()` or `update()` should be enough to avoid hanging timeouts from a `requests.response` dangling from the `.binary` object.
Great, thanks @ghukill - I think that's the best solution!
When using the PCDM plugin, I'm experiencing unexpected behavior in the `create_file` method of `PCDMObject`. The following code will stall after the first loop, until whatever timeout is set for Fedora in Tomcat. I've tested this with a bash script as well as another script which just uses the `requests` library (happy to share both), and I've pretty definitively narrowed it down to the

```python
new_binary = container.create_file('filename', specify_uri=True, data=file_data, mimetype=mimetype)
```

line. I believe this may come down to the `data` parameter, but I'm not sure. Any ideas of where to start troubleshooting?