RockefellerArchiveCenter / pyfc4

Python client for Fedora Commons 4
MIT License

PCDMObject create_file() method stalls #94

Closed — helrond closed 6 years ago

helrond commented 6 years ago

When using the PCDM plugin, I'm experiencing unexpected behavior in the create_file method of PCDMObject. The following code will stall after the first loop, until whatever timeout is set for Fedora in Tomcat:

#!/usr/bin/env python

import mimetypes
from os.path import basename
from pyfc4 import models as fcrepo
from pyfc4.plugins.pcdm import models as pcdm

client = fcrepo.Repository(root='http://fedora:8080/fedora/rest/', username='user', password='pass')

for i in range(500):
    container = pcdm.PCDMObject(repo=client)
    container.create()
    mimetype = mimetypes.guess_type("file.pdf")[0]
    with open("file.pdf", 'rb') as file_data:
        new_binary = container.create_file('filename', specify_uri=True, data=file_data, mimetype=mimetype)

I've tested this with a bash script as well as another script which just uses the requests library (happy to share both), and I've pretty definitively narrowed it down to the new_binary = container.create_file('filename', specify_uri=True, data=file_data, mimetype=mimetype) line. I believe this may come down to the data parameter, but I'm not sure. Any ideas of where to start troubleshooting?

ghukill commented 6 years ago

Thanks for noting this. Just to clarify -- in process of setting up environment to test -- is the first loop successful, e.g. i == 0? or is even the first object not created?

helrond commented 6 years ago

Yes the first test works fine!

ghukill commented 6 years ago

Yep, I'm able to recreate. I'm seeing the following errors coming from Fedora after the timeout is exceeded, between iterations:

SEVERE: An I/O error has occurred while writing a response message entity to the container output stream.
org.glassfish.jersey.server.internal.process.MappableException: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30004/30000 ms
    at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:92)
    at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162)
    at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1130)
    at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:711)
    at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:444)
    at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:434)
    at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:329)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:315)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:297)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:267)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:473)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:427)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:388)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:341)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30004/30000 ms
    at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:234)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:133)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:347)
    at org.glassfish.jersey.servlet.internal.ResponseWriter$NonCloseableOutputStreamWrapper.write(ResponseWriter.java:325)
    at org.glassfish.jersey.message.internal.CommittingOutputStream.write(CommittingOutputStream.java:229)
    at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$UnCloseableOutputStream.write(WriterInterceptorExecutor.java:299)
    at org.glassfish.jersey.message.internal.ReaderWriter.writeTo(ReaderWriter.java:116)
    at org.glassfish.jersey.message.internal.AbstractMessageReaderWriterProvider.writeTo(AbstractMessageReaderWriterProvider.java:79)
    at org.glassfish.jersey.message.internal.InputStreamProvider.writeTo(InputStreamProvider.java:105)
    at org.glassfish.jersey.message.internal.InputStreamProvider.writeTo(InputStreamProvider.java:60)
    at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:265)
    at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:250)
    at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162)
    at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:106)
    at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162)
    at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:86)
    ... 39 more
Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 30004/30000 ms
    at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:156)
    at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

Edited: removed errors about colliding filenames; that was my incorrect testing, as the container is created inside the loop.

Not immediately clear what's causing the timeout, will do some digging! Wondering if the file upload for the Binary model is somehow left open, resulting in this java.util.concurrent.TimeoutException: Idle timeout expired?

ghukill commented 6 years ago

Remembering now, this is about where I left off with the PCDM plugin work, as these commented out tests suggest.

So, might not be an obvious or quick fix, but appreciating the opportunity to return to this.

Confirmed it's related to file I/O, as a simple loop with throwaway text and mimetype text/plain works just fine.

helrond commented 6 years ago

FYI, in case it helps, replacing the problem line with:

new_binary = requests.post("{}/files/".format(uri), files={"file": file_data}, headers={"Content-Type": "application/pdf"})

resolves the issue as well. Note the files argument.
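For anyone following along, the difference between the two keyword arguments can be inspected without hitting Fedora at all. A quick sketch using requests' PreparedRequest (the URL is a placeholder, the bytes stand in for the PDF):

```python
import io
import requests

payload = io.BytesIO(b"%PDF-1.4 fake pdf bytes")  # stand-in for the PDF

# data= sends the file object as the raw request body, untouched
raw = requests.Request(
    "POST", "http://example.org/files/",
    data=payload, headers={"Content-Type": "application/pdf"},
).prepare()

# files= wraps the same object in a multipart/form-data body instead
payload.seek(0)  # rewind before reuse
multipart = requests.Request(
    "POST", "http://example.org/files/",
    files={"file": payload},
).prepare()

print(raw.headers["Content-Type"])                      # application/pdf
print(multipart.headers["Content-Type"].split(";")[0])  # multipart/form-data
```

So the two calls put very different bodies on the wire, which may explain why one transfer closes out cleanly and the other doesn't.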

ghukill commented 6 years ago

That's very helpful to know! My hunch is that the requests library treats files and data differently, in such a way that data, when pulling from a file object, does not close the transfer correctly with FC, and you get the timeout.

Might look into detecting if it's a file object and using files instead? My guess is that it does work for a single upload, if there is not another immediately following, as is the case with that loop.

ghukill commented 6 years ago

Well @helrond, this might have been a comedy of errors and terrible memory on my part, apologies. Poking around a bit last night, remembered there had been considerable revisiting of the PCDM plugin last September in the pcdmrework branch, but was never merged.

I believe the problem you rightly pointed out is addressed with these changes. Namely, the create_file() method is gone, in favor of an entirely new class to handle binary files for PCDM objects, PCDMFile, that is a wrapper around Binary and NonRDFResource.

I've included a test to attempt a binary file creation loop, much like your iteration, using a file object:

    def test_create_file_fileobject_loop(self):

        for x in range(10):

            # open README.md as file object
            with open('pyfc4/plugins/pcdm/README.md','rb') as f:

                # create readme binary as file for green in /files
                readme = pcdm.models.PCDMFile(repo, '%s/green/files/readme_%s' % (testing_container_uri,x), binary_data=f, binary_mimetype='text/plain')
                readme.create(specify_uri=True)
                assert readme.exists

            # retrieve
            readme = repo.get_resource(readme.uri)
            assert type(readme) == pcdm.models.PCDMFile

            # assert in green's files
            green = repo.get_resource('%s/green' % testing_container_uri)
            assert readme.uri in green.files

And it appears to work, without stalling. The primary difference I see, is that your loop was creating a new PCDMObject each time, where this is focusing on adding a bunch of PCDMFiles to the same object.

Also, there is documentation already written with conventions for using the PCDM plugin, which might be helpful.

Going to merge these into dev and master, and bump the version to v0.3. Will update here when that's done.

ghukill commented 6 years ago

Alrighty, version bumped with the PCDM updates merged.

And here is the README for the PCDM plugin: https://github.com/ghukill/pyfc4/tree/master/pyfc4/plugins/pcdm

Hopefully that will address the issue, but if not, can continue digging in.

ghukill commented 6 years ago

Closing issue, but can reopen if passing file objects to PCDMObjects still presents an issue.

helrond commented 6 years ago

Unfortunately, I'm still experiencing this issue. It seems to be somewhat related to file size, as files in the 200 KB range seem to work fine. I'm experiencing the hanging behavior with files above 400 KB (I haven't found a clear demarcating line yet, those are approximate ranges).

Here's the code I'm running:

#!/usr/bin/env python

import mimetypes
import requests
import time
from uuid import uuid4
from os.path import basename
from pyfc4 import models as fcrepo
from pyfc4.plugins.pcdm import models as pcdm

root = 'http://fedora:8080/fedora/rest/'
client = fcrepo.Repository(root=root, username='user', password='pass')

for i in range(500):
    start_time = time.time()
    container_uri = "{}{}".format(root, uuid4())
    container = pcdm.PCDMObject(client, uri=container_uri)
    container.create(specify_uri=True)
    print("container created: {}".format(container.uri_as_string()))
    mimetype = mimetypes.guess_type("file.pdf")[0]
    print("mimetype determined: {}".format(time.time()-start_time))
    with open("file.pdf", 'rb') as f:
        new_binary = pcdm.PCDMFile(client, "{}/files/{}".format(container_uri, 'file.pdf'), binary_data=f, binary_mimetype=mimetype)
        new_binary.create(specify_uri=True)
        print(new_binary.uri_as_string())
    print("*** Time to execute: {} seconds".format(time.time()-start_time))

And I've attached the file I'm attempting to POST below: file.pdf

ghukill commented 6 years ago

Gah! Thanks for the update. It was probably wishful thinking that the PCDM reworking might have addressed this, without really digging in.

But saw your PR - thank you! - so will keep this issue closed, and potentially open more appropriate ones as mentioned in your PR notes.

helrond commented 6 years ago

@ghukill I think I may have introduced an even worse error with #96 - apparently Fedora has trouble handling multipart form-data, which makes sense since you wouldn't want to POST multiple files as binary data to the same NonRDFSource. This means that files which get posted up are corrupted, because they have the wrong content-length. I'm going to do a bit more investigating and see if I can figure out what's going on.

ghukill commented 6 years ago

Thanks for the update! I'm ashamed to say I didn't test or notice myself. Thanks for looking into it.

helrond commented 6 years ago

OK, I think I have a line on this. Looking more closely at the http_request method of the API class, I realized it's using a Session object. Digging into the requests docs, I found this:

Note that connections are only released back to the pool for reuse once all body data has been read; be sure to either set stream to False or read the content property of the Response object.

I separately tried both of these approaches (adding print(response.content) at line 537 and explicitly setting stream=False on line 535) and both worked.

I also tried replacing that entire block with a single line, which also worked: response = getattr(requests, verb.lower())(uri, auth=(self.repo.username, self.repo.password), data=data, headers=headers, files=files)

Do you have a preference for a PR? It doesn't seem like the Session object is doing anything special, since it's getting created each time this method is called (right?), and there doesn't seem to be a performance hit (although more testing would probably be good) so I guess I'd lean towards the latter approach?
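A minimal, self-contained sketch of that latter approach (the function name and signature here are assumed for illustration; the real method is http_request on pyfc4's API class):

```python
import requests

def http_request(verb, uri, auth=None, data=None, headers=None, files=None):
    # Dispatch to requests.get / requests.post / ... by verb string.
    # Each top-level call opens and fully closes its own connection,
    # so no pooled Session connection is left waiting on an unread body.
    return getattr(requests, verb.lower())(
        uri, auth=auth, data=data, headers=headers, files=files)
```

The getattr lookup preserves the ability to pass the HTTP verb around as a string, which is the one thing the Session-based version was buying.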

ghukill commented 6 years ago

Ah, interesting. Just out of curiosity, does addressing the sessions -- either by avoiding them altogether, or closing them by reading content -- help with the performance and hanging issues? I could imagine, if a bunch of sessions are left open, they would pile up.

As for establishing a Session object in the first place, I really wish I could remember why it seemed important at the time, but I can't. Though, as I type, I'm inclined to think it was the ability to pass the HTTP verb as a string. But I love the getattr and lower() to handle that, which is even more elegant than using a hash. I'm all for dropping the Session object altogether, for all the reasons you point out.

Thanks for tracking this down, and happy to adopt whatever makes sense to you! It's nice: your comments and bug tracking have been honing in on areas that needed revisiting.

helrond commented 6 years ago

Yes, sorry -- by "it works" I meant "it addresses the hanging issue". Based on the requests docs, I think the Session object was keeping the connection alive when Fedora would prefer a new connection. PR coming up!

ghukill commented 6 years ago

Awesome, thanks for the clarification. I wait with bated breath....

helrond commented 6 years ago

OK, I think I realized why you used the session object...it's because Fedora accepts MOVE and COPY HTTP verbs, which the requests library doesn't allow...Back to the drawing board.

ghukill commented 6 years ago

Ahhh.... thanks for all your poking around on this.

Maybe a hash could still work? Something like:

verb_hash = {
    'POST': requests.post,
    'GET': requests.get,
    'MOVE': requests.post,  # note the translation of the "MOVE" verb to requests.post
    ...
}

And then it could be called via a slight modification of your one-liner:

response = verb_hash[verb.upper()](uri, auth=(self.repo.username, self.repo.password), data=data, headers=headers, files=files)
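Alternatively -- untested against Fedora, just an observation about the requests API -- the top-level requests.request() and Request.prepare() appear to pass arbitrary verb strings through unchanged, so MOVE and COPY may not need translating at all:

```python
import requests

# requests hands the verb string to the underlying Session unchanged
# (it only uppercases it), so non-standard verbs like MOVE survive intact.
# The Destination header follows Fedora's MOVE convention; URLs are placeholders.
prepared = requests.Request(
    "MOVE", "http://example.org/source",
    headers={"Destination": "http://example.org/target"},
).prepare()

print(prepared.method)  # MOVE
```

Worth verifying against a live repository before relying on it, of course.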

helrond commented 6 years ago

OK, I think this whole thing turned out to be a case of me not reading the documentation carefully. The thing that's causing the hang is part of the refresh() routine you've built into the create() and update() methods - a GET request is made to update the object, and when there is binary content, the stream argument is set to True (which makes sense). The problem is that this request doesn't close out. (The reason my earlier code worked was that I omitted the stream argument from the request by mistake...)

I think there are a couple ways of addressing this:

  1. Make auto_refresh false by default
  2. add a refresh_binary flag in create() like you've done for update()

Maybe there are other ways too? In the meantime I've submitted a PR that gets the PCDM plugin to respect the auto_refresh argument in _post_create() hooks.
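To illustrate the underlying fix (the helper here is hypothetical, not pyfc4 API): the streamed response just needs to be drained or closed so the connection gets released back to the pool:

```python
def release_streamed_response(resp):
    """Ensure a streamed response's connection is released.

    Per the requests docs, a connection only returns to the pool once
    the body has been fully read, or the response is closed. Works with
    any response-like object exposing .content and .close().
    """
    try:
        _ = resp.content  # reading the body consumes it
    finally:
        resp.close()      # closing guarantees the socket is released
```

In pyfc4's case the dangling response is the one stored on the binary object after the stream=True GET fired by refresh().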

ghukill commented 6 years ago

Finally had a few moments to fire up my environment and see if I can follow along. And I think I do...

Merged the PR that propagated the auto_refresh flag to the PCDM plugin, which was a great idea, thank you. And in doing so, was able to sidestep the stalling issue by setting auto_refresh=False for the create methods in the following slight modification of one of your examples from above:

for i in range(10):
    start_time = time.time()
    container_uri = "{}/TEMPORARY2/{}".format(root, uuid4())
    container = pcdm.PCDMObject(client, uri=container_uri)
    container.create(specify_uri=True, auto_refresh=False, ignore_tombstone=True)
    print("container created: {}".format(container.uri_as_string()))
    mimetype = mimetypes.guess_type("file.pdf")[0]
    print("mimetype determined: {}".format(time.time()-start_time))
    with open("file.pdf", 'rb') as f:
        new_binary = pcdm.PCDMFile(client, "{}/files/{}".format(container_uri, 'file.pdf'), binary_data=f, binary_mimetype=mimetype)
        new_binary.create(specify_uri=True, auto_refresh=False, ignore_tombstone=True)
        print(new_binary.uri_as_string())
    print("*** Time to execute: {} seconds".format(time.time()-start_time))

This is because, as I think you rightly pointed out but re-stating for my own clarity, this auto_refresh flag is passed along to _handle_create, and thus avoids the update() that would be fired. And, thanks to your PR, this trickles down to the _post_create hook for the PCDM objects avoiding that potential update() as well.
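A rough sketch of that control flow (method and hook names follow the thread -- create() calls _handle_create, then optionally refresh(), then the _post_create hook -- but the logic is simplified, not the actual pyfc4 implementation):

```python
class Resource:
    """Hypothetical simplification of pyfc4's create/refresh flow."""

    def __init__(self):
        self.refreshed = False

    def _handle_create(self, specify_uri):
        pass  # POST/PUT the resource to Fedora

    def refresh(self):
        # In pyfc4 this re-GETs the resource; for binaries, stream=True
        # leaves an un-drained response -- the source of the idle timeout.
        self.refreshed = True

    def _post_create(self, auto_refresh):
        pass  # plugin hook; with the PR, it now also honors auto_refresh

    def create(self, specify_uri=False, auto_refresh=True):
        self._handle_create(specify_uri)
        if auto_refresh:
            self.refresh()
        self._post_create(auto_refresh=auto_refresh)

r = Resource()
r.create(auto_refresh=False)
print(r.refreshed)  # False -- no streamed GET, no dangling connection
```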

Okay, phew, good. Using that PDF, was able to loop through and create n-number with sub-second times, no waiting, no stalling.

I confirmed this by setting the stream under the parse_binary method to False, and sure enough, even with all auto_refresh set to True, works like a champ. Good catch again, that's the cause of the stalling.

But, to your suggestions. Good question. Setting auto_refresh False by default is tempting. Though the open requests response object at new_binary.binary.data is most noticeable and problematic in a loop, it's nonetheless always present. It's prohibitively slow to have stream set to False and load potentially very large data, but this open requests response object is not great either.

I was curious if the streamed connection could be manually closed, as a test, but not having much luck. Wondering if the connection is released to the requests pool for the client, but FC4 still sees an open connection.

Though I like the idea of setting a refresh_binary flag for create(), it would almost have to default to False, which feels as though it would be confusing.

Need to step away at the moment, but ruminating on this, and I would propose just using auto_refresh=False when doing work with binary objects in bulk, which is faster anyhow.

Will revisit in the morning, and at the very least, bump version with this PR. Big thanks again for all your investigative work and substantive PRs.

ghukill commented 6 years ago

Alrighty @helrond, did a little work normalizing how some of the auto_refresh flags are set, and updating the unit tests. Bumped version to v0.6: https://pypi.org/project/pyfc4/0.6/.

Still feels a bit unfinished, but the flags and parameters are there to get what's needed done. Specifically, setting auto_refresh to False when performing create() or update() should be enough to avoid hanging timeouts from a requests response dangling from the .binary object.

helrond commented 6 years ago

Great, thanks @ghukill - I think that's the best solution!