read-errors on Tycho-access

afischer211 commented 9 years ago

I have a running Package-Drone repository (0.10.2) with >10000 artifacts stored in different channels. The database is PostgreSQL 9.x

Sometimes I get errors on loading an OSGi-bundle from a channel over Tycho-build: [ERROR] An error occurred while transferring artifact canonical: osgi.bundle,, from repository /p2/: [ERROR] Unable to read repository at **/p2//plugins///. [ERROR] Internal error: org.eclipse.tycho.repository.local.MirroringArtifactProvider$MirroringFailedException: Could not mirror artifact osgi.bundle,, into the local Maven repository.See log output for details. Read timed out -> [Help 1]

Why the read-access goes over the path with groupid and version? The bundle is located directly in the channels under p2/!?

On the pdrone-console I see the following log-output (the named bundle-jarfile is listed and downloadable in the channel-content-view!): 11:08:03.518 [qtp447351078-124] WARN org.eclipse.jetty.server.HttpChannel [HttpChannel.java:395] /p2//plugins/// javax.servlet.ServletException: javax.servlet.ServletException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.Ru ntimeException: org.eclipse.jetty.io.EofException at org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:88) ~[org.ops4j.pax.web.pax-web-jetty_ 4.1.2.jar:na] at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.server.Server.handle(Server.java:497) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) [org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) [org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) [org.eclipse.jetty.util_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) [org.eclipse.jetty.util_9.2.9.v20150224.jar:9.2.9.v20150224] at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45] Caused by: javax.servlet.ServletException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: org.e clipse.jetty.io.EofException at de.dentrassi.pm.p2.internal.servlet.P2Servlet.process(P2Servlet.java:201) ~[na:na] at de.dentrassi.pm.p2.internal.servlet.P2Servlet.doGet(P2Servlet.java:148) ~[na:na] at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) ~[javax.servlet-api_3.1.0.jar:3.1.0] at de.dentrassi.pm.p2.internal.servlet.P2Servlet.service(P2Servlet.java:73) ~[na:na] at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) ~[javax.servlet-api_3.1.0.jar:3.1.0] at de.dentrassi.osgi.web.servlet.ServletWrapper.service(ServletWrapper.java:64) ~[na:na] at de.dentrassi.osgi.web.servlet.JspServletInitializer$BundleServletWrapper.service(JspServletInitializer.java:53) ~[na:na] at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:808) ~[org.eclipse.jetty.servlet_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) ~[org.eclipse.jetty.servlet_9.2.9.v20150224.jar:9.2.9.v20150224] at org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:70) ~[org.ops4j.pax.web.pax-web-jetty_4.1. 2.jar:na] at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) ~[org.eclipse.jetty.security_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:267) ~[org.ops4j.pax.web.pax-web-jetty_4.1.2.jar:na] at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) ~[org.eclipse.jetty.servlet_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[org.eclipse.jetty.server_9.2.9.v20150224.jar:9.2.9.v20150224] at org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:80) ~[org.ops4j.pax.web.pax-web-jetty_ 4.1.2.jar:na] ... 8 common frames omitted Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: org.eclipse.jetty.io.EofException at de.dentrassi.pm.common.service.AbstractJpaServiceImpl.doWithTransaction(AbstractJpaServiceImpl.java:124) ~[na:na] at de.dentrassi.pm.storage.service.jpa.StorageServiceImpl.streamArtifact(StorageServiceImpl.java:358) ~[na:na] at de.dentrassi.pm.storage.service.jpa.ArtifactImpl.streamData(ArtifactImpl.java:86) ~[na:na] at de.dentrassi.pm.storage.service.util.DownloadHelper.streamArtifact(DownloadHelper.java:64) ~[na:na] at de.dentrassi.pm.p2.internal.servlet.DownloadHandler.process(DownloadHandler.java:97) ~[na:na] at de.dentrassi.pm.p2.internal.servlet.P2Servlet.process(P2Servlet.java:197) ~[na:na] ... 27 common frames omitted Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: org.eclipse.jetty.io.EofException at de.dentrassi.pm.common.service.AbstractJpaServiceImpl.lambda$1(AbstractJpaServiceImpl.java:113) ~[na:na] at de.dentrassi.pm.common.service.AbstractJpaServiceImpl$$Lambda$19/884281141.process(Unknown Source) ~[na:na] at de.dentrassi.pm.common.service.AbstractJpaServiceImpl.doWithManager(AbstractJpaServiceImpl.java:48) ~[na:na] at de.dentrassi.pm.common.service.AbstractJpaServiceImpl.doWithTransaction(AbstractJpaServiceImpl.java:68) ~[na:na] ... 32 common frames omitted Caused by: java.lang.RuntimeException: java.lang.RuntimeException: org.eclipse.jetty.io.EofException at de.dentrassi.pm.storage.service.jpa.blob.FilesystemBlobStoreProcessor.streamArtifact(FilesystemBlobStoreProcessor.java:207) ~[na:na] at de.dentrassi.pm.storage.service.jpa.blob.BlobStore.streamArtifact(BlobStore.java:193) ~[na:na] at de.dentrassi.pm.storage.service.jpa.blob.BlobStore.streamArtifact(BlobStore.java:183) ~[na:na] at de.dentrassi.pm.storage.service.jpa.StorageServiceImpl.lambda$14(StorageServiceImpl.java:367) ~[na:na] at de.dentrassi.pm.storage.service.jpa.StorageServiceImpl$$Lambda$112/878977731.process(Unknown Source) ~[na:na] at de.dentrassi.pm.common.service.AbstractJpaServiceImpl.lambda$1(AbstractJpaServiceImpl.java:80) ~[na:na] ... 35 common frames omitted Caused by: java.lang.RuntimeException: org.eclipse.jetty.io.EofException at de.dentrassi.pm.storage.service.util.DownloadHelper.lambda$0(DownloadHelper.java:88) ~[na:na] at de.dentrassi.pm.storage.service.util.DownloadHelper$$Lambda$111/2059938070.receive(Unknown Source) ~[na:na] at de.dentrassi.pm.storage.service.jpa.blob.BlobStore.lambda$1(BlobStore.java:183) ~[na:na] at de.dentrassi.pm.storage.service.jpa.blob.BlobStore$$Lambda$113/1554537986.accept(Unknown Source) ~[na:na] at de.dentrassi.pm.storage.service.jpa.blob.FilesystemBlobStoreProcessor.streamArtifact(FilesystemBlobStoreProcessor.java:199) ~[na:na] ... 40 common frames omitted Caused by: org.eclipse.jetty.io.EofException: null at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:192) ~[org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:408) ~[org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.io.WriteFlusher.completeWrite(WriteFlusher.java:364) ~[org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.io.SelectChannelEndPoint.onSelected(SelectChannelEndPoint.java:111) ~[org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.io.SelectorManager$ManagedSelector.processKey(SelectorManager.java:636) ~[org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.io.SelectorManager$ManagedSelector.select(SelectorManager.java:607) ~[org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.io.SelectorManager$ManagedSelector.run(SelectorManager.java:545) ~[org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] at org.eclipse.jetty.util.thread.NonBlockingThread.run(NonBlockingThread.java:52) ~[org.eclipse.jetty.util_9.2.9.v20150224.jar:9.2.9.v20150224] ... 3 common frames omitted Caused by: java.io.IOException: Datenübergabe unterbrochen (broken pipe) at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_45] at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_45] at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_45] at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_45] at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_45] at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:170) ~[org.eclipse.jetty.io_9.2.9.v20150224.jar:9.2.9.v20150224] ... 10 common frames omitted

Other bundles can loaded without problems. I have cleaned and refilled the channels before because of assuming some inconsistencies with the local storage or so. This does not help.

Are this race-conditions or problems with concurrent access to channels over p2/tycho?

afischer211 commented 9 years ago

How does pdrone ensure, that the consumer process (Tycho-build) receives a consistent artifact? I have other concurrently running processes (on other Jenkins-slave machines), which deploys artifacts into the channels on the same time. If the channel-aspect recalculates the P2-indexfiles or clean old artifacts, how is the transaction concept ensured?

ctron commented 9 years ago

Well the EofException indicates that the sender had trouble to sends its data. Looking at the root cause (broken pipe) it seems that the remote side has closed the connection (the receiving side). Which could, of course be, due to a timeout situation.

ctron commented 9 years ago

Package Drone has a read/write lock for each channel.

afischer211 commented 9 years ago

That is clear, but what is the reason for the read-timeout (the Tycho-build aborts after timeout and so pdrone can not send any datas to client)? Why needs pdrone sometimes such a long time for providing artifacts (other artifacts are provided quickly). Can I analyze the timing problems deeper? It seems, that pdrone needs sometime a big amount of time for providing an artifact, so that the Tycho-build aborts with read-timeout.

afischer211 commented 9 years ago

The linux-machine (CentOS7 with 64Bit) runs with a load-average of 2,25 Is this problem a tuning-task for the postgresql-database or a memory-bottleneck of pdrone itself (it runs currently with -Xmx5120m, the machine has 10GB of RAM and 2,5GB free RAM)?

ctron commented 9 years ago

This is indeed strange, and should not be the case!

What could be the case, is that due to the read/write lock, one build tries to "read" and the other to "write". If this write lock is granted first, all reads have to wait. This could delay the read and force the reader (Tycho) into a timeout situation.

I will check how this can be traced. But I cannot do it right now.

afischer211 commented 9 years ago

At the moment I have many running build-jobs. In the console-log of pdrone I see many line like: 12:41:17.814 [qtp447351078-129] WARN d.d.pm.p2.internal.servlet.P2Servlet [P2Servlet.java:144] Download plugin: /thirdparty/plugins/org.apache.commons.codec/1.9.0/org.apache.commons.codec_1.9.0.jar What does this warning-message mean?

ctron commented 9 years ago

This is actually not an "warning". I just lowered it to "debug".

afischer211 commented 9 years ago

I can see the long delay on downloading artifacts from pdrone sometimes also with browser. So it seems to be a temporary lock problem / race condition within pdrone.

ctron commented 9 years ago

Well, it is less a problem ;-) in the traditional sense. While the channel is updating, it is also locked for further read calls. If the modify calls take 2 minutes, then the read calls will also be blocked for this time.

It might also be, that a few sequential write operations, might stretch this period even longer.

I guess the only possible way around this, is to provide the "old" channel state for reading, while the new, modified, channel state will be prepared in parallel. I am currently working in the persistence layer, so maybe this is a good use case to add.

afischer211 commented 9 years ago

Yes, this would become more like transactional database (ACID-concept). So I see the old valid & consistent state until completition of the transaction. The existing (possible but not perfect) way is to block all other transactions - so I also ensure validity of read datas.

ctron commented 9 years ago

Well, it still is ACID right now. Just with a sub-optimal locking. Which on the other hand was necessary for the deadlock situation of Postgres and MySQL.

The whole database situation is quite annoying actually. There are a few quirks with JPA, OSGi, and JDBC in general. The file blob store, possible scenarios with S3, Hadoop etc ... so we are currently evaluating a model with does not require a database in the first place. This could be just another use case for that.

afischer211 commented 9 years ago

I have the following situation: 4 build-jobs, which deploys to different product-channels on pdrone. Now I try to download an artifact from another channel with thirdparty-libraries - it needs over 2min for getting the filechooser for selecting the targetfolder for download. So my question is, what blocks pdrone for over 2mins, if the build-jobs deploys to other channels. All channels have the cleanup-aspect together with p2, tycho ... Are the aspects multithreaded or blocks a running aspect also other channels?

afischer211 commented 9 years ago

My hope with pdrone was, that the usage of a database helps for concurrent access on the osgi-artifacts. The simple eclipse p2-repositories on filesystems behind a web-server works stable and quick, but have problems with concurrent access from different jobs (especially the index-files artifacts.xml and content.xml are critical).

afischer211 commented 9 years ago

You can see on my usage-statistics, that we have >10000 artifacts in pdrone distributed in 13 channels. The most problematic channels (seems to be 2) have over 2000 artifacts in itself. Is there a possible performance-problem with more than 2000 artifacts in a channel and activated cleanup- and p2/osgi-aspects?

afischer211 commented 9 years ago

One remark to the problem, which could be helpful: It needs over 2min to download an artifact over URL http://server:8080/p2/common/plugins/de.pcsoft.common.datasystem.base.api/3.10.3.201507271614/de.pcsoft.common.datasystem.base.api_3.10.3.201507271614.jar A download over URL http://server:8080/artifact/74E43A08-CC62-43C9-82BA-626708EB3C77/get runs quickly in the same time!

The first URL is used by Tycho for downloading an artifact, the second is used in the Web-GUI of PDrone.

ctron commented 9 years ago

Hm, ok .. this actually could be different situation. I will check into this.

ctron commented 9 years ago

Actually the P2 adapter has to resolve the artifact, and cannot reference to it. A limitation of the P2 mechanism, but maybe the resolving is not as performant as it could be.

ctron commented 9 years ago

I did upload a new pre-release [1] which uses a different, more efficient way, to list all artifacts in order for resolving the id/version combination of P2.

You should be able to switch back and forth between your current version and this one. And this is the only change. So if you can give it a try, this would help me a lot, since I currently don't have much time.

[1] https://www.wuala.com/ctron/drone/?key=rTnBx1xnFmFD

afischer211 commented 9 years ago

I have installed the new version, the build-jobs are running. A first impression looks good.

ctron commented 9 years ago

Do you see any improvement? I would make a new release then.

afischer211 commented 9 years ago

We can close the issue. Also with 4 concurrently running jobs (over last night) there are no read-problems anymore. Thanks for the quick solution. Can you also integrate the logging-patch (modified logging-level) in the next release?

ctron commented 9 years ago

So I will make a new release.

The 0.10.2 release should have had the log issue already fixed.

ctron / package-drone

read-errors on Tycho-access #120