denio7 / egit

Automatically exported from code.google.com/p/egit

Errors in JGit jar cause fault on Solaris? #95

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Install Eclipse 3.4.1 or 3.5RC3.
2. Install JGit from http://www.jgit.org/update-site/
3. Invoke File->Import, Git->"Git Repository".
4. Specify URI=git@github.com:baiker/nb_ads.git; protocol=git+ssh.
5. (Contact me to set up github access.)
6. 'Import projects after clone' should be selected; selecting 'Next' will
start the repository download.
7. Observe SIGBUS on Solaris 10. On WinVista, observe a native exception in
the Eclipse log; Eclipse proceeds (apparently) normally.

What is the expected output? What do you see instead?
On Solaris 10, Eclipse core dumps.  JVM crash log attached.

What version of the product are you using? On what operating system?
This began after installing the integration build from the update site on
24 May.  The last EGit update installed on 5 June also produced the problem.

Please provide any additional information below.

Original issue reported on code.google.com by jwbito on 8 Jun 2009 at 6:03

GoogleCodeExporter commented 8 years ago
A Java program cannot crash a properly working JVM, so as I see it the bug must
be in the JVM. Another possible location is one of the shared libraries Eclipse
uses that contain native code. I'm pointing fingers at the JVM (or Eclipse).

If it's a JVM bug you should be able to reproduce it without Eclipse.
You could also try reinstalling your JDK / Eclipse to see if it is somehow broken.

JVM bugs are reported to Sun. You could also try another JVM.

Original comment by robin.ro...@gmail.com on 11 Jun 2009 at 4:07

GoogleCodeExporter commented 8 years ago
I've checked the system logs and found no hardware fault records.  The core
dump occurs with JDK 1.6.0_12-b04 & 1.6.0_13-b03 and Eclipse 3.4.2
(M20090211-1700) & 3.5 RC3 (I20090528-2000) & 3.5 RC4 (I20090605-1444).  I
submitted the JVM dump log on java.sun.com.  I submitted a defect on Eclipse
RC3; they resolved it saying 'not eclipse' and recommended sending the problem
to JGit, since the frame at the fault was in
org.spearce.jgit.lib.OffsetCache.getOrLoad

Running the same update of EGit on WinVista with Eclipse 3.4.2 and JDK
1.6.0_12-b04, I see two native method exceptions in the log.  On
WinVista, Eclipse runs without any message in the UI, though.

I've uninstalled Egit, and now the same workspace opens and I can do work on
Solaris.  I'll happily try to find the cause of this problem.  So far the
symptom is triggered only by Egit.  At the moment, I can't think of any further
useful areas for investigation, so please let me know if there's something
you'd like to see.

Are the native mode faults I see logged on Windows (shown in the attached log)
of no concern?

Original comment by jwbito on 11 Jun 2009 at 2:39

GoogleCodeExporter commented 8 years ago
org.spearce.jgit_0.4.0.200903200852 is an old version. If it's broken it won't
be fixed. You have to try a new version. The latest integration build is from
20090514. Nevertheless, any brokenness should not result in a JVM crash unless
the JVM itself is broken, either by native code in Eclipse (unlikely but
possible) or by something wrong (or faulty) in the JVM. JGit plays with the
garbage collector here, so it may well find a new bug in the JVM.

Try to build jgit manually using make_jgit.sh and then do a
./jgit clone your_url destdir and see if that breaks too.

Original comment by robin.ro...@gmail.com on 11 Jun 2009 at 9:14
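
One way to judge whether a command-line run like the one suggested above ended in a JVM crash is to look at the exit status and at whether HotSpot left a fatal error log (hs_err_pid*.log) behind. The following is only a minimal sketch for a POSIX shell; the function name classify_run is hypothetical, and it assumes the working directory held no older hs_err files before the run.

```shell
# classify_run DIR CMD [ARGS...]   (hypothetical helper, not part of jgit)
# Runs CMD inside DIR and prints one word describing the outcome:
#   crash - exit status above 128 (the process died from a signal such as
#           SIGBUS or SIGABRT), or an hs_err_pid*.log file exists in DIR
#   error - the command exited with an ordinary non-zero status
#   ok    - the command exited with status 0
classify_run() {
    dir=$1
    shift
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$dir" && "$@" ) >/dev/null 2>&1
    status=$?
    if [ "$status" -gt 128 ] || ls "$dir"/hs_err_pid*.log >/dev/null 2>&1; then
        echo crash
    elif [ "$status" -ne 0 ]; then
        echo error
    else
        echo ok
    fi
}
```

It could be called as `classify_run "$PWD" ./jgit clone your_url destdir`, with your_url and destdir as placeholders exactly as in the comment above.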

GoogleCodeExporter commented 8 years ago
Thanks for pointing out the discrepancy with the jar file versions.  The
install on WinVista (where EGit is working for me) had a couple of different
versions of EGit jars. Now that I've removed the old jars, there's no native
method exception on WinVista.

Now I have the same jars on both Solaris & WinVista:
org.spearce.egit.core_0.4.0.200906011726.jar
org.spearce.egit.ui_0.4.0.200906011726.jar
org.spearce.egit_0.4.0.200906011726.jar
org.spearce.jgit_0.4.0.200906011726.jar

The jgit CLI is able to clone the repository (please see below for a patch to
make_jgit.sh).  When I use Egit's 'import repository' on Solaris, it appears to
clone the git repository and core dumps before populating the working
directory. (The progress bar in Eclipse says 'checking out files'.)  I also get
a core dump if I try to open the workspace containing a project that was cloned
by egit before this problem cropped up (when I updated on May 14).  I suppose I
could go back to a version before that date to confirm that the problem doesn't
occur.  Do you think that would be useful?

=== Suggested patch to start CLASSPATH with '.'===
diff --git a/make_jgit.sh b/make_jgit.sh
index 2969e6e..ba5f6c7 100755
--- a/make_jgit.sh
+++ b/make_jgit.sh
@@ -58,15 +58,10 @@ then
 fi
 VN=`echo "$VN" | sed -e s/-/./g`

-CLASSPATH=
+CLASSPATH=.
 for j in $JARS
 do
-       if [ -z "$CLASSPATH" ]
-       then
-               CLASSPATH="$R/$j"
-       else
-               CLASSPATH="${CLASSPATH}${PSEP}$R/$j"
-       fi
+       CLASSPATH="${CLASSPATH}${PSEP}$R/$j"
 done
 export CLASSPATH

Original comment by jwbito on 12 Jun 2009 at 8:25
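
The effect of the suggested patch can be sketched in isolation: because the classpath starts with '.', every jar can be appended unconditionally with the separator in front of it. The function name build_classpath is hypothetical; its arguments stand in for the R, PSEP and JARS variables visible in the diff, and the values in the usage note are made up for illustration.

```shell
# build_classpath ROOT SEP [JAR...]   (hypothetical helper for illustration)
# Prints a classpath that begins with "." followed by ROOT/JAR for each jar,
# joined by SEP (":" on Unix-like systems, ";" on Windows).
build_classpath() {
    root=$1
    sep=$2
    shift 2
    cp=.
    for j in "$@"; do
        # No empty-string check is needed: cp is never empty here.
        cp="${cp}${sep}${root}/${j}"
    done
    printf '%s\n' "$cp"
}
```

For example, `build_classpath /opt/egit : a.jar b.jar` prints `.:/opt/egit/a.jar:/opt/egit/b.jar`.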

GoogleCodeExporter commented 8 years ago
A clarification: the Eclipse Git Plugin 0.4.0 (published on the update site as
Release Build) does not cause this problem.

I'd like to help resolve it, but I could really use some advice as to practical
next steps.  Would it make sense to try to identify the change that causes the
problem to occur?  So far, I've been able to confirm that 0.4.0.200904240032
and 0.4.0 work without a problem.

Original comment by jwbito on 15 Jun 2009 at 8:14

GoogleCodeExporter commented 8 years ago
git bisect is a very good tool to search for problems. 

See man git-bisect for details.

git checkout a bad version
git bisect start
open eclipse with this egit checkout
select the ui plugin and then select Run As Eclipse application
test it
if it fails => git bisect bad
if it works => git bisect good
exit the test Eclipse (the one you launched)
git checkout a known good version
test it the same way and mark as good/bad.
The second time and onward git will automatically check out a new version for
you to test.

There is a chance that this bug will not show up when run this way, but you can
always hope for it.

Normally Eclipse will pick up the changes in the workspace, but to make sure
please perform a refresh before re-launching a new version of Eclipse to test.

You should have to test less than a dozen versions before the version that
triggers the Eclipse/JVM bug pops up.

Original comment by robin.ro...@gmail.com on 15 Jun 2009 at 9:15

GoogleCodeExporter commented 8 years ago
When I import the repository in the eclipse that's launched from Run As, it
gets a variety of NullPointerExceptions.  There was a message complaining that
the egit projects depend on JRE J2SE-1.5, but that isn't available.

A variety of NullPointerExceptions occur whether the egit plugins are built at
the head or at v0.4.0 (which is working OK in the eclipse install) - there is
no core dump.

Would it make sense to see how it behaves if I export the plugins to another
eclipse install?

Thanks for your suggestions!

Original comment by jwbito on 16 Jun 2009 at 2:26

GoogleCodeExporter commented 8 years ago
I was able to test by building the egit feature in Eclipse 3.5RC4, exporting it and
installing the feature.  The git bisect process yields:
bash-3.00$ git bisect good
2d77d30b5f5eca2b3087f1bab47fa9df2e64cd71 is first bad commit
commit 2d77d30b5f5eca2b3087f1bab47fa9df2e64cd71
Author: Shawn O. Pearce <spearce@spearce.org>
Date:   Wed Apr 29 11:54:46 2009 -0700

    Rewrite WindowCache to be easier to follow and maintain

    The integration of WindowCache, ByteWindow, PackFile and WindowCursor
    was a spaghetti of code that was impossible for even the original
    author (me) to follow.  Due to the way the responsibility for the
    PackFile's open RandomAccessFile "fd" was distributed between these
    four classes I could no longer prove to myself that the fd wouldn't
    be closed while it was being accessed by another thread.

    This rewrite generalizes most of the cache logic into a new class,
    OffsetCache.  The hope is that we can later reuse this code to make
    a rewrite of UnpackedObjectCache, which uses similiar caching rules
    as WindowCache, but applies a different hash function.  That rewrite
    is deferred to another change, but is anticipated by this one.

    The new OffsetCache class uses the Java 5 atomic APIs to create a
    much more concurrent hash table than we had before.  We can now
    perform no-miss reads without taking any locks.  Reads that do
    miss acquire a lock in order to prevent concurrent threads from
    performing duplicate work loading the same window from disk,
    however concurrent reads of different windows is still permitted.

    Due to the more concurrent nature of the OffsetCache, it is now
    possible for the cache to temporarily overshoot its resource limits.
    This is a small temporary overshoot that is roughly bounded by the
    number of concurrent threads operating against the same cache.

    The API of the ByteWindow subclasses is now simplified by removing
    the base class of SoftReference.  It was a horrible idea to pass
    the byte[] or MappedByteBuffer down through the call stack when the
    implementation knew what type it should be operating on.  We now
    instead use a more traditional OO pattern of allowing the subclass
    to directly specify its referent.

    Responsibility for the RandomAccessFile "fd" within PackFile is now
    strictly within PackFile.  Two open reference counts track how the
    callers are using the fd, ensuring that the fd remains open, so long
    as the caller has made the appropriate begin*() invocation prior
    to data access.  One counter, beginWindowCache() is exclusively
    for the ByteWindows created by WindowCache.  Another counter,
    beginCopyRawData(), is exclusively for PackWriter's need to lock
    the PackFile open while it performs object reuse.

    To keep the code simple a WindowCache.reconfigure() now discards the
    entire current cache, and creates a new one.  That invalidates every
    open file, and every open ByteWindow, and forces them to load again.

    Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
    Signed-off-by: Robin Rosenberg <robin.rosenberg@dewire.com>

:040000 040000 9a7427cc5c0d42573f23c660ca08240d5b5a0e71 bfee62c6cd4c9f9b6bb359c7799540f039478732 M      org.spearce.jgit.test
:040000 040000 1e23a88b41ce340519ffcc88f319115e592dca0b 388a17d9d4cc18323c8d6c934ab4c04cd5cf8040 M      org.spearce.jgit

Original comment by jwbito on 20 Jun 2009 at 5:53

GoogleCodeExporter commented 8 years ago
Cut things down to the minimum necessary. OK, I'm not surprised, but regardless
of what we do here, this changes nothing regarding where the bug is. Assuming
there is a bug of some sort in JGit here, you must still have a bug in the JVM
(or Eclipse) for the JVM to crash here.

You can try to submit this to Sun, as you claim it is repeatable. You need to
package it neatly into something they can just grab and run and see for
themselves within a minute or two, and it should really eliminate Eclipse from
the equation to avoid the blame game.

Cutting this down to an effective bug report will be an interesting exercise.

Original comment by robin.ro...@gmail.com on 22 Jun 2009 at 10:18

GoogleCodeExporter commented 8 years ago
The response by Robin is a bit off-putting.  If the crash weren't repeatable, I
wouldn't have tracked down the commit using git bisect.

It's not my code that's crashing Eclipse, so I would sincerely hope that the
problem is one that 'we' would be working to isolate together.  I've certainly
appreciated the guidance thus far and have done my best to follow it.  I'm
definitely interested in contributing to the improvement of EGit.  Today, Egit
doesn't work on Solaris 10 with any of 3 different Java 6 runtimes and Eclipse
3.4.2 or 3.5 (RC3 & RC4).

You mention JGit.  Would it be useful (and feasible) to run a JGit test inside
Eclipse and outside Eclipse to see if that helps to locate the problem?  The
jgit clone test you suggested before did not cause the problem, although
cloning with egit crashes before it populates the working directory. (I don't
know for sure, but it appears to have fully populated the local (.git) repo.)

Right now, I'm trying to use the latest integration build (200906160801) to
clone http://repo.or.cz/r/egit.git and it seems to be hung on
"Get pack-e45a866..idx: 8% ( 42/514)".  Please keep in mind that this is
Solaris 10 on sparc.  On WinVista, I haven't seen a problem since you pointed
out that there was an old egit jar in my plugins.

Thanks again for your guidance in trying to resolve this problem.
John

Original comment by jwbito on 22 Jun 2009 at 11:30

GoogleCodeExporter commented 8 years ago
Note that I cannot fix bugs in the SPARC JVM. Sun can, and there is a link in
the hs_err file to where you submit bug reports directly to Sun. Include all
information about versions and URLs for necessary downloads and see what
response you get.

Running JGit from inside eclipse / outside probably won't affect the outcome,
but then I have no idea what triggers the bug in the JVM.

Original comment by robin.ro...@gmail.com on 27 Jun 2009 at 9:54

GoogleCodeExporter commented 8 years ago
Thank you for responding, Robin.  As I noted in my message to the group
<http://thread.gmane.org/gmane.comp.version-control.git/122265>, I have
reproduced the crash on all modern Sun JVMs and submitted the crash log to Sun
(two on the site referenced in the hs_err file and two on bugs.sun.com).

I tried producing the crash with jgit from the command line and that was able to
complete.  I'll try it with the JVM versions that I got since I tested it last.
The crash also occurs using Egit import on git://repo.or.cz/egit.git.

If you wish to leave this as a known issue with Egit 0.4.9, I cannot argue with
you.  I was hoping that you'd be willing to work with me to find a workaround
and/or a specific test case that would motivate the Java team to accept it as a
bug.

I would speculate that the problem is related to one of the bugs in nio that
cause errors when the code tries to access data on non-aligned boundaries
<http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=2172587>.  These can't
happen on x86 hardware, since it (generally) allows non-aligned access.

Original comment by jwbito on 27 Jun 2009 at 5:47

GoogleCodeExporter commented 8 years ago
In general you don't (read: should not) work around bugs. Bugs should be fixed.
Workarounds are a last desperate resort. 

The sorry state of much of the software around us comes from working around bugs
instead of fixing them. After a while you get a mess of workarounds for
workarounds.

It gets more complicated when the bug is in software we cannot (?, e.g.
OpenJDK) fix, but the principle should be the same.

If you want to work around it as an intermediate measure you can try reverting
the commit that you identified. That may be non-trivial, but it is probably not
very hard. That revert will not be part of JGit, however.

We can keep this open until we can attach a reference to a Sun bug report.

Original comment by robin.ro...@gmail.com on 2 Jul 2009 at 3:22

GoogleCodeExporter commented 8 years ago
I have verified that the problem is also caused by the org.eclipse
(0.5.0.200908141101) version of the plugin with JDK 1.6.0_16 and also JDK 1.7
milestone 4 running in Eclipse 3.5.

Original comment by jwbito on 21 Aug 2009 at 6:06

GoogleCodeExporter commented 8 years ago
Any version after the commit identified will probably trigger the JVM bug. Did
you send a bug report to Sun?

Original comment by robin.ro...@gmail.com on 21 Aug 2009 at 9:01

GoogleCodeExporter commented 8 years ago
As I mentioned earlier, I've submitted a number of reports, including for
1.6.0_10, 1.6.0_16 and a couple of versions of 1.7 including Milestone 4 (which
is what was available this morning).

It's not surprising that Sun hasn't looked at it; I doubt they consider Eclipse
to be a small test case.

As I've said before, I'd like to help get to the specific code that has the
problem, but I'd need some clear tasks.

Original comment by jwbito on 21 Aug 2009 at 9:10

GoogleCodeExporter commented 8 years ago
Hey - I'm able to use plugin version 0.5.0.200908282229 on SPARC (Solaris 10)
with the EA release of 1.6.0u18!

Original comment by jwbito on 31 Aug 2009 at 5:18

GoogleCodeExporter commented 8 years ago
Good. Please close.

Original comment by robin.ro...@gmail.com on 31 Aug 2009 at 6:58

GoogleCodeExporter commented 8 years ago
I think closing the issue requires a privilege I don't have.  I added a comment
in the Egit wiki to let folks know they may encounter a problem on sparc.

Original comment by jwbito on 31 Aug 2009 at 9:25

GoogleCodeExporter commented 8 years ago
Not our bug.

Original comment by robin.ro...@gmail.com on 2 Sep 2009 at 7:47