apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.68k stars 1.03k forks source link

we should include checksums for every jar ivy fetches in svn & src releases to verify the jars are the ones we expect [LUCENE-3945] #5018

Closed asfimport closed 12 years ago

asfimport commented 12 years ago

Conversation with rmuir last night got me thinking about the fact that one thing we lose by using ivy is confidence that every user of a release is compiling against (and likely using at run time) the same dependencies as every other user.

Up to 3.5, users of src and binary releases could be confident that the jars included in the release were the same jars the lucene devs vetted and tested against when voting on the release candidate, but with ivy there is now the possibility that after the source release is published, the owner of a domain where these dependencies are hosted might change the jars in some way w/o anyone knowing. Likewise: we as developers could commit an ivy.xml file pointing to a specific URL which we then use for and test for months, and just prior to a release, the contents of the remote URL could change such that a JAR included in the binary artifacts might not match the ones we've vetted and tested leading up to that RC.

So i propose that we include checksum files in svn and in our source releases that can be used by users to verify that the jars they get from ivy match the jars we tested against.


Migrated from LUCENE-3945 by Chris M. Hostetter (@hossman), resolved Apr 04 2012 Attachments: LUCENE-3945_trunk_jar_sha1.patch (versions: 3), LUCENE-3945.patch (versions: 3)

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

1: I know that Ivy attempts MD5 & SHA1 verification by default – but it does that verification against checksum files located on the server, so it only offers protection against corruption in transit, not against files deliberately modified on the server.

#2 i realize that the maintainers of maven repos say "all files are immutable" and that this potential risk of malicious or accidental file changes exists for all maven users – but that's the choise of all maven users to accept that as a way of life. I'm raising this issue only to point out a discrepancy between the "confidence" we use to be able to give people who download src releases, vs what we have currently with ivy.

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Here's a strawman suggestion for adding a SHA1 check to our custom LicenseCheckTask, it was inspired by some of the code in ant's <checksum/> task, but it's not as configurable (only supports one SHA1 and only the simplest file format)

I put this code in LicenseCheckTask because:

If people want to pursue this, We can always refactor – either to generalize the name/description of of "LicenseCheckTask" or to refactor this out into it's own custom task (and do an extra jar crawl).

(NOTE: current patch doesn't actually include the checksums themselves, so as is this will fail the build. If we trust what "ant resolve" is pulling down right now, the files would be trivial for anyone to generate, but i would suggest being a little more diligent and and generating the SHA1s against what was committed in svn up until last week – except obviously where we've actually changed which jar/version we use because of the ivy work)

Comments?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1, we really need to do this. it can also detect ivy cache corrumption.

its really unrelated to actually how we get the jars, thats an impl detail. we should be checking that its the jar we tested against, or fail hard.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If people want to pursue this, We can always refactor – either to generalize the name/description of of "LicenseCheckTask" or to refactor this out into it's own custom task (and do an extra jar crawl).

I don't think we should do an extra crawl, we can just name it DependencyCheckTask. Dependencies need to have licensing information and sha checksums, and 'ant validate' fails if they don't.

asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

reader = new BufferedReader(new FileReader(f));

Isn't this locale-sensitive? I think it should be explicit UTF-8 (or US-ASCII for that matter).

+      String hexStr = Integer.toHexString(CHECKSUM_BYTE_MASK & digest[i]);
+      if (hexStr.length() < 2) {
+        checksum.append("0");
+      }
+      checksum.append(hexStr);

Isn't any of these simpler?

checksum.append(String.format(Locale.ENGLISH, "%02x", CHECKSUM_BYTE_MASK & digest[i]));

or

char [] HEX = "0123456789abcdef".toCharArray();
int v = digest[i];
checksum.append(HEX[(v >>> 4) & 0x0F]).append(HEX[v & 0x0F]);
asfimport commented 12 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Btw. you can also avoid a recrawl by passing a refid of the same fileset to two tasks rather than constructing a new one in each. I don't mind renaming the class either.

asfimport commented 12 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

it can also detect ivy cache corrumption.

I have encountered cache corrumption: AFAICT a truncated jar file (tika-parsers-1.0.jar); killing the cache fixed the problem. We could add a clean-ivy-cache target calling the target Ivy provides.

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Isn't any of these simpler?

FWIW that code was basically verbatim from ant's Checksum.java, but yeah: your way is better

Isn't this locale-sensitive? I think it should be explicit UTF-8

Also verbatim, but it actually caught my eye and I thought it was intentional to deal with line ending differences in diff OSes, but your comment makes me realize thta makes no sense – and BufferedReader chomps all possible endings equally regardless of locale/encoding.

we can just name it DependencyCheckTask.

I think in the long run we should do that, and rename the taskdef, and rename the macrodef, ... but for 3.6 it might be better to keep the changes to a minimum and just add the logic to the existing class and add the SHA1 files but not rename anything.

...updated patch with those changes ... going to start compiling the list of all SHA1s files on trunk

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

bah ... previous patches were broken in that they wouldn't actaully fail the build when checksum failures were detected.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Haven't tested but looks good: I also think its a good idea to keep the old name for now, we can rename later.

unrelated but couldnt help but notice, this file has no license header. This is because the current rat tasks ignore tools/....grrrrr :)

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

previous patch with the addition of SHA1 files for all jars currently used on trunk.

SHA files were generated using...

ant clean resolve 
find -name \*.jar | xargs -L 1 sha1sum | awk '{ system("echo " $1 ">> " $2 ".SHA1") }'
find -name \*.SHA1 | xargs svn add

all tests pass these jars, but one thing i notice is that the classpath logic in most places seems to be based on exclusion instead of inclusion (ie: 'excludes=".txt,.template"' instead of 'includes="*.jar"') so with these SHA1 files added, we now get a metric shit ton of "ZipException: error in opening zip file" warnints logged by junit because it's trying to parse these files as jars

should we fix all these classpaths, or rename the files "*.SHA1.txt" ?

(note: i haven't even looked at the packaging yet to verify if the SHA1 files are being included in the src builds and excluded from the bin builds)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

well ideally i think it would just be .sha1 (lowercase), this matches the extension used for signing our release artifacts...

The reason for exclusion versus inclusion is in the case the jars don't actually exist yet, its important!

but in lucene/common-build.xml i think we should just set up a single libexcludes ant property or something:

**/*.txt,**/*.template,**/*.sha1

and just use that for these filesets

fileset dir="foo" excludes="${libexcludes}"

or whatever. and yeah libexcludes is a crappy name, its just the idea.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

and yeah, we need to test and likely modify the binary dist patterns as you hinted at.

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

revised patch, switches the file names to sha1, and adds a "common.classpath.excludes" property thta i started using everywhere i could find that made sense.

I've verified that this keeps those sha1 files out of the solr war, and so far it looks good for all the tests ... would be helpful if someone with a faster computer could sanity check that (Note: test won't fail if the sha1s are in the classpath – you'll just get a ZipException in the console that you have to grep for)

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

1) fixes a few places ha1 files were still getting used in solr contrib test classpaths 2) fixes the binary releases to exlcude the sha1 files 3) verified that the src releases should work fine (they do an svn export so once the files are commited we should be good) 4) package-local-src-tgz currently includes the sha1 files (but it also currently includes the jars if you don't do clean-jar first ... not really something i'm going to worry about here)

I think we're good to go?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1

I opened #5025 and we can later look at an other similar minor improvements related to that on other issues (like compile-tools, called before validate, likely tries to use the ant.jar)...

some of that we might just have to fix in a later release, at least it wont silently work wrong (worst case you get a zip error).

asfimport commented 12 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Committed revision 1309503. - trunk

rmuir said on irc that he'd work on backporting to 3x for me (going to grab some lunch soon and then get on a plane)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Backported to 3.x

Thanks Hoss!