Hash GL_2 modular forms

jvoight commented 7 years ago

It would be useful to have a hash for GL_2-modular forms. (We used it quite often in the genus 2 work.)

We propose following a scheme like section 4.3 of https://arxiv.org/pdf/1602.03715: something like h(f) = sum_pp cpp*Tr{K/QQ}(a_pp(f)) mod P where K is the Hecke field, pp are primes canonically ordered in a range of norms between 4096 and 8192.

jwbober commented 7 years ago

That seems reasonable, but I don't think we're ever actually going to have that data for most classical modular forms, and we certainly won't have it soon. I have been computing a_p(f) for p < 2000, which is already more than we need to compute L-functions in the range I've been computing, and already takes a long time. I realize that there may be good reasons for skipping the smallest primes when computing this hash, but I wonder how much it really matters. (A hash wouldn't need to be the same for every type of modular form, though.)

Also, I would like to think that eventually a 64-bit hash will be too small, but we're probably pretty far from that being a problem.

AndrewVSutherland commented 7 years ago

There is no reason to force the hash to depend on primes in [2^12,2^13], we did that only for convenience in the early stages of the genus 2 project. I think it is useful to skip the very small primes so that you can compute the hash without having to determine Euler factors that might be particularly difficult to compute in extreme cases. I would be happy with a hash that uses the a_p values for primes p in [2^5,2^9], say.

This should still be more than enough data to reasonably extract 128 bits (I agree that 64 bits is not enough, and if we are going to define a new hash, we may as will move to 128 bits now).

arbooker commented 7 years ago

There are only 86 primes in [2^5,2^9]. Since JB has data up to 2000, how about [100,2000]?

AndrewVSutherland commented 7 years ago

@arbooker I'm thinking more generally about cases where getting a_p's for p up to 512 might be feasible but getting to 2000 might not be. I take it you are worried that 86 a_p's might not be enough to get 128 reasonably distributed bits. I suppose in CM cases a lot of the a_p's might be zero, but I note that there is still information in knowing which a_p's are zero (and our hash function captures this). I also note that when the Hecke field and/or the base field is larger than Q we are getting more information (side note: in J.V.s' suggestion, the hash coefficients c_pp depend on the base field, so we are really defining a different hash function for each base field; of course in the classical case where we will surely have the most data, the base field is always the same).

Do you really think there is a non-negligible chance that we will encounter two distinct L-functions whose a_p's coincide for p up to 512?

arbooker commented 7 years ago

@AndrewVSutherland It's unclear to me how many coefficients one should expect to need. Under GRH, any two eigenforms are distinguished by very few a_p's (O(log^2{N})). However, we're taking traces, which destroys the multiplicative structure, so it might be that Sturm's bound is the more relevant estimate. In light of that, perhaps we should take a sum over n rather than p?

Anyway, we can just compute it and see. I was mostly thinking that we might as well use the coefficients that we have. @jvoight Are there likely to be cases where it's feasible to compute up to 2^9 but not 2000?

Re: hash size, 64 bits has the advantage of being easily coded in C, without the need for gmp. Also, 128 bits would mean storing the hash as a string, right? I'm not saying that these factors are prohibitive, but we should consider whether 64 bits (actually 61 at present) is really insufficient before charging ahead. (Also, I think it's an odd combination to push for a longer hash but skimp on the hashing data.)

AndrewVSutherland commented 7 years ago

@arbooker

Using a_n's would definitely be better in terms of uniqueness but this would force us to compute all the small a_p's, which could be annoying -- recall that we started the hash at larger a_p's specifically so we could compute the hash at an early stage where we hadn't necessarily determined all the bad Euler factors.
Modern C compilers support 128-bit arithmetic, e.g. in gcc include to define __int128. You are correct that in mongo db we would need to store a 128 bit hash as a string, but I don't think that is a big deal -- doing exact match lookups on an indexed string should still be quite fast.
As far as sufficiency, with 61 bits the Birthday paradox says that even in the best case (perfectly uniform hash function) we will have a better than 50/50 chance of collisions with around 2 billion L-functions. This may be more L-functions than we are every going to want to store, but the chance of collision will be non-negligible much sooner (even with the 10-20 million L-functions we currently have it is not unthinkable). I guess the trade-off is between accepting that we may get a few collisions and being prepared to deal with it or using a hash that is large enough to make it astronomically unlikely.

arbooker commented 7 years ago

@AndrewVSutherland

We can do the same thing with a sum over n, i.e. restrict to (n,Q)=1, where Q is the product of small primes.
We would need a 128-bit mulmod. Doesn't that require a 128x128 -> 256 multiply? (Of course there are tricks if P is a Mersenne prime, but already by that point I would think that calling gmp is easier.) In my implementation, I'm already using __int128 to compute the 61-bit hash.
This hash does not distinguish individual forms in a Galois orbit. Modulo Galois, I think we will have far fewer than a billion forms.

AndrewVSutherland commented 7 years ago

@arbooker Sorry, I clearly need another cup of coffee -- yes we would need a 128-bit mulmod and yes it would be a pain to do that in C (but I don't see requiring GMP to be a big deal, heck, you need GMP to guild gcc). And you make a good point about the number of L-functions modulo Galois being smaller.

Regarding the a_n, if we used [2^5,2^9] restricting to (n,Q)=1 would be the same as restricting to n prime (and for the interval [100,2000] you suggested). Maybe we should consider [2^3,2^9]; that would give us 135 a_n's.

jwbober commented 7 years ago

For our purposes, two separate 64-bit hashes should be just as good as one 128-bit hash, I think.

I'm not exactly sure what we are looking for here. For hyperelliptic curve search purposes, it is important that it can be computed very quickly, and some collisions are ok. Is that what we want, or are we really trying for some globally unique identifier? (Maybe two separate 64-bit hashes solves both problems?)

On Wed, Jun 21, 2017 at 9:14 AM, Andrew Sutherland <notifications@github.com

wrote:

@arbooker https://github.com/arbooker Sorry, I clearly need another cup of coffee -- yes we need a 128-bit mulmod and yes it would be a pain to do that in C (but I don't see requiring GMP to be a big deal, heck, you need GMP to guild gcc). And you make a good point about the number of L-functions modulo Galois being smaller.

Regarding the a_n, if we used [2^5,2^9] restricting to (n,Q)=1 would be the same as restricting to n prime (and for the interval [100,2000] you suggested). Maybe we should consider [2^3,2^9]; that would give us 135 a_n's.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/LMFDB/lmfdb/issues/2123#issuecomment-310074335, or mute the thread https://github.com/notifications/unsubscribe-auth/ABXj4-UvDoZHC7gn9kji1RenE9FWyCtDks5sGRcqgaJpZM4N8ePF .

davidfarmer commented 7 years ago

Not sure exactly what the L-function hash will be used for, but a collision on the hash of coefficients is very likely to be resolved by looking at the functional equation data. (Conductor, Gamma-factors, etc).

AndrewVSutherland commented 7 years ago

@jwbober Two 64-bit hash functions is a good idea. We should make one as easy to compute as possible, with the expectation that in some cases there may not be enough ap's available to compute the second.

@davidfarmer The original intention of the hash (which was for our internal use) was to facilitate provisional matching of L-functions (e.g. to identify isogenous Jacobians, or to recognize isogeny factors) in the early stages where we might not yet know the conductor, root number, or some bad Euler factors (although we typically would know things like the Gamma-factors, degree, weight, etc..., that depend only on the "type" of the L-function (e.g. we know it is the L-function of a genus g curve)). But we have continued to find the hash useful in the production database.

For example, the L-function of the elliptic curve http://www.lmfdb.org/EllipticCurve/2.0.3.1/324.0.18/a/ is not currently in the LMFDB. But one can easily compute its hash, which is 1904600861308595235. If you go to http://www.lmfdb.org/Genus2Curve/Q/ and type #1904600861308595235 into the search by label box you will find the genus 2 curve http://www.lmfdb.org/Genus2Curve/Q/2916/a/ whose L-function has the same hash (in fact it's Jacobian is isogenous to the restriction of scalars of the elliptic curve above, something we can now prove but couldn't at the time we original computed these hashes).

So if a researcher has an arithmetic L-function that is not obviously in the LMFDB (e.g. it is the L-function of some strange motive they cooked up), if they can compute its hash they can then check to see whether there is an object in the LMFDB that appears to have the same L-function. To facilitate this use case, we would ideally like a way to provisionally identify matching L-functions even when the researcher doesn't yet know much about their L-function (in particular, they might not know what the conductor is, or be able to compute any zeros or special values, but if they can match it to an L-function in the LMFDB, they can at least provisionally obtain this information).

arbooker commented 7 years ago

I would prefer to think of this as a hash for modular forms (or maybe motives) rather than L-functions. Accordingly, it could be an attribute of a Galois conjugacy class of modular forms or isogeny class of elliptic curves, etc., but would not appear in the L-functions collection. For L-functions, @davidfarmer and I came up with a different sort of hash based on zeros, and I'd like to stick with that.

AurelPage commented 7 years ago

@arbooker Does the trace really destroy everything? Doesn't it just take the product of the conjugates of the L-function, so the same bound applies? From a Galois point of view, this is just viewing a Galois representation of dimension n defined over a field of degree d (say over Q_p), and seeing it as a Galois representation of dimension nd (defined over Q_p). Then we should still expect that O(log^2 N) Trace(a_p) caracterise this representation of dimension nd.

AndrewVSutherland commented 7 years ago

@arbooker I don't care what we call it, and I am not suggesting that it is in any way canonical or should supersede the L-function hash based on zeros (which is far more general), but I think it does make sense to have some sort of "arithmetic hash" based on a_p values that is stored in the Lfunctions collection, at least for L-functions with an Euler product (what I think David calls "arithmetic" L-functions) where it makes sense, so that we have a centralized way to search for it -- one can then browse all the instances of the L-function to find the corresponding arithmetic/analytic objects. We don't want to force the user to do a separate search for the hash in every collection of objects that have an arithmetic L-function.

arbooker commented 7 years ago

@AurelPage The problem with this argument is that the O-constant need not be uniform in the dimension, and in fact it might scale quite badly. Yes, we could multiply out all the Galois conjugate L-functions and apply Rankin-Selberg+GRH, but the product has conductor N^d, not N. That's probably worse than the Sturm bound for large d.

AurelPage commented 7 years ago

@arbooker I see, thanks for the explanation!

arbooker commented 7 years ago

@AndrewVSutherland Sure, there's no harm in having an optional extra field in the L-functions collection.

We'll need to change the proposed definition if we want to continue using the hash to match genus 2 curves and elliptic curves over quadratic fields. (The current proposal views the latter as an Euler product over Q(sqrt(d)) rather than Q.) Is that preferable?

edgarcosta commented 7 years ago

I think having an arithmetic hash and analytic hash for L-functions would be a good solution.

From what I understand, L-functions on the LMFDB are always taken as an euler product over QQ, thus from the L-function perspective, it seems more natural to have an hash that thinks of the euler product over QQ. However, if we go down that alley, it might be nontrivial to distinguish between galois conjugates.

jvoight commented 7 years ago

@jvoight Are there likely to be cases where it's feasible to compute up to 2^9 but not 2000?

The Hecke operators for GL2 are O(Npp) for fixed field degree; so given a form it is only about four times as much work to compute T{pp} on the order of N(pp)=2^9 vs. N(pp)=2000. I'd like to think that we'll have at least this much data for any (Hilbert) modular form in which we hope to compute something about the L-function, and I suppose it's not strictly necessary that we be able to compute the hash for those objects where there is a dramatic paucity of Hecke data.

edgarcosta commented 5 years ago

We have a column trace_hash form CMFs. We supposedly populate this column whenever we have exact data, i.e.,

dim <= 20 for weight > 1.
and any form for weight 1*

The last isn't true at the moment

SELECT COUNT(*) from mf_newforms where trace_hash is Null and weight = 1;
 count
-------
  1580
(1 row)

@AndrewVSutherland, do you know why?

AndrewVSutherland commented 5 years ago

@edgarcosta Because we did not compute exact eigenvalues for a_n with n in [2^13,2^14] for the A4,S4,A5 forms. These could be added without too much effort, but it will take some time -- it might be easier to just wait until @jwj61 computes all the Artin reps for these (if that is likely to happen soon).

roed314 commented 6 days ago

Can this be closed? We have trace_hash for CMFs.

AndrewVSutherland commented 6 days ago

The trace hash is still missing for 1389 non-dihedral weight 1 forms, but I will open a separate issue for this.

AndrewVSutherland commented 6 days ago

@edgarcosta I don't see the claim you reference about when trace hashes are available in the completeness knowl for CMFs, so I think it is okay for this data to be missing. I've created a new issue (tagged as a feature request) to fill this in for the 1385 non-dihedral weight-1 forms of dimension <= 20 where it is missing (the count of 1389 above includes 4 weight-1 newforms of dimension > 20, it probably isn't any harder to compute the trace hashes for those, since we compute exact eigenvalues for all weight-1 forms regardless of their dimension).

LMFDB / lmfdb

Hash GL_2 modular forms #2123