Cannot create top/bottom knowls for objects whose labels are not valid knowl ids

AndrewVSutherland commented 8 years ago

I was originally going to add an issue for not having top knowls for elliptic curves over number fields, but in fact, even if this functionality was added to ecnf, it wouldn't work, because many elliptic curves have labels that contain characters such as "[" and "]" that are not allowed in a knowl id.

For example, you cannot create the knowl: "ec.nf.[11,5,1]-a1.top"

This is not specific to elliptic curves over number fields, it causes a problem with Sato-Tate groups, and is likely to cause problems elsewhere as well. Anything character that can appear in a label should be able to appear in a knowl (in particular, we definitely need "(", ")", "[", "]", "{", "}" and possibly "^")

JohnCremona commented 8 years ago

I know now that my labels for ideals of imaginary quadratic fields are really bad, and this is another example. (Another is the problem which can arise having those [,] characters in a URL.) The whole issue of labelling ideals is one which we have spent a lot of time on but never actually done anything about. It will not happen this week. The problem raised is only as issue for imaginary quadratics since for all other fields the conductor part of the label has the form N.i where N is the norm and i an index from 0 (and of course the issue is to know what order the ideals of a given norm are in).

AndrewVSutherland commented 8 years ago

@JohnCremona This issue isn't about whether your ideal labels are good or bad, it is simply about the fact that we have imposed restrictions on knowl ids that are incompatible with many of our existing labels (which we are surely not going to change, especially after May 10). We need to remove the restriction on knowl ids that prevents us from creating top/bottom knowls for objects whose labels contain characters not currently permitted in a knowl id. This is not specific to elliptic curves over imaginary quadratic fields, that just happens to be where I first noticed it.

JohnCremona commented 8 years ago

OK. I was looking for the place where the exact restrictions on knowl ids is explained but could not find it. It would actually be quite easy for this purpose to translate a label such as "ec.nf.[11,5,1]-a1.top" into (say) "ec.nf.11.5.1-a1.top" -- though the knwl ID should include the field label too, so an example would be to go from the curve whose full label is 2.0.4.1-[65,18,1]-a1 to the knwol id "ec.nf.2.0.4.1-65.18.1-a1" by just doing id = id.replace(",",".").replace("[","").replace("]","")

AndrewVSutherland commented 8 years ago

All that is required to fix this is to change line 38 of lmfdb/lmfdb/knowledge/knowl.py

Currently we have: "allowed_knowlid = re.compile("^[a-z0-9.-]+$")"

JohnCremona commented 8 years ago

OK -- we need Harald's input before doing that since he may have had a reason.

AndrewVSutherland commented 8 years ago

As another example, I'll note that you also cannot create a top knowl for the object

http://www.lmfdb.org/EllipticCurve/3.1.23.1/89.1/A/1

because the label name 89.1-A1 contains an upper case character. I'm sure there are many similar examples throughout the LMFDB. If top knowls are going to become a standard feature (which I believe they should) we are going to have to either remove the restriction on knowl ids, or come up with some well-defined mapping from labels to allowed knowl ids.

This strikes me as problematic, since such a mapping is not likely to be invertible.

JohnCremona commented 8 years ago

That is true -- Gunnells & Yasaki used upper case for their isogeny class labels, which I had forgotten.

AndrewVSutherland commented 8 years ago

Just so I don't forget, once we get this fixed we should add top/bottom knowls to ecnf pages (I can make this a separate issue if desired).

JohnCremona commented 8 years ago

Yes, please make it a separate Issue. It could even be implemented now, so that if Harald says we can relax the restriction on knowl ids it can go in right away.

davidfarmer commented 8 years ago

I like Cremona's second comment: replace capital letters by lower case, comma by period, other illegal characters by an underscore, and maybe some other transformations.

In many situations knowl IDs become filenames, and there are a lot of characters you don't want in a filename. Even if our system allows more flexible knowl IDs, because the content comes from a database, I would not favor allowing that because it is not portable.

haraldschilly commented 8 years ago

so, besides +1 to @davidfarmer's comment, I have no clear idea. There might be some parsing going on somewhere, but maybe that's for hastags and not for knowls. Besides that, the most important aspect is, that this does end up as an URL path. So, it's usually a good usability touch to treat URLs are case-insensitive. So, I wouldn't recommend that there is an uppercase A in the URL example by @AndrewVSutherland . Maybe, there should be an string.lower() to do this first step of canonicalization when handing that URL. Although, of course, that's a bad idea if there is also a different lower case a.

There is super non-helpful RFC for looking up what a valid URL is: http://www.ietf.org/rfc/rfc3986.txt

tl;dr: I would suggest to do simple string replacements where necessary, e.g. .replace("[", "").lower()... and if you really want to relax the knowl IDs, never ever include magic characters like / ? # =

AndrewVSutherland commented 8 years ago

@davidfarmer I'm not aware of any modern file system that will not allow any suitably quoted printable character in the file name, so I don't see this as an issue.

@haraldschilly I agree that URL's are exactly the right test case, since every label must be converted to a URL, and I would suggest that this is exactly what we should do to make them knowl ids. Below are some examples of objects that have web pages but cannot be given knowls.

http://www.lmfdb.org/EllipticCurve/2.0.11.1/%5B11%2C5%2C1%5D/a/1 http://www.lmfdb.org/SatoTateGroup/1.4.USp%284%29 http://www.lmfdb.org/SatoTateGroup/1.4.G_%7B3,3%7D http://www.lmfdb.org/EllipticCurve/3.1.23.1/89.1/A/1

I note that there are already many LMFDB URLs with uper case letters in them (and they are case sensitve, indeed even http://www.lmfdb.org/EllipticCurve/Q/ is case sensitve).

jwj61 commented 8 years ago

Should we start enforcing that newly designed labels only use characters which do not need encoding when in a url?

davidfarmer commented 8 years ago

I'd like to have only completely safe, no-thought-required labels, especially those which are going to be auto-generated. It is asking for trouble if you have to worry about escaping characters, or upper/lower case. If you have to worry about that then you didn't save yourself any work. Lower casing and a few replacements of "unsafe" characters is relatively foolproof, with the result that all labels only involve: a-z0-9._-

haraldschilly commented 8 years ago

@AndrewVSutherland wrote:

http://www.lmfdb.org/EllipticCurve/2.0.11.1/%5B11%2C5%2C1%5D/a/1 http://www.lmfdb.org/SatoTateGroup/1.4.USp%284%29 http://www.lmfdb.org/SatoTateGroup/1.4.G_%7B3,3%7D

In my eyes, those three URLs are already a good example why using more characters is not a good idea. Long time ago, I had the impression that almost all URLs in the LMFDB should be "nice" and possibly be used to be referenced from papers. I am fully aware, that the escaped characters above are just square brackets and commas, but at the same time they demonstrate how easily you are ending up with a more messy representation. Some non teach-savvy user might just copy/paste the above into a paper ...

JohnCremona commented 8 years ago

On Harald's last comment -- when I realised that this was a side-effect of my labelling system fo ideals in an imaginary quadratic field I was very unhappy, made sure that it did actually work by suitable encoding, and decided that I would have to change those labels. I just have not done so.

The other example is the capital letters which are only used for labels for elliptic curves over one cubic field in the db (data from Gunnells and Yasaki) and it would be very easy to change those to lower case. Best if done before next Tuesday of course ;)

AndrewVSutherland commented 8 years ago

OK, I will change the Sato-Tate labels while I still can (it will mean breaking things briefly). But we still need to decide what to do with the other labels this impacts.

AndrewVSutherland commented 8 years ago

@JohnCremona If you do this, it would be preferable not to break the links on page 75 of the paper http://dx.doi.org/10.1017/fms.2015.33 (i.e. have the ecnf search convert to lower case).

JohnCremona commented 8 years ago

@AndrewVSutherland OK I will not forget that. (But I also don't feel I yet have a mandate for changing the database entries for 4416 elliptic curves over 3.1.23.1....)

AndrewVSutherland commented 8 years ago

After thinking about this some more, I've come around to @jwj61's view, imposing more restrictions on labels is a good thing that could prevent a lot of headaches down the road. I hereby retract the suggestion that we expand the universe of knowl ids and propose that we shrink the universe of labels to those that are valid knowl ids. Requiring all new objects added to the LMFDB to have top knowl links on their home pages will help to enforce this (but obviously we should make it explicit).

(and I will change the Sato-Tate labels in a way that will not break anything -- new code will use new data with new labels, once new code is pushed old data gets deleted).

JohnCremona commented 8 years ago

I can rather easily delete the curves over field 3.1.23.1, edit (upper to lower case) the sou8rce data files and reupload them. Then make a code change so that when labels are entered (into a search box or as a URL) they are converted to lower case -- so as not to break citations in published papers.

More work to do similar for the 5 imaginary quadratic fields though.

jwj61 commented 8 years ago

There are two things we are looking at. One is characters such as commas and parentheses. It looks like we are going to do away with them in labels.

The other is the uppercase/lowercase situation. There, I think the right thing to do when it comes to labels is to let labels have capital letters, but they should

(1) not rely on case (2) accept lower case in urls (3) use lower case versions for top/bottom knowls

Galois groups is a legacy example. I had previously done (1) and (2). I am about to make a pull request for (3) as a proof of concept.

AndrewVSutherland commented 8 years ago

So I think the consensus is that labels should (possibly after converting to lower case) always be valid knowl ids. Given that there is already an issue for ecnf labels (see #1147), which as far as I know are the only current labels that can't be used in knowl ids, I'm going to close this (if anyone knows of other labels that are problematic, please raise an issue for them).

LMFDB / lmfdb

Cannot create top/bottom knowls for objects whose labels are not valid knowl ids #1140