Closed AndrewVSutherland closed 8 years ago
I know now that my labels for ideals of imaginary quadratic fields are really bad, and this is another example. (Another is the problem which can arise having those [,] characters in a URL.) The whole issue of labelling ideals is one which we have spent a lot of time on but never actually done anything about. It will not happen this week. The problem raised is only as issue for imaginary quadratics since for all other fields the conductor part of the label has the form N.i where N is the norm and i an index from 0 (and of course the issue is to know what order the ideals of a given norm are in).
@JohnCremona This issue isn't about whether your ideal labels are good or bad, it is simply about the fact that we have imposed restrictions on knowl ids that are incompatible with many of our existing labels (which we are surely not going to change, especially after May 10). We need to remove the restriction on knowl ids that prevents us from creating top/bottom knowls for objects whose labels contain characters not currently permitted in a knowl id. This is not specific to elliptic curves over imaginary quadratic fields, that just happens to be where I first noticed it.
OK. I was looking for the place where the exact restrictions on knowl ids is explained but could not find it. It would actually be quite easy for this purpose to translate a label such as "ec.nf.[11,5,1]-a1.top" into (say) "ec.nf.11.5.1-a1.top" -- though the knwl ID should include the field label too, so an example would be to go from the curve whose full label is 2.0.4.1-[65,18,1]-a1 to the knwol id "ec.nf.2.0.4.1-65.18.1-a1" by just doing id = id.replace(",",".").replace("[","").replace("]","")
All that is required to fix this is to change line 38 of lmfdb/lmfdb/knowledge/knowl.py
Currently we have: "allowed_knowlid = re.compile("^[a-z0-9.-]+$")"
OK -- we need Harald's input before doing that since he may have had a reason.
As another example, I'll note that you also cannot create a top knowl for the object
http://www.lmfdb.org/EllipticCurve/3.1.23.1/89.1/A/1
because the label name 89.1-A1 contains an upper case character. I'm sure there are many similar examples throughout the LMFDB. If top knowls are going to become a standard feature (which I believe they should) we are going to have to either remove the restriction on knowl ids, or come up with some well-defined mapping from labels to allowed knowl ids.
This strikes me as problematic, since such a mapping is not likely to be invertible.
That is true -- Gunnells & Yasaki used upper case for their isogeny class labels, which I had forgotten.
Just so I don't forget, once we get this fixed we should add top/bottom knowls to ecnf pages (I can make this a separate issue if desired).
Yes, please make it a separate Issue. It could even be implemented now, so that if Harald says we can relax the restriction on knowl ids it can go in right away.
I like Cremona's second comment: replace capital letters by lower case, comma by period, other illegal characters by an underscore, and maybe some other transformations.
In many situations knowl IDs become filenames, and there are a lot of characters you don't want in a filename. Even if our system allows more flexible knowl IDs, because the content comes from a database, I would not favor allowing that because it is not portable.
so, besides +1 to @davidfarmer's comment, I have no clear idea. There might be some parsing going on somewhere, but maybe that's for hastags and not for knowls. Besides that, the most important aspect is, that this does end up as an URL path. So, it's usually a good usability touch to treat URLs are case-insensitive. So, I wouldn't recommend that there is an uppercase A in the URL example by @AndrewVSutherland . Maybe, there should be an string.lower()
to do this first step of canonicalization when handing that URL. Although, of course, that's a bad idea if there is also a different lower case a.
There is super non-helpful RFC for looking up what a valid URL is: http://www.ietf.org/rfc/rfc3986.txt
tl;dr: I would suggest to do simple string replacements where necessary, e.g. .replace("[", "").lower()...
and if you really want to relax the knowl IDs, never ever include magic characters like / ? # =
@davidfarmer I'm not aware of any modern file system that will not allow any suitably quoted printable character in the file name, so I don't see this as an issue.
@haraldschilly I agree that URL's are exactly the right test case, since every label must be converted to a URL, and I would suggest that this is exactly what we should do to make them knowl ids. Below are some examples of objects that have web pages but cannot be given knowls.
http://www.lmfdb.org/EllipticCurve/2.0.11.1/%5B11%2C5%2C1%5D/a/1 http://www.lmfdb.org/SatoTateGroup/1.4.USp%284%29 http://www.lmfdb.org/SatoTateGroup/1.4.G_%7B3,3%7D http://www.lmfdb.org/EllipticCurve/3.1.23.1/89.1/A/1
I note that there are already many LMFDB URLs with uper case letters in them (and they are case sensitve, indeed even http://www.lmfdb.org/EllipticCurve/Q/ is case sensitve).
Should we start enforcing that newly designed labels only use characters which do not need encoding when in a url?
I'd like to have only completely safe, no-thought-required labels, especially those which are going to be auto-generated. It is asking for trouble if you have to worry about escaping characters, or upper/lower case. If you have to worry about that then you didn't save yourself any work. Lower casing and a few replacements of "unsafe" characters is relatively foolproof, with the result that all labels only involve: a-z0-9._-
@AndrewVSutherland wrote:
http://www.lmfdb.org/EllipticCurve/2.0.11.1/%5B11%2C5%2C1%5D/a/1 http://www.lmfdb.org/SatoTateGroup/1.4.USp%284%29 http://www.lmfdb.org/SatoTateGroup/1.4.G_%7B3,3%7D
In my eyes, those three URLs are already a good example why using more characters is not a good idea. Long time ago, I had the impression that almost all URLs in the LMFDB should be "nice" and possibly be used to be referenced from papers. I am fully aware, that the escaped characters above are just square brackets and commas, but at the same time they demonstrate how easily you are ending up with a more messy representation. Some non teach-savvy user might just copy/paste the above into a paper ...
On Harald's last comment -- when I realised that this was a side-effect of my labelling system fo ideals in an imaginary quadratic field I was very unhappy, made sure that it did actually work by suitable encoding, and decided that I would have to change those labels. I just have not done so.
The other example is the capital letters which are only used for labels for elliptic curves over one cubic field in the db (data from Gunnells and Yasaki) and it would be very easy to change those to lower case. Best if done before next Tuesday of course ;)
OK, I will change the Sato-Tate labels while I still can (it will mean breaking things briefly). But we still need to decide what to do with the other labels this impacts.
@JohnCremona If you do this, it would be preferable not to break the links on page 75 of the paper http://dx.doi.org/10.1017/fms.2015.33 (i.e. have the ecnf search convert to lower case).
@AndrewVSutherland OK I will not forget that. (But I also don't feel I yet have a mandate for changing the database entries for 4416 elliptic curves over 3.1.23.1....)
After thinking about this some more, I've come around to @jwj61's view, imposing more restrictions on labels is a good thing that could prevent a lot of headaches down the road. I hereby retract the suggestion that we expand the universe of knowl ids and propose that we shrink the universe of labels to those that are valid knowl ids. Requiring all new objects added to the LMFDB to have top knowl links on their home pages will help to enforce this (but obviously we should make it explicit).
(and I will change the Sato-Tate labels in a way that will not break anything -- new code will use new data with new labels, once new code is pushed old data gets deleted).
I can rather easily delete the curves over field 3.1.23.1, edit (upper to lower case) the sou8rce data files and reupload them. Then make a code change so that when labels are entered (into a search box or as a URL) they are converted to lower case -- so as not to break citations in published papers.
More work to do similar for the 5 imaginary quadratic fields though.
There are two things we are looking at. One is characters such as commas and parentheses. It looks like we are going to do away with them in labels.
The other is the uppercase/lowercase situation. There, I think the right thing to do when it comes to labels is to let labels have capital letters, but they should
(1) not rely on case (2) accept lower case in urls (3) use lower case versions for top/bottom knowls
Galois groups is a legacy example. I had previously done (1) and (2). I am about to make a pull request for (3) as a proof of concept.
So I think the consensus is that labels should (possibly after converting to lower case) always be valid knowl ids. Given that there is already an issue for ecnf labels (see #1147), which as far as I know are the only current labels that can't be used in knowl ids, I'm going to close this (if anyone knows of other labels that are problematic, please raise an issue for them).
I was originally going to add an issue for not having top knowls for elliptic curves over number fields, but in fact, even if this functionality was added to ecnf, it wouldn't work, because many elliptic curves have labels that contain characters such as "[" and "]" that are not allowed in a knowl id.
For example, you cannot create the knowl: "ec.nf.[11,5,1]-a1.top"
This is not specific to elliptic curves over number fields, it causes a problem with Sato-Tate groups, and is likely to cause problems elsewhere as well. Anything character that can appear in a label should be able to appear in a knowl (in particular, we definitely need "(", ")", "[", "]", "{", "}" and possibly "^")