LMFDB / lmfdb-inventory

inventory of the lmfdb database

Add db-abvar to the cloud #54

Closed roed314 closed 6 years ago

roed314 commented 7 years ago

In https://github.com/LMFDB/lmfdb/pull/2208 we suggest removing the beta status from isogeny classes of abelian varieties over finite fields. John Jones said that we should create an issue here to ensure that the corresponding data is accessible for the website.

AndrewVSutherland commented 7 years ago

Done -- the abelian varieties pages now appear to work on the cloud (see http://www.lmfdb.org/Variety/Abelian/Fq/).

The performance is noticeably slower in several places, which I'm guessing has to do with not having indexes set up to support some of the queries (e.g. browse by dimension/Fq-card). The cloud server uses the WiredTiger storage engine, which makes unindexed queries much more expensive (it basically has to decompress every document in the collection when searching).

I suggest playing around with the pages on the cloud and adding some additional indexes to support queries that are slow (if you add the indexes on the beta mongo db they will automatically get replicated to the cloud whenever I update).

Also, I noticed that the stats collection is named "fq_isog_stats", which doesn't follow our usual convention of using fq_isog.stats (which means, for example, that the collection is listed on http://www.lmfdb.org/api/ whereas it would normally be hidden unless you go to http://www.lmfdb.org/api/all). I suggest copying this collection to fq_isog.stats and changing the code to use this name.

roed314 commented 7 years ago

I've copied the collection to fq_isog.stats and updated the pull request with a commit which uses it. Presumably the old collection should be deleted at some point, but I'm going to hold off on that until the code gets updated to use it.

Is there a way to tell which fields have indexes set up? I'm fairly confident that we have indexes on g and q, but I agree that page loads are quite slow....

JohnCremona commented 7 years ago
.index_information()
AndrewVSutherland commented 7 years ago

You can use the pymongo method index_information() (see https://api.mongodb.com/python/current/api/pymongo/collection.html?highlight=index_information#pymongo.collection.Collection.index_information)

sage: lmfdb.base.getDBConnection().abvar.fq_isog.index_information()

{u'A_counts_1': {u'key': [(u'A_counts', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'C_counts_1': {u'key': [(u'C_counts', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'_id_': {u'key': [(u'_id', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'decomposition_1': {u'key': [(u'decomposition', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'known_jacobian_1': {u'key': [(u'known_jacobian', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'label_1': {u'key': [(u'label', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'p_rank_1': {u'key': [(u'p_rank', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'polynomial_1': {u'key': [(u'polynomial', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'principally_polarizable_1': {u'key': [(u'principally_polarizable', 1)], u'ns': u'abvar.fq_isog', u'v': 1},
 u'slopes_1': {u'key': [(u'slopes', 1)], u'ns': u'abvar.fq_isog', u'v': 1}}

It looks to me like there are not indices on either g or q.
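To check this programmatically rather than by eye, a small helper can pull the leading key of each index out of the index_information() dict (a sketch; the sample below is a trimmed-down stand-in for the output above):

```python
def indexed_fields(index_info):
    # For each index spec, take the first field of its key list -- the only
    # position that a single-field query can use.
    return sorted({spec["key"][0][0] for spec in index_info.values()})

# Trimmed sample in the shape of pymongo's index_information() output:
sample = {
    "_id_": {"key": [("_id", 1)], "v": 1},
    "p_rank_1": {"key": [("p_rank", 1)], "v": 1},
    "slopes_1": {"key": [("slopes", 1)], "v": 1},
}

print(indexed_fields(sample))
print("g" in indexed_fields(sample))  # g is not indexed here
```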

To create indices on "g" and "q" you can just use

lmfdb.base.getDBConnection().abvar.fq_isog.create_index("g")
lmfdb.base.getDBConnection().abvar.fq_isog.create_index("q")

For further details see https://api.mongodb.com/python/current/api/pymongo/collection.html?highlight=create_index#pymongo.collection.Collection.create_index

roed314 commented 7 years ago

Thanks! I'll do that now.

roed314 commented 7 years ago

Thanks; I've created the indices. We'll poke around and see if there are any reasonable queries that still feel sluggish.

AndrewVSutherland commented 7 years ago

I updated the fq_isog collection on the cloud so it picked up the indices on g and q that you added (the corresponding queries are now much snappier, e.g. http://www.lmfdb.org/Variety/Abelian/Fq/1/27/ now loads very quickly).

edgarcosta commented 7 years ago

When running the tests on abvar there are still some queries that run slowly and show up in the log (usually queries faster than a certain threshold, which I think is 100ms, do not show up in the log). Here is a sample: https://gist.github.com/anonymous/5cc58a5ef1c2950e6e23c2a75cc60ca9 I'm not sure all of these can be fixed by adding indexes...

roed314 commented 7 years ago

How should I interpret these? I see a bunch of repetitions of the same query, with what looks like increasing durations at the end. Is it the same query repeated multiple times, or is this showing some other behavior? David


edgarcosta commented 7 years ago

Hi David,

This is indeed confusing.

1- Feel free to ignore repeated queries. I basically grepped the MongoDB log for abvar, cut the timestamps away, and sorted the lines to group things together; that is why you see repetitions of the same query with what looks like increasing durations at the end. Some queries also repeat because I ran the tests on abvar multiple times (I was trying to track down another bug).

2- Each line is a report on a command that took longer than a certain threshold. A good rule of thumb is that each query should involve at least one indexed key. One might need compound indexes when a single key isn't selective enough.

3- In some of the queries you will see "sort: { label: 1 }". That is me sorting your search results by label. At the moment they aren't sorted, which breaks some of the tests when one doesn't read from the server at Warwick. Most likely tomorrow I will have a pull request ready which adds the sort and fixes the tests.

Cheers, Edgar


roed314 commented 7 years ago

Some of these queries might have been from before we added the index on q and g. Also, is there a good way to have indices on things like point counts (where the field is a list and we're constraining specific entries in the list)? David


edgarcosta commented 7 years ago

One more thing. The IXSCAN (and perhaps also the COLLSCAN) entries tell you which index the query ended up using...

That is what I got from here: https://docs.mongodb.com/manual/reference/explain-results/


roed314 commented 7 years ago

I see. So it looks like some of the slow queries were in fact using indices. I'm happy to update things if anyone has suggestions. In particular, I'm about to make data changes to this collection, so if there are suggestions on how to restructure the layout I'm open to it.

The current layout is:

0 label: string
1 g: int
2 q: int
3 polynomial: list of ints
4 angle_numbers: list of floats
5 angle_ranks: int ("" if unfilled)
6 p_rank: int
7 slopes: list of strings
8 A_counts: list of strings
9 C_counts: list of strings
10 known_jacobian: int
11 principally_polarizable: int
12 decomposition: list of pairs [string, int]
13 brauer_invariants: list of strings
14 places: list of lists of lists of strings
15 primitive_models: list of strings ("" if unfilled)
16 number_field: string ("" if non-simple)
17 galois_n: int ("" if non-simple)
18 galois_t: int ("" if unknown/non-simple)

The indices are _id, label, g, q, polynomial, p_rank, slopes, A_counts, C_counts, known_jacobian, principally_polarizable, and decomposition.

I'm planning on redoing the galois groups as BSON pairs and adding them as a searchable field. David


edgarcosta commented 7 years ago

@AndrewVSutherland and @JohnCremona are the masters of designing indices and picking the right data formats.

I think the key question is which fields you would like to be able to search, and with what kinds of searches. That will most likely dictate the data types. For example, I believe that lists are not ideal to search on... I think we learned that with number fields, but I'm not completely sure.

Regarding the indices: at the moment you don't have any compound indices. You may want to consider some, as you really want your indices to provide some selectivity. Looking at the last queries I would suggest:

[q, g, p_rank]
[q, g, principally_polarizable]
[q, g, primitive_models]
[q, g, known_jacobian]
[q, g, decomposition]
[q, g, label]  # for sorting

and if some of these are not selective enough, I would go four deep.

Usually you have nothing to lose by making all of the above four deep, as all the prefixes will be available. See: https://docs.mongodb.com/manual/core/index-compound/#prefixes
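The prefix rule can be checked mechanically; a sketch (not pymongo, just the rule itself): an index on [q, g, p_rank, label] also serves queries whose equality fields are exactly [q], [q, g], or [q, g, p_rank].

```python
def covered_by_prefix(index_keys, query_fields):
    # An equality query is served by a prefix of a compound index when its
    # fields are exactly the first len(query_fields) keys of the index
    # (MongoDB can reorder equality predicates among themselves).
    query = set(query_fields)
    return set(index_keys[: len(query)]) == query

idx = ["q", "g", "p_rank", "label"]
print(covered_by_prefix(idx, ["q", "g"]))       # covered: it's a prefix
print(covered_by_prefix(idx, ["g", "p_rank"]))  # not covered: prefix must start at q
```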

I think I found an example of why lists are bad. Even though you indexed polynomial, when you queried polynomial entries together with p_rank, MongoDB only used p_rank to reduce the sample size; after that it most likely looped over all the remaining documents...

abvar.fq_isog command: find { find: "fq_isog", filter: { polynomial.4: 9, p_rank: 2, polynomial.1: 1, polynomial.2: -1, polynomial.3: 3 }, limit: 50 } planSummary: IXSCAN { p_rank: 1 } keysExamined:644562 docsExamined:644562 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:5057 nreturned:7 reslen:9050 locks:{ Global: { acquireCount: { r: 10116 } }, Database: { acquireCount: { r: 5058 } }, Collection: { acquireCount: { r: 5058 } } } protocol:op_query 6291ms
edgarcosta commented 7 years ago

I looked a bit into your search fields, and I would consider converting almost every list into a string that represents the list. In most cases you are trying to match the first elements of an entry, and there is no hope of using multikey indexes for that (see https://docs.mongodb.com/manual/core/index-multikey/#examples). However, matching the beginning of a string should be much faster, I hope...

Alternatively, you could store each array element by itself... and then use a compound index: [polynomial0, polynomial1, polynomial2, ..., polynomial12]. However, one should then use smaller key names; otherwise most of our space is dedicated to keys instead of values...
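The split-out layout could be produced by a one-pass migration; a sketch (the helper name, the short "poly" prefix, and the sample document are illustrative, not an agreed schema):

```python
def split_list_field(doc, field, prefix):
    # Copy the document, replacing one list field with numbered scalar
    # fields prefix0, prefix1, ... (short names, per the note on key size).
    out = {k: v for k, v in doc.items() if k != field}
    for i, v in enumerate(doc.get(field, [])):
        out["%s%d" % (prefix, i)] = v
    return out

doc = {"label": "2.9.ab_d", "polynomial": [1, -1, 3, -9, 81]}
print(split_list_field(doc, "polynomial", "poly"))
```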

roed314 commented 7 years ago

I don't think the string approach will have the same functionality, since you can't search by ranges. Storing each array element by itself could work, though it's a bit messy. I'd be fine with having polynomial, A_counts and C_counts split out into separate fields, with compound indices set up for searching based on initial segments.

The search based on decomposition type is more complicated. I guess we could split it out in a similar way, though writing the parsing code that's currently implemented took me a couple days....

It may also be worth adding simple fields for is_simple and is_primitive, since those are currently read off of the decomposition and primitive_model fields respectively, which aren't indexed. David
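A sketch of how those derived fields might be filled in, under two assumptions that go beyond what's stated here: decomposition is a list of [label, multiplicity] pairs (as in the layout above, so simple means a single factor with multiplicity 1), and an empty primitive_models means the variety is itself primitive.

```python
def add_derived_fields(doc):
    # Assumption: simple iff the decomposition has one factor of multiplicity 1.
    dec = doc.get("decomposition", [])
    doc["is_simple"] = (len(dec) == 1 and dec[0][1] == 1)
    # Assumption: an empty primitive_models ("" or []) means primitive.
    doc["is_primitive"] = not doc.get("primitive_models")
    return doc

d = add_derived_fields({"decomposition": [["1.2.b", 2]], "primitive_models": ""})
print(d["is_simple"], d["is_primitive"])
```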


JohnCremona commented 7 years ago

A lot of the issues arising here arise all over the place: many of the collections were set up when we (collectively) had little idea about which data layouts were efficient. @roed314, this is not just about "your" collection! Lists are particularly problematic.

Here's just one example: in elliptic_curves.curves (for elliptic curves over Q) there's a field 'ainvs', a list of 5 strings for the 5 a-invariants -- strings rather than ints since a4 and a6 can be very large. But we replaced it with 'xainvs', a single string: ainvs: [u'0', u'-1', u'1', u'-10', u'-20'] becomes xainvs: u'[0,-1,1,-10,-20]'. We no longer use the original ainvs (though I have never got around to deleting it). The xainvs field is good for searching for a specific curve (which we do sometimes), and we never want to search for curves with a_1=1, say.
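The xainvs encoding described above is easy to make round-trippable; a minimal sketch: a list of integers becomes one canonical string with no spaces, so an exact-match lookup on an ordinary string index finds a specific curve.

```python
def to_xainvs(ainvs):
    # Canonical string form: no spaces, so equal lists give equal strings.
    return "[" + ",".join(str(a) for a in ainvs) + "]"

def from_xainvs(s):
    # Inverse: strip the brackets and parse the comma-separated ints.
    return [int(a) for a in s[1:-1].split(",")]

s = to_xainvs([0, -1, 1, -10, -20])
print(s)                                      # [0,-1,1,-10,-20]
print(from_xainvs(s) == [0, -1, 1, -10, -20]) # True
```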

roed314 commented 7 years ago

I don't think it would be too bad to switch to a string for the polynomial coefficients, though note that it does make some sense to do a search with a range on one (or more) of the coefficients since that's how the algorithm for searching for these polynomials works.

For the point count lists though, Drew specifically requested the ability to search by range, so I don't think a string is the way to go. David


edgarcosta commented 7 years ago

Then I would really consider storing the first elements of each list individually, so those can be used to reduce the haystack in fancier searches. At the end of the day, I don't think we can have a solution that is both pretty and robust :(


AndrewVSutherland commented 7 years ago

Please correct me if I am wrong @roed314, but at the moment the only searches we support on counts are prefix searches (e.g. typing [10,20] will match [10,20], [10,20,40], ...). These would still work (and work much more quickly) using a string. I could also see an argument for adding a separate field with just the point count over the base field (possibly stored as an integer rather than a string) that we could do ranged queries on (but see below for a more elegant long-term solution). But supporting range queries on point counts would require modifying the browse page and is not something I think needs to be done before we put this into production, which I would like to do soon.
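Prefix search over string-encoded count lists can be made exact; a sketch of the matching rule (pure Python, not a MongoDB query): encode [10,20] as "[10,20", and a list matches when its encoded string continues from that prefix with ',' or ']', so [10,200] does not falsely match [10,20].

```python
def encode(counts):
    # Same canonical no-spaces encoding as the xainvs example.
    return "[" + ",".join(str(c) for c in counts) + "]"

def prefix_pattern(prefix_counts):
    # Deliberately no closing bracket: the prefix may continue.
    return "[" + ",".join(str(c) for c in prefix_counts)

def matches(counts, prefix_counts):
    s, p = encode(counts), prefix_pattern(prefix_counts)
    # The next character must end the number: ',' (more entries) or ']' (done).
    return s.startswith(p) and s[len(p)] in ",]"

print(matches([10, 20, 40], [10, 20]))  # prefix match
print(matches([10, 200], [10, 20]))     # not a match: 200 != 20
```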

Speeding up the searches is more urgent and I think this really does need to be done before we go into production (inefficient queries are likely to become even more so on the cloud system because of differences between mmap and wired tiger). For example, the query:

http://www.lmfdb.org/Variety/Abelian/Fq/?abvar_point_count=1

currently takes about 7 seconds on the production site, versus about 3 seconds on beta.

From a database design perspective, the "right" solution to this issue is to normalize the data (i.e. put it in relational form using multiple tables, each of which has "flat" rows). This would mean creating a separate table with rows (abvar-label, ext-degree, A_count, C_count) that has g rows for each abelian variety of dimension g, corresponding to extensions of degree 1, 2, ..., g. One can then index both the degree and count fields and efficiently support all sorts of complicated queries.

This solution is both "pretty" and robust, but it would require non-trivial changes to the way the existing query works, whereas changing the current list to a string would be straightforward and would solve the immediate performance issue (my guess is 7 seconds would drop to under half a second).
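The normalized layout described above could be generated from the existing documents; a sketch (row field names and the sample label are illustrative, not an agreed schema):

```python
def normalize_counts(doc):
    # One row per extension degree 1..g, so both ext_degree and the counts
    # can carry ordinary scalar indexes.
    rows = []
    for i, (a, c) in enumerate(zip(doc["A_counts"], doc["C_counts"]), start=1):
        rows.append({"abvar_label": doc["label"], "ext_degree": i,
                     "A_count": a, "C_count": c})
    return rows

doc = {"label": "2.9.ab_d", "A_counts": ["87", "6831"], "C_counts": ["9", "87"]}
for r in normalize_counts(doc):
    print(r)
```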

edgarcosta commented 7 years ago

Different from what I had in mind, but indeed pretty and robust.

Today, while trying to add indexes to this collection and fixing other things, I realized that the queries were never using the multikey index, and then I had an epiphany: multikey indexes are reverse lookups for their elements. Even though MongoDB is aware of this, it doesn't use the reverse lookups unless we explicitly ask.

Here is an example: if we do

find({ C_counts.0: "9", C_counts.1: "87" }).sort({ g: 1, q: 1, label: 1, p_rank: 1 }).limit(50)

it is not going to use any decent index; it defaults to the one we use to sort the results, and it takes 4.1s.

2017-09-06T16:10:21.193+0000 I COMMAND  [conn697362] command abvar.fq_isog command: find { find: "fq_isog", filter: { C_counts.0: "9", C_counts.1: "87" }, sort: { g: 1, q: 1, label: 1, p_rank: 1 }, limit: 50 } planSummary: IXSCAN { g: 1, q: 1, label: 1, p_rank: 1 } keysExamined:645751 docsExamined:645751 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:5045 nreturned:50 reslen:55770 locks:{ Global: { acquireCount: { r: 10092 } }, Database: { acquireCount: { r: 5046 } }, Collection: { acquireCount: { r: 5046 } } } protocol:op_query 4170ms

However, if we do

find({ C_counts: { $all: [ "9", "87" ] }, C_counts.1: "87", C_counts.0: "9" }).sort({ g: 1, q: 1, label: 1, p_rank: 1 }).limit(50)

now it finally uses its dedicated multikey index: it basically searches for all the objects where 9 and 87 appear in C_counts, and then filters for documents whose first entry is indeed 9 and whose second is 87. This gives a huge speedup; it now takes only 167ms.

2017-09-06T16:09:34.575+0000 I COMMAND  [conn697362] command abvar.fq_isog command: find { find: "fq_isog", filter: { C_counts: { $all: [ "9", "87" ] }, C_counts.1: "87", C_counts.0: "9" }, sort: { g: 1, q: 1, label: 1, p_rank: 1 }, limit: 50 } planSummary: IXSCAN { C_counts: 1, g: 1, q: 1, label: 1, p_rank: 1 } keysExamined:2274 docsExamined:2274 fromMultiPlanner:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:90 nreturned:50 reslen:55770 locks:{ Global: { acquireCount: { r: 182 } }, Database: { acquireCount: { r: 91 } }, Collection: { acquireCount: { r: 91 } } } protocol:op_query 167ms
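The rewritten filter above can be built mechanically from the desired prefix; a minimal sketch (the helper name is illustrative):

```python
def prefix_filter(field, values):
    # Redundant $all clause lets MongoDB use the multikey index on the list
    # field; the dotted-path equalities then pin each position in the prefix.
    q = {field: {"$all": list(values)}}
    for i, v in enumerate(values):
        q["%s.%d" % (field, i)] = v
    return q

# Reproduces the filter from the fast query above:
print(prefix_filter("C_counts", ["9", "87"]))
```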

At the moment, I added too many indexes to the database, basically every compound index. I will drop some soon.

On a side note, I assume we will be sorting our results by [g, q, label, p_rank]. I will make a pull request soon enough to address this (I have already modified the tests to reflect this).

AndrewVSutherland commented 7 years ago

@edgarcosta Nice! This is a quick and easy fix.

edgarcosta commented 7 years ago

I did some tests and the operator $all is great; the others, not so much...

searching for initial coefficients = [9, 87-100]

The naive method: 6s (no index used):

2017-09-06T18:27:03.639+0000 I COMMAND  [conn698695] command abvar.fq_isog command: count { count: "fq_isog", query: { polynomial.1: 9, polynomial.2: { $lte: 100, $gte: 87 } } } planSummary: COLLSCAN keyUpdates:0 writeConflicts:0 numYields:10683 reslen:62 locks:{ Global: { acquireCount: { r: 21368 } }, Database: { acquireCount: { r: 10684 } }, Collection: { acquireCount: { r: 10684 } } } protocol:op_query 5778ms
2017-09-06T18:27:03.904+0000 I COMMAND  [conn698695] command abvar.fq_isog command: find { find: "fq_isog", filter: { polynomial.1: 9, polynomial.2: { $lte: 100, $gte: 87 } }, sort: { g: 1, q: 1, label: 1, p_rank: 1 }, limit: 50 } planSummary: IXSCAN { g: 1, q: 1, label: 1, p_rank: 1 } keysExamined:29370 docsExamined:29370 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:229 nreturned:50 reslen:54141 locks:{ Global: { acquireCount: { r: 460 } }, Database: { acquireCount: { r: 230 } }, Collection: { acquireCount: { r: 230 } } } protocol:op_query 228ms

Using $all on the 9 takes 3s (polynomial index used):

2017-09-06T18:28:15.470+0000 I COMMAND  [conn698717] command abvar.fq_isog command: count { count: "fq_isog", query: { polynomial: { $all: [ 9 ] }, polynomial.1: 9, polynomial.2: { $lte: 100, $gte: 87 } } } planSummary: IXSCAN { polynomial: 1, g: 1, q: 1, label: 1, p_rank: 1 } keyUpdates:0 writeConflicts:0 numYields:796 reslen:62 locks:{ Global: { acquireCount: { r: 1594 } }, Database: { acquireCount: { r: 797 } }, Collection: { acquireCount: { r: 797 } } } protocol:op_query 2896ms

Adding $lte and $gte on the polynomial takes 17s (the index on polynomial is used, but it doesn't reduce the search space):

edgarcosta commented 7 years ago

I wasn't adding the $lte and $gte correctly; I need to use $elemMatch. However, we can only use that operator once inside $all. We could add more by wrapping everything in an $and, but that doesn't help us at all.

Now the same query is down to 1.7s.

I'll push a pull request tomorrow once I clean up my code.

roed314 commented 7 years ago

Great. I was working on some Sage stuff today and have a bunch of teaching tomorrow, but I'm planning to work on updating the abvar database on Friday: adding the number field data from John, marking some curves as known non-Jacobians due to some observations of Noam, and fixing the non-indexed lists pointed out in this thread. Edgar, if you do make some more changes tomorrow, please let me know so that we can avoid merge conflicts. David


edgarcosta commented 7 years ago

@roed314 I'm done touching abvar code.

roed314 commented 7 years ago

I've written the script for updating the collection, but I had some questions before running it.

Here are my planned changes to the data:

Thanks! David


AndrewVSutherland commented 7 years ago

On 2017-09-09 11:22, roed314 wrote:

I've written the script for updating the collection, but I had some questions before running it.

  • My current plan is to add new fields but not yet remove anything (and I'll keep all the old indexes, as well as add new ones for the new fields). I'll then submit a pull request to change the website code to use the new format. Once this request is merged, then I'll run another rewrite command to remove the old fields and old indexes. Does that sound reasonable?

Yes, this is the right approach.

  • When one runs rewrite_collection, you provide a new collection name.
    Is the standard procedure to check that the result is sensible, then delete the old collection and rename the new one so that it has the original name? Or is there something else I should be doing instead?

Exactly.

  • Currently, abvar.fq_isog has some indexes that seem redundant.
    Namely, it has a bunch of indexes on single fields, but I thought that the first field in a multi-index would provide the same functionality (for example, it has both [(u'p_rank', 1)] and [(u'p_rank', 1), (u'polynomial', 1)]). Also, it seems redundant to include 'p_rank' after 'label' since there is only one row with each label. Or am I missing something?

You are correct on both counts, although having a single-field index that is a prefix of a compound index can make sense: e.g. ranged queries on the single key will likely be faster. It mostly depends on how big the remaining keys are, but if you have a very short first key followed by a much longer second key and you actually expect to run queries that only specify the first key, it's worth having an index on the first key alone, which may apply in this case.

  • If I want to sort by the coefficients of the Weil polynomial, lexicographically and numerically (ie, not as strings), should I create a new string field which has the same sort order? Or is it best to keep the field that has the coefficients as ints around, index on it, and use that for sorting?

I expect the former will be significantly faster.
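One way to build such a string field is fixed-width zero padding with an offset so that string order agrees with numeric lexicographic order, including for negative coefficients. A sketch under assumptions (the function name, offset, and width are made up; it requires every coefficient to be smaller than the offset in absolute value):

```python
OFFSET = 10**6   # assumed bound: all |coefficients| < OFFSET
WIDTH = 7        # enough digits to hold any value in 0 .. 2*OFFSET - 1

def poly_sort_key(coeffs):
    """Encode a list of ints as a string whose lexicographic order
    matches the numeric lexicographic order of the lists.

    Each coefficient is shifted by OFFSET (so negatives become positive)
    and zero-padded to a fixed width, then the pieces are concatenated.
    """
    assert all(abs(c) < OFFSET for c in coeffs)
    return ''.join(str(c + OFFSET).zfill(WIDTH) for c in coeffs)
```

Sorting on this field with a plain ascending index then reproduces the numeric coefficient order without any custom sort support from mongo.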

Here are my planned changes to the data:

  • change the name angle_ranks -> angle_rank (it was a typo)
  • add a bunch of number fields and Galois group data that John Jones computed
  • mark some classes as not-Jacobians due to some observations of Noam
  • change 'slopes' (list of strings) to 'np_slopes' (space separated string)
  • change 'C_counts' (list of strings) to 'curve_counts' (space separated string)
  • add field 'point_count' as an int, giving the point count of the virtual curve
  • change 'A_counts' (list of strings) to 'abvar_counts' (space separated string)
  • change 'polynomial' (list of ints) to 'weil_poly' (space separated string)
  • change the two fields 'galois_n' and 'galois_t' to a BSON pair 'galois'
  • add boolean fields 'is_simple' and 'is_primitive' for better searching

Thanks! David


JohnCremona commented 7 years ago

On 9 September 2017 at 10:22, roed314 notifications@github.com wrote:

I've written the script for updating the collection, but I had some questions before running it.

  • My current plan is to add new fields but not yet remove anything (and I'll keep all the old indexes, as well as add new ones for the new fields). I'll then submit a pull request to change the website code to use the new format. Once this request is merged, then I'll run another rewrite command to remove the old fields and old indexes. Does that sound reasonable?

That would work. One always has to be careful when both code and data change. The second step is not urgent.

  • When one runs rewrite_collection, you provide a new collection name. Is the standard procedure to check that the result is sensible, then delete the old collection and rename the new one so that it has the original name? Or is there something else I should be doing instead?

No. It never deletes or renames anything itself. The standard is to take collection 'coll' and rewrite it to 'coll.new', which must not exist. You can test your code against coll.new by changing the collection name in your code (which should be written so that this lives in one place). When ready, rename coll --> coll.old and coll.new --> coll (which is instantaneous). This is all on beta, so the renaming should be done right when the new code is merged to beta.

  • Currently, abvar.fq_isog has some indexes that seem redundant. Namely, it has a bunch of indexes on single fields, but I thought that the first field in a multi-index would provide the same functionality (for example, it has both [(u'p_rank', 1)] and [(u'p_rank', 1), (u'polynomial', 1)]). Also, it seems redundant to include 'p_rank' after 'label' since there is only one row with each label. Or am I missing something?

Yes, you are, but it is very hard to know how mongo manages these. If you delete your 'rank' index then searching by rank will be slow even though you still keep the rank+polynomial index. I think your last point is correct, though: since the label uniquely identifies an object, I cannot see why one would want any index including label except for the label-only index itself. Others may correct me!

  • If I want to sort by the coefficients of the Weil polynomial, lexicographically and numerically (ie, not as strings), should I create a new string field which has the same sort order? Or is it best to keep the field that has the coefficients as ints around, index on it, and use that for sorting?

Well, mongo does not allow customising the sort, so either do what you suggest or do the sorting after getting back the search results. I do the former in a few places: elliptic curve isogeny class labels a,b,...,z,ba,bb,...,bz,ca,... cannot be sorted as strings, so I have an additional field which is an int with the same value as the implied base-26 number of the label.
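The base-26 conversion John describes can be sketched as follows (the function name is made up; this assumes the usual convention a = 0, ..., z = 25, so that 'ba' = 26):

```python
def class_label_to_int(label):
    """Interpret an isogeny class letter label as its implied base-26
    number, with 'a' = 0, ..., 'z' = 25, 'ba' = 26, 'bb' = 27, etc."""
    n = 0
    for ch in label:
        n = 26 * n + ord(ch) - ord('a')
    return n
```

Storing this integer alongside the label lets a plain ascending index sort the classes in their intended order.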

Here are my planned changes to the data:

  • change the name angle_ranks -> angle_rank (it was a typo)

Fine. You might consider making that string even shorter if you are changing it anyway. Remember that every entry in the db contains all of these strings.

  • add a bunch of number fields and Galois group data that John Jones computed

Fine

  • mark some classes as not-Jacobians due to some observations of Noam

Fine

  • change 'slopes' (list of strings) to 'np_slopes' (space separated string)

Why change the name?

  • change 'C_counts' (list of strings) to 'curve_counts' (space separated string)

Ditto? Keep the keys short.

  • add field 'point_count' as an int, giving the point count of the virtual curve

Fine

  • change 'A_counts' (list of strings) to 'abvar_counts' (space separated string)

Ditto i.e. why?

  • change 'polynomial' (list of ints) to 'weil_poly' (space separated string)

or just 'poly'

  • change the two fields 'galois_n' and 'galois_t' to a BSON pair 'galois'

ok

  • add boolean fields 'is_simple' and 'is_primitive' for better searching

ok

Thanks! David

Others might have better / other views on any of the above.


JohnCremona commented 7 years ago

(I only saw @AndrewVSutherland 's replies after writing my own. Sorry.)

roed314 commented 7 years ago

No problem about the double reply! You both had useful things to say.

I will go ahead and add a sorting key and remove p_rank from the end of the indexes.

It's frustrating that the key is stored with every record, but I understand that's just a constraint we have to live with. I was changing the names because, in the strategy I described to Drew, I need to keep the old name around. How about the following name changes?

  • polynomial -> poly
  • angel_numbers -> angles
  • angle_ranks -> ang_rank
  • slopes -> slps
  • A_counts -> A_cnts
  • C_counts -> C_cnts
  • known_jacobian -> is_jac
  • principally_polarizable -> is_pp
  • decomposition -> decomp
  • brauer_invariants -> brauer_invs
  • primitive_models -> prim_models
  • number_field -> nf
  • galois_n & galois_t -> gal

David
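Applied to a single record, the rename map David proposes could look like this sketch (`RENAMES` and `migrate_doc` are hypothetical names; the galois_n/galois_t pair is folded into a single 'gal' entry as proposed):

```python
# Proposed old-name -> new-name map, taken from the list above.
RENAMES = {
    'polynomial': 'poly', 'angel_numbers': 'angles', 'angle_ranks': 'ang_rank',
    'slopes': 'slps', 'A_counts': 'A_cnts', 'C_counts': 'C_cnts',
    'known_jacobian': 'is_jac', 'principally_polarizable': 'is_pp',
    'decomposition': 'decomp', 'brauer_invariants': 'brauer_invs',
    'primitive_models': 'prim_models', 'number_field': 'nf',
}

def migrate_doc(doc):
    """Return a copy of doc with the planned key renames applied,
    combining galois_n and galois_t into a single 'gal' pair."""
    out = {}
    for key, value in doc.items():
        if key in ('galois_n', 'galois_t'):
            continue  # handled together below
        out[RENAMES.get(key, key)] = value
    if 'galois_n' in doc and 'galois_t' in doc:
        out['gal'] = [doc['galois_n'], doc['galois_t']]
    return out
```

A rewrite_collection pass would apply this to every document, writing the result to the new collection.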


roed314 commented 6 years ago

Sorry for the delay: I got a bit sick over the weekend, and figuring out the sort key took a bit of time.

I've made the changes to the collection (the old collection is still around at abvar.old_fq_isog until we decide there are no problems), and there's a pull request to take advantage of them: https://github.com/LMFDB/lmfdb/pull/2270.

I tried to add a bunch of indexes and discovered the limit of 64 indexes per collection. I could use some advice on which compound indexes to add, since there seem to be so many possible searches.

I also need to update inventory repo, but that's not going to happen for a bit since I'm leaving on a trip tomorrow. David


edgarcosta commented 6 years ago
  • Currently, abvar.fq_isog has some indexes that seem redundant. Namely, it has a bunch of indexes on single fields, but I thought that the first field in a multi-index would provide the same functionality (for example, it has both [(u'p_rank', 1)] and [(u'p_rank', 1), (u'polynomial', 1)]). Also, it seems redundant to include 'p_rank' after 'label' since there is only one row with each label. Or am I missing something?

You are correct, as they have pointed out. I was the one who decided to sort on [('g', ASCENDING), ('q', ASCENDING), ('label', ASCENDING), ('p_rank', ASCENDING)] without realizing that p_rank was pointless. I'm not so sure about the single-field vs compound index claim, though. I read somewhere in the past that compound indexes are usually faster; I cannot find that exact quote anymore, but the best I could find is:

If you have a collection that has both a compound index and an index on its prefix (e.g. { a: 1, b: 1 } and { a: 1 }), if neither index has a sparse or unique constraint, then you can remove the index on the prefix (e.g. { a: 1 }). MongoDB will use the compound index in all of the situations that it would have used the prefix index.

from https://docs.mongodb.com/manual/core/index-compound/

It also might be worth it to try unique indexes (https://docs.mongodb.com/manual/core/index-unique/) for indexes involving label or other unique combinations.

I'm sorry for not giving any straightforward solutions, but this index thing is really tricky.
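The prefix rule quoted from the docs can be turned into a quick consistency check over a list of index key specs (e.g. as returned by pymongo's index_information()). A sketch, with a made-up function name, where each index is a list of (field, direction) pairs:

```python
def redundant_prefixes(indexes):
    """Return the indexes that are proper prefixes of another index in
    the list and so, absent unique/sparse constraints, are removable:
    MongoDB can serve their queries from the longer compound index."""
    specs = [list(ix) for ix in indexes]
    return [ix for ix in specs
            if any(len(other) > len(ix) and other[:len(ix)] == ix
                   for other in specs)]
```

Running this over abvar.fq_isog's indexes would flag, for example, [('p_rank', 1)] as a prefix of [('p_rank', 1), ('polynomial', 1)].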

roed314 commented 6 years ago

I've switched over completely to the new data format and run tests; it seems to be working fine.

I tried to create a reasonable set of indexes, though I don't know if I succeeded. You can see them at https://github.com/LMFDB/lmfdb-inventory/pull/72.

I realized that I missed some number fields that can be added, but I don't think it should hold up this coming out of beta. David


AndrewVSutherland commented 6 years ago

I've updated the database on the cloud in case you want to do some performance testing there -- note that we are using wired tiger as the database engine on the cloud, and this can significantly impact query performance (in both good and bad ways).

edgarcosta commented 6 years ago

Here is the mongodb log while reading from the cloud and running the tests:

tail -f /var/log/mongodb/mongod.log | grep op_query --color
2017-09-24T20:19:26.692+0000 I COMMAND  [conn399556] command abvar.fq_isog command: count { count: "fq_isog", query: { is_jac: -1, is_pp: { $ne: -1 }, g: 3 } } planSummary: IXSCAN { g: 1, q: 1, is_jac: 1, sort: 1 } keyUpdates:0 writeConflicts:0 numYields:17 reslen:62 locks:{ Global: { acquireCount: { r: 36 } }, Database: { acquireCount: { r: 18 } }, Collection: { acquireCount: { r: 18 } } } protocol:op_query 353ms
2017-09-24T20:19:28.435+0000 I COMMAND  [conn399556] command abvar.fq_isog command: count { count: "fq_isog", query: { g: 2 } } planSummary: COUNT_SCAN { g: 1, q: 1, sort: 1 } keyUpdates:0 writeConflicts:0 numYields:4819 reslen:62 locks:{ Global: { acquireCount: { r: 9640 } }, Database: { acquireCount: { r: 4820 } }, Collection: { acquireCount: { r: 4820 } } } protocol:op_query 141ms
2017-09-24T20:19:29.377+0000 I COMMAND  [conn399556] command abvar.fq_isog command: count { count: "fq_isog", query: { g: 2 } } planSummary: COUNT_SCAN { g: 1, q: 1, sort: 1 } keyUpdates:0 writeConflicts:0 numYields:4819 reslen:62 locks:{ Global: { acquireCount: { r: 9640 } }, Database: { acquireCount: { r: 4820 } }, Collection: { acquireCount: { r: 4820 } } } protocol:op_query 128ms
2017-09-24T20:19:30.324+0000 I COMMAND  [conn399556] command abvar.fq_isog command: count { count: "fq_isog", query: { g: 2 } } planSummary: COUNT_SCAN { g: 1, q: 1, sort: 1 } keyUpdates:0 writeConflicts:0 numYields:4819 reslen:62 locks:{ Global: { acquireCount: { r: 9640 } }, Database: { acquireCount: { r: 4820 } }, Collection: { acquireCount: { r: 4820 } } } protocol:op_query 127ms
2017-09-24T20:19:43.609+0000 I COMMAND  [conn399556] command abvar.fq_isog command: count { count: "fq_isog", query: { $or: [ { decomp.0.0: { $regex: "^1" }, decomp.0.1: 1, decomp: { $size: 2 }, decomp.1.1: 1, decomp.1.0: "3.5.ah_y_ach" } ] } } planSummary: COLLSCAN keyUpdates:0 writeConflicts:0 numYields:10702 reslen:62 locks:{ Global: { acquireCount: { r: 21406 } }, Database: { acquireCount: { r: 10703 } }, Collection: { acquireCount: { r: 10703 } } } protocol:op_query 10341ms
2017-09-24T20:19:49.988+0000 I COMMAND  [conn399556] command abvar.fq_isog command: find { find: "fq_isog", filter: { $or: [ { decomp.0.0: { $regex: "^1" }, decomp.0.1: 1, decomp: { $size: 2 }, decomp.1.1: 1, decomp.1.0: "3.5.ah_y_ach" } ] }, sort: { sort: 1 }, limit: 50 } planSummary: IXSCAN { sort: 1 } keysExamined:1367543 docsExamined:1367543 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:10684 nreturned:9 reslen:8612 locks:{ Global: { acquireCount: { r: 21370 } }, Database: { acquireCount: { r: 10685 } }, Collection: { acquireCount: { r: 10685 } } } protocol:op_query 6342ms
2017-09-24T20:19:52.245+0000 I COMMAND  [conn399556] command abvar.fq_isog command: count { count: "fq_isog", query: { $or: [ { decomp.0.0: { $regex: "^1" }, decomp.0.1: 1, decomp: { $size: 2 }, decomp.1.1: 1, decomp.1.0: "3.5.ah_y_ach" } ] } } planSummary: COLLSCAN keyUpdates:0 writeConflicts:0 numYields:10683 reslen:62 locks:{ Global: { acquireCount: { r: 21368 } }, Database: { acquireCount: { r: 10684 } }, Collection: { acquireCount: { r: 10684 } } } protocol:op_query 1463ms
2017-09-24T20:19:56.033+0000 I COMMAND  [conn399556] command abvar.fq_isog command: find { find: "fq_isog", filter: { $or: [ { decomp.0.0: { $regex: "^1" }, decomp.0.1: 1, decomp: { $size: 2 }, decomp.1.1: 1, decomp.1.0: "3.5.ah_y_ach" } ] }, sort: { sort: 1 }, limit: 50 } planSummary: IXSCAN { sort: 1 } keysExamined:1367543 docsExamined:1367543 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:10683 nreturned:9 reslen:8612 locks:{ Global: { acquireCount: { r: 21368 } }, Database: { acquireCount: { r: 10684 } }, Collection: { acquireCount: { r: 10684 } } } protocol:op_query 3752ms
roed314 commented 6 years ago

The decomposition queries are slow, as expected. I don't see how to make these queries faster using the current schema, since the decomposition is stored as a list of lists (e.g. [['2.16.am_cn',1], ['1.16.ah',2]] ). Moreover, the code for the decomposition search currently supports constructing very complicated queries.

In the short term, we could disable the ability to specify the decomposition type. In the long term, we probably need to change the way the decomposition is stored (like we're changing the storage for point counts). I'll think about ways to do this....
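One possible restructuring, for illustration only (the flattened field names are hypothetical): split the list of [label, multiplicity] pairs into parallel fields that a multikey index can cover.

```python
def flatten_decomposition(decomp):
    """Split a decomposition like [['2.16.am_cn', 1], ['1.16.ah', 2]]
    into parallel, independently indexable fields."""
    return {
        'decomp_factors': [label for label, _ in decomp],
        'decomp_mults': [mult for _, mult in decomp],
        'decomp_len': len(decomp),
    }
```

A search for "has factor X" could then use a multikey index on decomp_factors instead of a collection scan, though position-dependent constraints would still need more thought.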

What's the best way to do additional testing on the cloud? I don't think I have direct access to the logs that Edgar has been looking at; I could use something like urllib2 to request a bunch of searches from www.lmfdb.org and time them, but it seems like there should be a better way. David


AndrewVSutherland commented 6 years ago

You can use the tcurl.sh script in https://github.com/LMFDB/lmfdb/tree/master/perfmon to get end-to-end timings for a list of URLs that represent queries you want to time.

I think the short term goal should just be to have basic queries be quick (e.g. search by dimension, or cardinality of the base field, number of points), and to optimize the things that are easy to optimize.

I don't think the isogeny factor searches are so slow they should be disabled, I would leave them in (they could be really useful to someone who needs them, for a similarly slow/useful example, searching for number fields not ramified at a given set of primes can be really slow, but we still support it).

And I notice that the isogeny factor searches are much faster if you specify the base field -- could you perhaps have the query processor do this automatically even if the user doesn't think to, i.e. extract it from the factor(s) they specify? Based on the small amount of testing I did, this might make a significant difference, e.g. compare:

http://www.lmfdb.org/Variety/Abelian/Fq/?start=0&count=50&q=&g=&p_rank=&angle_ranks=&newton_polygon=&initial_coefficients=&abvar_point_count=&curve_point_count=&decomposition=%5B1.137.v%5D&number_field=&simple=any&primitive=any&jacobian=any&polarizable=any

and

http://www.lmfdb.org/Variety/Abelian/Fq/?start=0&count=50&q=137&g=&p_rank=&angle_ranks=&newton_polygon=&initial_coefficients=&abvar_point_count=&curve_point_count=&decomposition=%5B1.137.v%5D&number_field=&simple=any&primitive=any&jacobian=any&polarizable=any
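Since isogeny class labels have the shape g.q.isoclass, the base field can be read off any factor the user supplies. A sketch of the extraction Drew suggests (the helper names are made up):

```python
def base_field_from_label(label):
    """Extract q from a label of the form 'g.q.isoclass', e.g. '1.137.v'."""
    g, q, _ = label.split('.')
    return int(q)

def add_implied_base_field(query, factor_labels):
    """If the search does not already constrain q, pin it down using the
    base field implied by the decomposition factors (all factors of a
    decomposition live over the same field)."""
    qs = {base_field_from_label(lbl) for lbl in factor_labels}
    if 'q' not in query and len(qs) == 1:
        query['q'] = qs.pop()
    return query
```

With q pinned down, the query can use the indexes that lead with q instead of scanning every base field.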

Two more quick comments:

roed314 commented 6 years ago

I've done some performance tests (results here: https://gist.github.com/roed314/015024d3e8994f51d44007f7dabcb11d) and concluded that www.lmfdb.org isn't running the most recent version of the abelian varieties code (for example, http://www.lmfdb.org/Variety/Abelian/Fq/?abvar_point_count=14-28 returns no results, while on my computer it displays an error). Is it possible to push master to production so that we can do these tests?

AndrewVSutherland commented 6 years ago

Done -- the code should be running on www.lmfdb.org within 15 minutes.

roed314 commented 6 years ago

Something's wrong: when I go to http://www.lmfdb.org/Variety/Abelian/Fq/?polarizable=not_no I get a server error; both http://beta.lmfdb.org/Variety/Abelian/Fq/?polarizable=not_no and http://localhost:37777/Variety/Abelian/Fq/?polarizable=not_no work fine. http://www.lmfdb.org/Variety/Abelian/Fq/?polarizable=no doesn't cause an error, but also returns no results.

I've posted a list at https://gist.github.com/roed314/75fad21e98137845707ce2b6ddec0a85 showing which pages cause errors and which don't. Based on the number of bytes downloaded, it looks like even the ones that don't cause errors aren't downloading any results.

Any ideas?

AndrewVSutherland commented 6 years ago

Sorry, there was a problem with the copy and it was still using the old data format. It should be using the new data now, can you give it another try?

AndrewVSutherland commented 6 years ago

It is definitely much faster!

roed314 commented 6 years ago

Thanks! I've done some more timings at https://gist.github.com/roed314/f946eb20eed1ee357164e525a3c1c79a

The main thing that I notice is that the not_no and not_yes queries are slow. These are doing things like {'is_jac': {'$ne': 1}}. Is there a better way to construct such queries?

Of course, there are also slow queries that aren't handled well by the indexes, but we're not going to be able to solve this problem completely: there are too many things you might want to search for, and MongoDB only supports 64 indexes on a collection.
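A toy illustration (not LMFDB code) of why `$ne` is so much slower than a relational operator: an index is essentially a sorted list of values, so a range predicate like `$lt` can stop after one binary search, while `$ne` still has to examine every entry. The names below are made up for the sketch.

```python
from bisect import bisect_left

# Hypothetical index on is_jac: Mongo keeps the indexed values sorted.
docs = sorted([-1, 0, 1, 1, 0, -1, 1])

# {'is_jac': {'$ne': 1}} cannot use the sort order: every entry is checked.
ne_matches = [v for v in docs if v != 1]

# {'is_jac': {'$lt': 1}} is a range scan: one bisect finds the cut-off.
lt_matches = docs[: bisect_left(docs, 1)]

# When the only values are -1, 0, 1, the two queries return the same set,
# but the range scan touches far fewer entries on a large collection.
assert ne_matches == lt_matches
```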

roed314 commented 6 years ago

Another run: https://gist.github.com/roed314/a38e4221fd96bd91dc2e653a55b7bb28, which includes square brackets (I had to change them to %5B and %5D to make curl happy).

I don't notice any new issues.

AndrewVSutherland commented 6 years ago

@roed314 Most of the timings look quite good!

Regarding your question, using {'is_jac': {'$lt': 1}} will give the same result and be much faster than {'$ne': 1} (null values sort before 0). But this won't help with {'is_jac': {'$ne': 0}}. It's silly, but the fastest thing to do would be to have 'is_jac' (and 'is_principal') take 3 values, -1, 0, 1, corresponding to no, unknown, yes; then you could support every search option with a single relational operator (using $ne or a disjunction is much slower).
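A minimal sketch of the three-valued encoding suggested above; the helper name and option strings are hypothetical, assuming is_jac is stored as -1 (no), 0 (unknown), 1 (yes). Each search option then reduces to a single relational operator that an index on is_jac can satisfy with a range scan.

```python
def jacobian_clause(option):
    """Hypothetical helper mapping the four search options to Mongo
    query clauses, assuming is_jac in {-1, 0, 1} = {no, unknown, yes}.
    No clause needs $ne or a disjunction."""
    return {
        'yes':     {'is_jac': 1},
        'no':      {'is_jac': -1},
        'not_no':  {'is_jac': {'$gt': -1}},  # unknown or yes
        'not_yes': {'is_jac': {'$lt': 1}},   # unknown or no
    }[option]
```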

One strange thing I noticed is that if you type 1 into Leading Coefficients (rather than [1]) it is both slow and gives the wrong answer; compare:

http://www.lmfdb.org/Variety/Abelian/Fq/?start=0&count=50&q=&g=&p_rank=&ang_rank=&newton_polygon=&initial_coefficients=1&abvar_point_count=&curve_point_count=&decomposition=&number_field=&simple=any&primitive=any&jacobian=any&polarizable=not_yes

http://www.lmfdb.org/Variety/Abelian/Fq/?start=0&count=50&q=&g=&p_rank=&ang_rank=&newton_polygon=&initial_coefficients=%5B1%5D&abvar_point_count=&curve_point_count=&decomposition=&number_field=&simple=any&primitive=any&jacobian=any&polarizable=not_yes

Oddly, if you put in an unbracketed 2, it quickly returns an empty list of search results (whereas [2] does the right thing).

When parsing these queries you should either automatically put brackets around unbracketed inputs, or display an error telling the user that it is expecting a bracketed list. But I'm puzzled as to why using 1 matches a bunch (but not all) of the records, while using 2 matches none of them.
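One way to handle both suggestions at once is to normalize the input before building the query: wrap a bare integer list in brackets, and reject anything that still isn't a bracketed list of integers. This is a hypothetical sketch, not the actual LMFDB parser, and the function name is made up.

```python
import re

def normalize_bracketed_list(inp):
    """Hypothetical input normalizer: wrap an unbracketed
    comma-separated list of integers in brackets, and raise an
    error for input that is not a list of integers at all."""
    inp = inp.strip()
    if not inp:
        return inp  # empty search box: no constraint
    if not (inp.startswith('[') and inp.endswith(']')):
        inp = '[%s]' % inp
    if not re.match(r'^\[\s*-?\d+(\s*,\s*-?\d+)*\s*\]$', inp):
        raise ValueError("expected a bracketed list of integers, e.g. [1,-2]")
    return inp
```

With this in place, a bare `1` becomes `[1]` before it reaches the query builder, and genuinely malformed input produces an error message instead of silently matching the wrong records.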

roed314 commented 6 years ago

Thanks! Check out https://github.com/LMFDB/lmfdb/pull/2284, which I think addresses both problems (though I'm still running tests).

edgarcosta commented 6 years ago

Great work @roed314!

roed314 commented 6 years ago

Thanks!

JohnCremona commented 6 years ago

Can this be closed? I see that on the cloud db there are collections abvar.fq_isog_stats and fq_isog.stats which are similar -- perhaps one can be deleted? Only the one with a . is referenced in the code. Dropping the extra one on the beta database is easy, but I'm not sure I know how to drop one on the cloud.