alaniwi opened this issue 4 years ago (status: Open)
@nathanlcarlson is working on a change for these File queries to go to the originating esg-search with distrib=false.
@alaniwi Yea, it also seems this same pattern is used for wget script retrieval
As I understand it, there should not be a need for CoG to ever query a remote esg-search API.
Hi, the reason why I was suggesting it only as a fallback is as follows.

If the remote index is used, then it can use distrib=false in order to avoid searching any other shards, because it is known in that case that the "local slave" (port 8983) on that remote node will contain the relevant records. On the other hand, if the local node is used, then there is no way for CoG or esg-search to know which Solr port contains the replica of the node in question (in our case, for example, the mapping will not even appear in esgf_shards.config, because the replicas are hosted locally on another index node), so distrib=true will be needed, placing a load on all the local shards (the local slave and all the replicas). This is already the case for datasets queries, but I would guess that files queries place a bigger load because of the number of files compared to datasets.
Here is a possible alternative suggestion that could be used to mitigate this if you prefer never to query a remote search API.

In the local esgf_shards_static.xml file, add an optional way to declare to esg-search which master node a local shard is replicating. For example, where it currently has:

<value>localhost:8985/solr</value>

(or in our case something like <value>hostname:8985/solr</value>, where hostname is the node that hosts the replicas), it could be changed to something like:

<value master="esgf-node.llnl.gov">localhost:8985/solr</value>
Then esg-search/search could be extended so that if it sees the index_node parameter in a query, it does the following:

- If index_node in the query matches the hostname of the master (determined as described below) for any shard it knows about, send the Solr query directly to that shard without the shards Solr parameter (i.e. implying distrib=false).
- Otherwise, send the query to the local slave (localhost:8983), conditionally including the shards Solr parameter as determined by the distrib= setting.

Here the hostname of the master would be determined as follows (in practice the following rules would be used to build a mapping dictionary or similar during service initialisation, which is subsequently used for any lookups):
- If the value element in the XML file has a master property, use the value of master.
- Otherwise, if the value itself has a value of the form <host>/solr or <host>:80/solr or <host>:8983/solr, use that hostname, except that if the hostname is localhost then it is replaced by the FQDN. (This rule will match the local index plus any remotes that are queried directly rather than replicated locally.)

Then CoG could be changed to add the index_node parameter based on the value found in the dataset document, in addition to the other changes discussed above, for example:
https://localhost/esg-search/search?type=File&index_node=esgf-node.llnl.gov&dataset_id=.......&format=application%2Fsolr%2Bjson&offset=0&limit=10&distrib=true
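To make the lookup rules above concrete, here is a minimal sketch in Python of how the mapping dictionary might be built at service start-up. This is purely illustrative: the function names, the sample XML, and the hostnames other than esgf-node.llnl.gov are all made up, and the real esg-search is Java rather than Python.

```python
import socket
import xml.etree.ElementTree as ET

# Hypothetical sketch of the proposed lookup rules, not actual esg-search code.
SHARDS_XML = """
<shards>
  <value master="esgf-node.llnl.gov">localhost:8985/solr</value>
  <value>localhost:8983/solr</value>
  <value>esgf-index.example.org/solr</value>
</shards>
"""

def master_hostname(elem):
    """Apply the proposed rules to one <value> element."""
    # Rule 1: an explicit master attribute wins.
    if elem.get("master"):
        return elem.get("master")
    # Rule 2: match <host>/solr, <host>:80/solr or <host>:8983/solr.
    addr = elem.text.strip()
    if not addr.endswith("/solr"):
        return None
    host, _, port = addr[:-len("/solr")].partition(":")
    if port in ("", "80", "8983"):
        # localhost is replaced by this machine's FQDN.
        return socket.getfqdn() if host == "localhost" else host
    return None

def build_master_map(xml_text):
    """Build the master-hostname -> shard mapping during initialisation."""
    mapping = {}
    for elem in ET.fromstring(xml_text).iter("value"):
        master = master_hostname(elem)
        if master is not None:
            mapping[master] = elem.text.strip()
    return mapping

mapping = build_master_map(SHARDS_XML)
# A query carrying index_node=esgf-node.llnl.gov would be routed, with
# distrib=false implied, straight to the shard declared via master=...:
print(mapping.get("esgf-node.llnl.gov"))  # localhost:8985/solr
```

Anything with no explicit master and a non-matching port (e.g. a replica on 8985 without the master attribute) simply gets no entry, so queries for it fall through to the default distributed behaviour.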
If there is a version of esg-search that contains this feature, then based on the above configuration, this would cause esg-search to query localhost:8985/solr, and the distrib=true would be ignored. If not (whether because no match is found, or because it is an older version of esg-search), then it would do a distributed search.
What do you think?
(If you think it would work, then maybe the thing to do would be to use this issue for the CoG changes but paste the esg-search related changes into a separate issue at https://github.com/ESGF/esg-search/issues .)
Hello @alaniwi ,
I agree with the need to improve this situation.
I believe any clients (CoG in this case) should be implementation unaware. Clients should be able to submit queries to any search endpoint and get the same results, for the same query. Clients should not need to play a role in the routing to shards because clients shouldn't know shards exist.
I believe we are sharding on the wrong dimension, and perhaps we shouldn't be sharding at all. We should be using the project dimension: either a) shard by project, or b) give each project its own Solr collection. Many searches are contained within a single project, yet under the current sharding system they are required to be spread across index nodes, so why shard on index node? Option "b" provides the additional benefit of allowing each project to have a more customized schema.
Query and Document routing across shards can be handled within Solr, if SolrCloud is used. This would remove the need to add a lot of logic to our search API. See here: https://lucene.apache.org/solr/guide/8_1/shards-and-indexing-data-in-solrcloud.html#document-routing
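To illustrate the SolrCloud routing idea: with SolrCloud's default compositeId router, a route key can be encoded into the document id with a "!" separator, and queries can then be restricted to the shard(s) owning that key via the _route_ request parameter. The sketch below only shows how such ids and parameters would be constructed; the project value and ESGF-style id are made up, and this is not a claim about how ESGF currently indexes documents.

```python
# Sketch of SolrCloud compositeId routing with made-up ESGF-like ids.
# With the compositeId router, everything before "!" is the route key,
# so all documents sharing one project hash to the same shard.

def routed_id(project, doc_id):
    """Prefix the document id with the project route key."""
    return f"{project}!{doc_id}"

def routed_query_params(project, query):
    """_route_ tells SolrCloud to query only the shard(s) owning that key."""
    return {"q": query, "_route_": f"{project}!"}

doc = {"id": routed_id("CMIP6", "dataset.file.nc|esgf-node.llnl.gov"),
       "project": "CMIP6"}
params = routed_query_params("CMIP6", "type:File")
print(doc["id"])          # CMIP6!dataset.file.nc|esgf-node.llnl.gov
print(params["_route_"])  # CMIP6!
```

A within-project search then touches a single shard, without the client or the search API needing any shard-routing logic of their own.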
Making any of the above, large-scale, changes would be difficult though. Whereas your suggestions are more realistic.
@nathanlcarlson
Thanks for your response.
In my proposed solution, CoG does not really need to be aware of the implementation, beyond the fact that it should not use distrib=false. (By leaving it out, esg-search should then default to distrib=true in cases where it cannot identify a particular replica to query, either because it doesn't find a match or because it is a version of esg-search that lacks this functionality -- see above.) In some sense, this change means removing an implementation-aware issue from the existing setup.
The one addition which would need to be made on the CoG side is to add the index_node parameter to the query. Whether this counts as "implementation aware" is debatable, but unless it is specified there is no way for esg-search to know which replica to query, so without it the optimisation of only searching the relevant shard is not possible, although the query would still work. (In principle, esg-search could do a distributed search on the dataset to find out the index node before querying the individual shard for files, but in the CoG situation it has already done the dataset search, so it would be a pity not to utilise that by passing in the result via the index_node parameter.) The client does not need to know the actual URL of the replica, so to that extent it would still be implementation unaware.
@nathanlcarlson I have updated my long comment above, as regards the hostname of the master shard. See "where the hostname of the master would be determined..." and the following bulleted list.
Hi @alaniwi, we opted to do the following: https://github.com/EarthSystemCoG/COG/pull/1420 It really isn't a great fix, but it resolves the issue of sending queries to remote search APIs.
Certainly, there could be a parameter that clients specify to make the query more efficient by not requiring a search across the entire collection (the index_node parameter proposed by yourself). From the client's perspective, such a parameter is really no different from the other key-value pairs clients specify to reduce the result set of the search. I agree that routing queries by index_node would work well with our current search API and Solr configuration/architecture.
My hypothesis is that doing a distrib=true search for File queries would add negligible load to an index server: the shards should very quickly determine that the files aren't present and return 0 results. Moving to SolrCloud would make this unnecessary.
When "list files" is clicked, it seems that the value of index_node from the dataset metadata is checked, and this is used to send a query to that index, with the following format:

https://....../esg-search/search?type=File&dataset_id=......&format=application%2Fsolr%2Bjson&offset=0&limit=10&distrib=false

If a dataset record originates from a remote Solr shard, then this can fail if the remote index is down, even though a local replica for that shard may be available. That is to say, the local replica contains both the datasets and files cores, but CoG does not attempt to utilise the locally held files info.

How about changing it so that, in the event of an unsuccessful response from the remote index (e.g. a 500 or a timeout), it falls back to trying the same search on the local index node? This fallback search would need to omit distrib=false, making it more expensive because other unrelated shards are also queried, which is why I am suggesting it only as a fallback.
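The fallback described above might look roughly like the sketch below. This is hypothetical, not actual CoG code: the function names, the example hostnames, and the injected get callable (which would wrap whatever HTTP client CoG uses, raising on a non-2xx response or timeout) are all assumptions.

```python
from urllib.parse import urlencode

# Hypothetical sketch of the proposed remote-then-local fallback.

def file_query_url(index_node, dataset_id, distrib):
    """Build an esg-search File query URL like the one shown above."""
    params = {"type": "File", "dataset_id": dataset_id,
              "format": "application/solr+json",
              "offset": 0, "limit": 10,
              "distrib": "true" if distrib else "false"}
    return f"https://{index_node}/esg-search/search?{urlencode(params)}"

def search_files(remote_node, local_node, dataset_id, get):
    """Try the remote index with distrib=false; on failure (e.g. a 500
    or a timeout), fall back to a distributed search on the local node."""
    try:
        return get(file_query_url(remote_node, dataset_id, distrib=False))
    except Exception:
        # The fallback omits distrib=false, so all local shards (the
        # local slave plus any replicas) are queried -- more expensive,
        # hence fallback-only.
        return get(file_query_url(local_node, dataset_id, distrib=True))
```

The cheap distrib=false path stays the common case; the expensive distributed search only runs when the originating index is unreachable.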