alaniwi opened this issue 4 years ago (status: Open)
@nathanlcarlson is working on a change for these File queries to go to the originating esg-search with distrib=false.
@alaniwi Yea, it also seems this same pattern is used for wget script retrieval
As I understand it, there should not be a need for CoG to ever query a remote esg-search API.
Hi, the reason why I was suggesting it only as a fallback is as follows.

If the remote index is used, then it can use distrib=false in order to avoid searching any other shards, because it is known in that case that the "local slave" (port 8983) on that remote node will contain the relevant records. On the other hand, if the local node is used, then there is no way for CoG or esg-search to know which Solr port contains the replica of the node in question (in our case, for example, the mapping will not even appear in esgf_shards.config, because the replicas are hosted locally on another index node), so distrib=true will be needed, placing a load on all the local shards (the local slave and all the replicas). This is already the case for datasets queries, but I would guess that files queries place a bigger load because of the number of files compared to datasets.
Here is a possible alternative suggestion that could be used to mitigate this if you prefer never to query a remote search API.

In the local esgf_shards_static.xml file, add an optional way to declare to esg-search which master node a local shard is replicating. For example, where it currently has:

<value>localhost:8985/solr</value>

(or in our case something like <value>hostname:8985/solr</value>, where hostname is the node that hosts the replicas), it could be changed to something like:

<value master="esgf-node.llnl.gov">localhost:8985/solr</value>
Then esg-search/search could be extended so that if it sees the index_node parameter in a query, it does the following:

- If index_node in the query matches the hostname of the master (determined as described below) for any shard it knows about, send the Solr query directly to that shard without the shards Solr parameter (i.e. implying distrib=false).
- Otherwise, send the query to the local slave (localhost:8983), conditionally including the shards Solr parameter as determined by the distrib= setting.

Here the hostname of the master would be determined as follows (in practice the following rules would be used to build a mapping dictionary or similar during service initialisation, which is subsequently used for any lookups):
- If the value element in the XML file has a master property, use the value of master.
- Otherwise, if the value itself has a value of the form <host>/solr or <host>:80/solr or <host>:8983/solr, use that hostname, except that if the hostname is localhost then it is replaced by the FQDN. (This rule will match the local index plus any remotes that are queried directly rather than replicated locally.)

Then CoG could be changed to add the index_node parameter based on the value found in the dataset document, in addition to the other changes discussed above, for example:
https://localhost/esg-search/search?type=File&index_node=esgf-node.llnl.gov&dataset_id=.......&format=application%2Fsolr%2Bjson&offset=0&limit=10&distrib=true
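To make the lookup rules above concrete, here is a minimal sketch in Python of how the mapping dictionary might be built at service start-up. This is purely illustrative: the function names, the sample XML, and the hostnames other than esgf-node.llnl.gov are all made up, and the real esg-search is Java rather than Python.

```python
import socket
import xml.etree.ElementTree as ET

# Hypothetical sketch of the proposed lookup rules, not actual esg-search code.
SHARDS_XML = """
<shards>
  <value master="esgf-node.llnl.gov">localhost:8985/solr</value>
  <value>localhost:8983/solr</value>
  <value>esgf-index.example.org/solr</value>
</shards>
"""

def master_hostname(elem):
    """Apply the proposed rules to one <value> element."""
    # Rule 1: an explicit master attribute wins.
    if elem.get("master"):
        return elem.get("master")
    # Rule 2: match <host>/solr, <host>:80/solr or <host>:8983/solr.
    addr = elem.text.strip()
    if not addr.endswith("/solr"):
        return None
    host, _, port = addr[:-len("/solr")].partition(":")
    if port in ("", "80", "8983"):
        # localhost is replaced by this machine's FQDN.
        return socket.getfqdn() if host == "localhost" else host
    return None

def build_master_map(xml_text):
    """Build the master-hostname -> shard mapping during initialisation."""
    mapping = {}
    for elem in ET.fromstring(xml_text).iter("value"):
        master = master_hostname(elem)
        if master is not None:
            mapping[master] = elem.text.strip()
    return mapping

mapping = build_master_map(SHARDS_XML)
# A query carrying index_node=esgf-node.llnl.gov would be routed, with
# distrib=false implied, straight to the shard declared via master=...:
print(mapping.get("esgf-node.llnl.gov"))  # localhost:8985/solr
```

Anything with no explicit master and a non-matching port (e.g. a replica on 8985 without the master attribute) simply gets no entry, so queries for it fall through to the default distributed behaviour.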
If there is a version of esg-search that contains this feature, then based on the above configuration, this would cause esg-search to query localhost:8985/solr, and the distrib=true would be ignored. If not (whether because no match is found, or because it is an older version of esg-search), then it would do a distributed search.
What do you think?
(If you think it would work, then maybe the thing to do would be to use this issue for the CoG changes but paste the esg-search related changes into a separate issue at https://github.com/ESGF/esg-search/issues .)
Hello @alaniwi ,
I agree with the need to improve this situation.
I believe any clients (CoG in this case) should be implementation unaware. Clients should be able to submit queries to any search endpoint and get the same results, for the same query. Clients should not need to play a role in the routing to shards because clients shouldn't know shards exist.
I believe we are sharding on the wrong dimension, and perhaps we shouldn't be sharding at all. We should be using the project dimension: either a) shard by project, or b) give each project its own Solr collection. Many searches are contained within a single project, yet under the current sharding system they are required to be spread across index nodes, so why shard on index node? Option "b" provides the additional benefit of allowing each project to have a more customized schema.
Query and Document routing across shards can be handled within Solr, if SolrCloud is used. This would remove the need to add a lot of logic to our search API. See here: https://lucene.apache.org/solr/guide/8_1/shards-and-indexing-data-in-solrcloud.html#document-routing
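To illustrate the SolrCloud routing idea: with SolrCloud's default compositeId router, a route key can be encoded into the document id with a "!" separator, and queries can then be restricted to the shard(s) owning that key via the _route_ request parameter. The sketch below only shows how such ids and parameters would be constructed; the project value and ESGF-style id are made up, and this is not a claim about how ESGF currently indexes documents.

```python
# Sketch of SolrCloud compositeId routing with made-up ESGF-like ids.
# With the compositeId router, everything before "!" is the route key,
# so all documents sharing one project hash to the same shard.

def routed_id(project, doc_id):
    """Prefix the document id with the project route key."""
    return f"{project}!{doc_id}"

def routed_query_params(project, query):
    """_route_ tells SolrCloud to query only the shard(s) owning that key."""
    return {"q": query, "_route_": f"{project}!"}

doc = {"id": routed_id("CMIP6", "dataset.file.nc|esgf-node.llnl.gov"),
       "project": "CMIP6"}
params = routed_query_params("CMIP6", "type:File")
print(doc["id"])          # CMIP6!dataset.file.nc|esgf-node.llnl.gov
print(params["_route_"])  # CMIP6!
```

A within-project search then touches a single shard, without the client or the search API needing any shard-routing logic of their own.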
Making any of the above, large-scale, changes would be difficult though. Whereas your suggestions are more realistic.
@nathanlcarlson
Thanks for your response.
In my proposed solution, CoG does not really need to be aware of the implementation, beyond the fact that it should not use distrib=false. (By leaving it out, esg-search should then default to distrib=true in cases where it cannot identify a particular replica to query, either because it doesn't find a match or because it is a version of esg-search that lacks this functionality -- see above.) In some sense, this change means removing an implementation-aware issue from the existing setup.
The one addition which would need to be made on the CoG side is to add the index_node parameter to the query. Whether this counts as "implementation aware" is debatable, but unless it is specified there is no way for esg-search to know which replica to query, so without it the optimisation of only searching the relevant shard is not possible, although the query would still work. (In principle, esg-search could do a distributed search on the dataset to find out the index node before querying the individual shard for files, but in the CoG situation it has already done the dataset search, so it would be a pity not to utilise that by passing in the result via the index_node parameter.) The client does not need to know the actual URL of the replica, so to that extent it would still be implementation unaware.
@nathanlcarlson I have updated my long comment above, as regards the hostname of the master shard. See "where the hostname of the master would be determined..." and the following bulleted list.
Hi @alaniwi, we opted to do the following: https://github.com/EarthSystemCoG/COG/pull/1420 It really isn't a great fix, but it resolves the issue of sending queries to remote search APIs.
Certainly, there could be a parameter that clients specify to make the query more efficient by not requiring a search across the entire collection (the index_node parameter proposed by yourself). From the client's perspective, such a parameter is really no different from the other key-value pairs clients specify to reduce the result set of the search. I agree that routing queries by index_node would work well with our current search API and Solr configuration/architecture.
My hypothesis is that doing a distrib=true search for File queries would add negligible load to an index server: the shards should very quickly determine that the files aren't present and return 0 results. Moving to SolrCloud would make this unnecessary.
When "list files" is clicked, it seems that the value of index_node from the dataset metadata is checked, and this is used to send a query to that index, with the following format:

https://....../esg-search/search?type=File&dataset_id=......&format=application%2Fsolr%2Bjson&offset=0&limit=10&distrib=false

If a dataset record originates from a remote Solr shard, then this can fail if the remote index is down, even though a local replica for that shard may be available. That is to say, the local replica contains both the datasets and files cores, but CoG does not attempt to utilise the locally held files info.

How about changing it so that, in the event of an unsuccessful response from the remote index (e.g. a 500 or a timeout), it falls back to trying the same search on the local index node? This fallback search would need to omit distrib=false, making it more expensive because other unrelated shards are also queried, which is why I am suggesting it only as a fallback.
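The fallback described above might look roughly like the sketch below. This is hypothetical, not actual CoG code: the function names, the example hostnames, and the injected get callable (which would wrap whatever HTTP client CoG uses, raising on a non-2xx response or timeout) are all assumptions.

```python
from urllib.parse import urlencode

# Hypothetical sketch of the proposed remote-then-local fallback.

def file_query_url(index_node, dataset_id, distrib):
    """Build an esg-search File query URL like the one shown above."""
    params = {"type": "File", "dataset_id": dataset_id,
              "format": "application/solr+json",
              "offset": 0, "limit": 10,
              "distrib": "true" if distrib else "false"}
    return f"https://{index_node}/esg-search/search?{urlencode(params)}"

def search_files(remote_node, local_node, dataset_id, get):
    """Try the remote index with distrib=false; on failure (e.g. a 500
    or a timeout), fall back to a distributed search on the local node."""
    try:
        return get(file_query_url(remote_node, dataset_id, distrib=False))
    except Exception:
        # The fallback omits distrib=false, so all local shards (the
        # local slave plus any replicas) are queried -- more expensive,
        # hence fallback-only.
        return get(file_query_url(local_node, dataset_id, distrib=True))
```

The cheap distrib=false path stays the common case; the expensive distributed search only runs when the originating index is unreachable.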