CS-SI / eodag

Earth Observation Data Access Gateway
https://eodag.readthedocs.io
Apache License 2.0

Search for products among multiple providers #163

Open sbrunato opened 3 years ago

sbrunato commented 3 years ago

Original request made by @geonux :

To be able to search for all available data on all providers over a given AOI

We decided that the best way to approach this was to add a providers list parameter to the search method. A list of all the providers can be retrieved with dag.available_providers(). But a user could also provide a subset of the available providers:

dag.search(args, providers=["theia", "sobloo"])

Note that there is already a provider kwarg that the user can pass to search, which is used by _search_by_id (for performance reasons, when the user already knows that a given provider has the product available).

Note that the search command doesn't accept any --provider option.
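
As a rough illustration of the providers parameter proposed above (a sketch, not an existing eodag API), the same search could simply be repeated on each provider in turn. search_on_providers is a hypothetical helper name; it only assumes that the gateway object exposes set_preferred_provider() and search_all(), and it returns a plain list rather than a SearchResult.

```python
from typing import Any, Iterable, List


def search_on_providers(dag: Any, providers: Iterable[str], **criteria: Any) -> List[Any]:
    """Hypothetical helper: run the same search on each provider in turn.

    Assumes `dag` exposes set_preferred_provider() and search_all().
    """
    found: List[Any] = []
    for provider in providers:
        # Side effect: this changes the gateway's preferred provider.
        dag.set_preferred_provider(provider=provider)
        try:
            results = dag.search_all(**criteria)
        except Exception as exc:
            # A failing provider should not abort the whole search.
            print(f"Search failed on '{provider}': {exc}")
            continue
        found.extend(results)
    return found
```

Results are appended in the order of the providers list, so a later de-duplication step can still tell which provider offered a product first.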

For retrieving all the product types, we have to deal with the already used productType parameter. Options would be:

_Note here that dag.list_product_types(provider) could come in handy if productType accepts a list of product types._

TODOs before working on a MR:

maximlt commented 3 years ago

Here is a snippet that searches for all the products available on all the providers in a given area of interest (around Toulouse here) during early August 2020:

from eodag import EODataAccessGateway
from eodag.api.search_result import SearchResult
from eodag.utils.logging import setup_logging
setup_logging(verbose=1)
dag = EODataAccessGateway()
search_criteria = dict(
    start='2020-08-01',
    end='2020-08-10',
    geom=[0, 43, 2, 45],
)
all_prods = SearchResult([])
# Loop over ALL the providers
for provider in dag.available_providers():
    # Set it as the preferred one
    dag.set_preferred_provider(provider=provider)
    # Get the product type IDs (e.g. S2_MSI_L1C) available for this provider
    product_types = (
        p["ID"]
        for p in dag.list_product_types(provider=provider)
        if p["ID"] != "GENERIC_PRODUCT_TYPE"
    )
    # And loop over them and search all the products available
    for product_type in product_types:
        try:
            results = dag.search_all(productType=product_type, **search_criteria)
        except Exception:
            print(f"Failed to collect '{product_type}' products with '{provider}'")
            results = []
        print(f"Got {len(results)} '{product_type}' products with '{provider}'")
        all_prods.extend(results)
print(f"Got a total of {len(all_prods)} products.")

(I got 1090 products)

maximlt commented 3 years ago

@geonux since you're at the origin of this issue, I would like to ask you a few questions about it if I may.

To be able to search for all available data on all providers over a given AOI

I think it can be interpreted in two different ways:

  1. To get all the products from all the providers
  2. To get all the unique products from all the providers

Indeed, there will almost certainly be duplicate products (both provider A and provider B offer the same product i) when searching over the same AOI and time period.

If you are interested in 1., the snippet above should get you what you want. Would it be enough to document it?

If you are interested in 2., this is trickier. We would need to remove duplicate products. We could rely on the product unique identifier, however, as shown in https://github.com/CS-SI/eodag/issues/136#issuecomment-808082569, we can't always make sure that different providers use the same id (surprisingly!). So there may still be some duplicates after an id filter. We could also rely on a combination of properties, and remove duplicates if 2 or more products share the same combination of, for instance, product_type / geometry / start date / end date.

Removing duplicates based on the id can be done as follows:

almost_unique_prods = SearchResult({p.properties["id"]: p for p in all_prods[::-1]}.values())

An attempt to remove the remaining potential duplicates from almost_unique_prods could look as follows:

unique_prods = SearchResult({
    (p.properties["startTimeFromAscendingNode"], p.geometry.wkt, p.product_type): p
    for p in almost_unique_prods[::-1]
}.values())

If we implement 2., internally we could add __eq__ (and __hash__?) to the EOProduct class, to specify how we define whether two products are the same or not.
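
To make that idea concrete, here is a minimal sketch of what such equality could look like. Product is a hypothetical stand-in, not the real EOProduct class, and the identity tuple (start time, footprint WKT, product type) is just the combination used in the dict key above, i.e. an assumption rather than an agreed definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(eq=False)  # eq=False: we define __eq__ and __hash__ ourselves
class Product:
    """Hypothetical stand-in for EOProduct, holding only what the comparison needs."""

    provider: str
    product_type: str
    geometry_wkt: str
    properties: Dict[str, Any] = field(default_factory=dict)

    def _identity(self) -> Tuple[Any, str, str]:
        # Assumed definition of "same product": start time, footprint, product type.
        return (
            self.properties.get("startTimeFromAscendingNode"),
            self.geometry_wkt,
            self.product_type,
        )

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Product):
            return NotImplemented
        return self._identity() == other._identity()

    def __hash__(self) -> int:
        # Must be consistent with __eq__: equal products hash equally.
        return hash(self._identity())
```

With both methods defined, list(dict.fromkeys(products)) would drop duplicates while keeping the first occurrence of each product, since a dict keeps the key object that was inserted first.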

Note the reverse order on all_prods and almost_unique_prods in the dict comprehensions above. Its aim is to ensure that, if there are duplicate products, the one that ends up in unique_prods is the one obtained from the first provider (the first one that offered this product). If we implement 2., we should ensure this priority is preserved, e.g. search_all(geom=..., start=..., end=..., providers=["peps", "sobloo"]) should return products from peps in priority.
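
The same first-provider-wins rule can also be sketched without reversing the list, which may be easier to read. dedupe_keep_first is a hypothetical helper (not part of eodag) that works on any iterable of products and any key function:

```python
from typing import Any, Callable, Dict, Hashable, Iterable, List


def dedupe_keep_first(products: Iterable[Any], key: Callable[[Any], Hashable]) -> List[Any]:
    """Hypothetical helper: keep the first product seen for each key."""
    first_seen: Dict[Hashable, Any] = {}
    for product in products:
        # setdefault only stores the product if the key is new,
        # so products from earlier providers take priority.
        first_seen.setdefault(key(product), product)
    return list(first_seen.values())
```

For instance, dedupe_keep_first(all_prods, key=lambda p: p.properties["id"]) should keep the same products as the id-based dict comprehension above, returned in their original order as a plain list.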

If we decide to implement 1. or 2., where should we add this?

sbrunato commented 3 years ago

Having differently formatted ids for the same products, depending on the provider, is something that must be fixed.

But this is also related to the fact that some providers are not (yet) configured to return Sentinel products in SAFE format. See #216 and #171