NextGeoss / ckanext-nextgeossharvest

Harvesters for NextGEOSS
GNU Affero General Public License v3.0

[Sentinel] Missing Sentinel-3 collections cause harvester errors #65

Closed kmbn closed 6 years ago

kmbn commented 6 years ago

The Sentinel harvester crashes while attempting to harvest certain Sentinel-3 products. The harvester cannot parse the identifier to produce a collection_id and, since collection_id is a required field, it crashes. It seems like Scihub added the new product types recently, as this error did not occur in the past. This issue affects production and needs to be addressed as soon as possible.

Below are examples of some of the products that the harvester cannot parse. It seems like there are three new types, but there could be more.

S3A_OL_2_LFR____20180716T175858_20180716T180158_20180716T194327_0180_033_269_3420_SVL_O_NR_002
S3A_OL_1_EFR____20180716T173758_20180716T174058_20180716T193329_0180_033_269_2160_SVL
S3A_SL_2_LST____20180716T174406_20180716T174706_20180716T194054_0179_033_269_2520_SVL
joaandrade commented 6 years ago

@kmbn thank you for the report. I will check this issue and try to include the new S3 collections. Can you please tell me which branch is the up-to-date one for the Sentinel harvester? I don't know, since I was not the last one to work on this harvester. Probably the master branch?

kmbn commented 6 years ago

The master branch has the latest versions of the harvesters (as long as they’ve been merged). You can make a new branch from master to work on this issue, and then make a PR back to master when it’s resolved. If there are still any Sentinel-related branches in the repo, they’re outdated and can be deleted.

joaandrade commented 6 years ago

@michellecbox this issue is not solved yet. It is the only one still open.

JGulic commented 6 years ago

@joaandrade can we make this a priority?

I've been testing the harvesters, and it seems that this error is also causing the Sentinel harvester to get stuck at some point so that it never finishes.

Take a look at the screenshot from the logs: https://screenshots.firefox.com/MxG48GZws77hlkx7/147.228.242.207

The harvester job that shows one error and a finish date failed with a timeout error, but all the other jobs, where you can't see a finish date, are throwing this error. We are planning a new release for Monday, and it would be great if we could include this fix as well.

joaandrade commented 6 years ago

@JGulic according to @kmbn, the problem is in the parser for these new S3 collections. To solve the issue ASAP, I will just wrap the code in a try/except. In the future, the parser should be updated to handle the new collections. Do you agree with this approach?

If yes, can you please tell me the file and lines where the error happens? I imagine the problem is in the esa_base file.

JGulic commented 6 years ago

@joaandrade yes, a workaround would be great for now, since we need to be sure that the harvesters on production are running successfully. I can't tell you exactly where the error happens, since I only ran the harvesters and didn't debug them. You will need to check the ESA harvester files. Try starting the Sentinel harvester locally and see if you can catch the error.

joaandrade commented 6 years ago

@JGulic, @kmbn can you send me the configuration used on your instance, with a start date for a day that has problematic datasets? I've been running the harvest for hours and haven't found any problematic datasets.

kmbn commented 6 years ago

The problem is that SentinelHarvester._add_collection() cannot parse certain identifiers. Here are three identifiers that it cannot parse (they're the ones I posted above):

S3A_OL_2_LFR____20180716T175858_20180716T180158_20180716T194327_0180_033_269_3420_SVL_O_NR_002

S3A_OL_1_EFR____20180716T173758_20180716T174058_20180716T193329_0180_033_269_2160_SVL

S3A_SL_2_LST____20180716T174406_20180716T174706_20180716T194054_0179_033_269_2520_SVL

The three identifiers above are just representative examples—any identifiers that follow the same pattern will also cause an error.
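The pattern they share can be illustrated with a small parsing sketch. The field positions below follow the public Sentinel-3 naming convention and are an illustration, not the harvester's actual parser:

```python
# Sketch: extract the fields a collection_id would be derived from.
# Slice positions are assumptions based on the Sentinel-3 naming
# convention, not on the code in esa_base.py.

def parse_s3_identifier(identifier):
    """Split a Sentinel-3 identifier into the fields relevant for a collection."""
    return {
        "mission": identifier[:3],                     # e.g. "S3A"
        "instrument": identifier[4:6],                 # e.g. "OL" (OLCI), "SL" (SLSTR)
        "level": identifier[7],                        # processing level, e.g. "1" or "2"
        "product_type": identifier[9:15].strip("_"),   # e.g. "LFR", "EFR", "LST"
    }

fields = parse_s3_identifier(
    "S3A_OL_2_LFR____20180716T175858_20180716T180158_"
    "20180716T194327_0180_033_269_3420_SVL_O_NR_002")
print(fields)
```

A parser that fails on any of the three identifiers above would fail on every product of the same type, which is why the error keeps recurring.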

You don't need more than these three identifiers to verify the issue and create a fix. You also don't have to run the harvester itself.

Below is an interactive Python session showing how to reproduce the error. Before showing it, I'll explain how to set it up so you can follow the same steps on your own machine.

I activated the virtual environment where CKAN, etc. is installed. I ran Python from the ckanext-nextgeossharvest/ckanext/nextgeossharvest/lib directory. You can run it from a different directory, but you'll need to change the import statement if you do.

I imported the SentinelHarvester and created an instance of it.

Then I created a dictionary representing an item to pass to the _add_collection() method. The item contains the first of the three example identifiers that cannot be parsed. When _add_collection() is called, an error is thrown. In this case, the error is AttributeError: 'SentinelHarvester' object has no attribute 'obj' at line 85 of the esa_base.py module. If you look at the code, you'll see that self._save_object_error(message, self.obj, "Import") is only called if the method is unable to parse the identifier.

Then I created a second item dictionary, this time with an identifier that can be parsed (the related product has already been harvested). When _add_collection() is called, the collection is added to the item dictionary.

(venv) me (master) /Users/me/projects/ckan/venv/src/ckanext-nextgeossharvest/ckanext/nextgeossharvest/lib $ python
Python 2.7.15 (default, Jun 17 2018, 12:46:58) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from esa_base import SentinelHarvester
>>> harvester = SentinelHarvester()
>>> item_collection_missing = {"identifier": "S3A_OL_2_LFR____20180716T175858_20180716T180158_20180716T194327_0180_033_269_3420_SVL_O_NR_002"}
>>> harvester._add_collection(item_collection_missing)
No handlers could be found for logger "esa_base"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "esa_base.py", line 85, in _add_collection
    self._save_object_error(message, self.obj, "Import")
AttributeError: 'SentinelHarvester' object has no attribute 'obj'
>>> item_collection_ok = {"identifier": "S3A_SR_2_LAN____20180925T101156_20180925T102155_20180925T122712_0599_036_122______SVL_O_NR_003"}
>>> harvester._add_collection(item_collection_ok)
{'collection_id': 'SENTINEL3_SRAL_L2_LAN', 'collection_description': 'SENTINEL-3 is the first Earth Observation Altimetry mission to provide 100% SAR altimetry coverage where LRM is maintained as a back-up operating mode. This is a product of Level 2 processing and geographical coverage over land.', 'identifier': 'S3A_SR_2_LAN____20180925T101156_20180925T102155_20180925T122712_0599_036_122______SVL_O_NR_003', 'collection_name': 'Sentinel-3 SRAL Level-2 Land'}
>>> 

You can write unit tests to ensure that all the Sentinel identifiers provided by Scihub, etc. can be parsed. There's no need to actually harvest products to do so.
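Such a test could simply check every supported identifier against the known collection patterns. A minimal self-contained sketch (the patterns below are illustrative assumptions, not the harvester's actual lookup table):

```python
import re

# Sketch of a unit test for identifier parsing: every identifier the hub
# should support must match at least one known collection pattern.
# These patterns are illustrative, not the harvester's real table.
KNOWN_PATTERNS = [
    r"^S3[AB]_OL_[12]_[A-Z]{3}_+",   # Sentinel-3 OLCI products
    r"^S3[AB]_SL_[12]_[A-Z]{3}_+",   # Sentinel-3 SLSTR products
    r"^S3[AB]_SR_[12]_[A-Z]{3}_+",   # Sentinel-3 SRAL products
]

IDENTIFIERS = [
    "S3A_OL_2_LFR____20180716T175858_20180716T180158_20180716T194327_0180_033_269_3420_SVL_O_NR_002",
    "S3A_OL_1_EFR____20180716T173758_20180716T174058_20180716T193329_0180_033_269_2160_SVL",
    "S3A_SL_2_LST____20180716T174406_20180716T174706_20180716T194054_0179_033_269_2520_SVL",
]

def matches_known_collection(identifier):
    return any(re.match(pattern, identifier) for pattern in KNOWN_PATTERNS)

for identifier in IDENTIFIERS:
    assert matches_known_collection(identifier), identifier
```

Running this against a dump of identifiers from Scihub or CODE-DE would catch new product types before they reach production.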

joaandrade commented 6 years ago

Thank you @kmbn. With this, I think we can implement a workaround in the next few days. We will not add an extra collection for now; we will just try to catch the exception.

joaandrade commented 6 years ago

@kmbn, @JGulic

Apparently the harvester was crashing because of the following line of code that was, in fact, not necessary: self._save_object_error(message, self.obj, "Import")

If the identifier does not correspond to any collection, only a log.warning should be emitted. Can you please test with the updated code (commit 0e4fd11)?
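The intended behaviour amounts to something like the following sketch. The lookup table and function here are illustrative stand-ins; the real method lives in esa_base.py and differs in detail:

```python
import logging

log = logging.getLogger(__name__)

# Illustrative stand-in for the harvester's prefix -> collection_id lookup.
KNOWN_COLLECTIONS = {
    "S3A_SR_2_LAN": "SENTINEL3_SRAL_L2_LAN",
}

def add_collection(item):
    """Attach a collection_id if the identifier is recognised.

    Sketch of the workaround: instead of calling self._save_object_error()
    (which crashed because self.obj did not exist at that point), an
    unknown identifier is only logged and the item is returned unchanged.
    """
    collection_id = KNOWN_COLLECTIONS.get(item["identifier"][:12])
    if collection_id is None:
        log.warning("No collection found for identifier %s", item["identifier"])
        return item
    item["collection_id"] = collection_id
    return item
```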

A pull request was created for this issue.

Regards, João

kmbn commented 6 years ago

The PR does not solve the problem. The problem is that the harvester cannot parse certain identifiers in order to create collection metadata. If there's no collection metadata, there will be errors.

The example that I posted above was just to show you where the heart of the problem is as quickly as possible (i.e., no need to run a harvester or mock up a lot of metadata). The solution I hoped for (but didn't specifically ask for) was that the _add_collection() method would be updated to create collection metadata for the new Sentinel-3 products.

Look at lines 213–215 of the original file (or lines 212–214 of the modified file). _add_collection() adds the collection metadata to the item, and this metadata includes collection_name. Then item["title"] is created using item["collection_name"]. When the harvester tries to import one of the products that has no collection, even if it no longer throws an error when _add_collection() is called, it will throw an error here.
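In other words, silently skipping the identifier only moves the failure downstream. An illustrative snippet (not the harvester's actual code) of that later failure:

```python
# If _add_collection() silently skips an unparseable identifier, the item
# reaches the title-building step without collection metadata, and the
# lookup raises a KeyError.
item = {"identifier": "S3A_OL_2_LFR____example"}  # no collection metadata added

try:
    title = "{} ({})".format(item["collection_name"], item["identifier"])
except KeyError as missing:
    print("import stage fails: missing key {}".format(missing))
```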

This is where the error actually happens when the harvester is running.

The line that the pull request deletes just records a harvest error message that includes the identifier of the product that caused the error, and that message appears in the job reports. It lets us see which product identifier caused the problem. It is not the cause of the problem.

The harvester (and the rest of the data hub, as well as the OpenSearch interface) is designed with the assumption that every product will have collection metadata. Attempting to create datasets without collection metadata is only going to cause trouble, either now or in the future.

There are two acceptable solutions: either don't harvest products that don't have collections (update the harvester so that it doesn't gather products that don't fit into the collections you want to create, or abort the import stage whenever one of those products appears in the import queue), or harvest them and create valid collections for them. We would prefer a solution that doesn't lead to any errors at all.

If you don't want to create collections for these products, the cleanest solution is probably filtering them out during the gather stage so that they never make it to the import stage.
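A gather-stage filter along those lines could look like this sketch, where `can_create_collection` is a hypothetical helper standing in for the harvester's real parsing logic:

```python
# Sketch of a gather-stage filter that keeps unsupported products out of
# the import queue. can_create_collection() is a hypothetical helper, not
# a function that exists in the harvester.

def can_create_collection(identifier, known_prefixes):
    """True if the identifier maps to a collection we know how to build."""
    return any(identifier.startswith(prefix) for prefix in known_prefixes)

def filter_gathered(identifiers, known_prefixes):
    """Drop products we cannot build collection metadata for, logging each skip."""
    kept = []
    for identifier in identifiers:
        if can_create_collection(identifier, known_prefixes):
            kept.append(identifier)
        else:
            print("skipping unsupported product: {}".format(identifier))
    return kept
```

Filtering at gather time means the import stage only ever sees products with valid collection metadata, so no error handling is needed there at all.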

kmbn commented 6 years ago

There are at least 317,952 products on CODE-DE that the harvester can't create collections for. I'm not sure how many there are on the other DHuS instances, or whether there are more on CODE-DE that we haven't encountered yet.

Consider this query: https://code-de.org/dhus/search?q=S3A_SL%20OR%20S3A_OL. All of those products will cause the harvester to crash.

joaandrade commented 6 years ago

@kmbn all unknown collections were added. Now the harvester works properly for all the Sentinel-3 collections. Collections added (collection_id):

SENTINEL3_OLCI_L1_EFR
SENTINEL3_OLCI_L1_ERR
SENTINEL3_OLCI_L2_LFR
SENTINEL3_OLCI_L2_LRR
SENTINEL3_SLSTR_L1_RBT
SENTINEL3_SLSTR_L2_LST

The harvester now runs properly when it finds products from these new collections. For now, these collections will only be found on CODE-DE, since it collects datasets from the SciHub pre-operational environment. However, in the future, our harvester will be prepared to deal with these collections.
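The prefix-to-collection mapping implied by that list would look roughly like this. This is a sketch; the real table lives in the harvester code and may differ in structure:

```python
# Sketch of the mapping implied by the list of added collections, keyed on
# the instrument/level/type fields so it is mission-agnostic (S3A or S3B).
# The harvester's actual data structure may differ.
S3_COLLECTIONS = {
    "OL_1_EFR": "SENTINEL3_OLCI_L1_EFR",
    "OL_1_ERR": "SENTINEL3_OLCI_L1_ERR",
    "OL_2_LFR": "SENTINEL3_OLCI_L2_LFR",
    "OL_2_LRR": "SENTINEL3_OLCI_L2_LRR",
    "SL_1_RBT": "SENTINEL3_SLSTR_L1_RBT",
    "SL_2_LST": "SENTINEL3_SLSTR_L2_LST",
}

def collection_for(identifier):
    """Map a Sentinel-3 identifier to its collection_id (None if unknown)."""
    return S3_COLLECTIONS.get(identifier[4:12])
```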

michellecbox commented 6 years ago

This was merged to master.