Reading-eScience-Centre / ncwms

ncWMS - A Web Map Service for displaying environmental data over the web
Other
62 stars 30 forks source link

handling of dynamicDatasetCache is not thread safe #51

Open MatsSundqvist opened 4 years ago

MatsSundqvist commented 4 years ago

It seems like the handling of dynamicDatasetCache needs to be made thread safe. Currently the dynamic datasets is not working in a stable way. Own test with readlock and writelock seems to work well

rod-glover commented 3 years ago

@MatsSundqvist , I believe I may be encountering problems with the same root. We have configured a large collection of large datasets as dynamic datasets. We get frequent, unpredictable and inconsistent errors from ncWMS when these datasets are accessed (both by our web mapping app and by some test code I created). The obvious suspicion is that the dynamic dataset code is not handling these parallel requests safely, i.e., is not thread-safe.

I am a complete Java noob and so is everyone else in our shop (we primarily use Python and JS). Since ncWMS dynamic datasets (working stably) would solve some major headaches we have, I would appreciate any advice you may have on resolving this difficulty. In particular, to what did you refer in your remark

Own test with readlock and writelock seems to work well

?

Any other thoughts or advice are also very welcome.

guygriffiths commented 3 years ago

Any further info you can give us (e.g. what sorts of errors you're getting, how frequently you're accessing various things, etc.) may be helpful.

rod-glover commented 3 years ago

Thank you for your quick response, Guy.

As I note below, I don't think my particular issue is a caching issue, although it might well be related to a general issue with thread safety, which is to say requests that demand multiple access to the same objects and/or files. If you'd like me to file a separate issue on this, I'd be happy to do so. Meantime I am responding here:

Regarding dynamic datasets:

Regarding non-dynamic datasets:

rod-glover commented 3 years ago

Regarding the Admin API, do you mean the REST API implemented in uk.ac.rdg.resc.edal.ncwms.NcwmsApiServlet? That service implements only a GET request, and we would need a POST in order to load configuration info via it. Am I missing something?

mtsales commented 3 years ago

I'm facing the same problem as @rod-glover . This has been reported long time ago in Edal https://github.com/Reading-eScience-Centre/edal-java/issues/106

I can read the datasets if they are defined in the ncWMS admin but it fails multiple times and randomly when using the datasets as dynamic datasets

The error is always related to reading the time coordinate 2020-10-16 07:26:59 ERROR GridCoordSys:352 - N/A: Error reading time coord= time java.lang.ArrayIndexOutOfBoundsException: Index 159200 out of bounds for length 8092

After this error occurs, I'm no longer able to use the dataset. It's always reports 2020-10-16 07:51:55 WARN WmsServlet:2641 - Wms Exception caught: "The layer local/NDVI_5600m_2019.nc/NDVI was not found on this server"

Please notice I have cache for dynamic datasets disabled but the same happens when enabling cache

MatsSundqvist commented 3 years ago

here is my version of NcwmsCatalogue. I can not guarantee that it's flawless, but I experienced that it worked quite well. The original version did not work at all when several requests for tiles came in. NcwmsCatalogue.zip

guygriffiths commented 3 years ago

@rod-glover The API is an admin API. It was written by a 3rd party and I don't think they documented it. It's implemented in uk.ac.rdg.resc.edal.ncwms.NcwmsAdminServlet and takes POST requests to paths such as .../admin/updateConfig, .../admin/addDataset.

Thanks for the other information. I'll try this with a bunch of data today and see if I can replicate anything. For the timestep problem, can you confirm whether making a request to /wms?request=GetMetadata&version=1.3.0&item=timesteps&day=yyyy-mm-dd&layerName=<whatever> is showing a timestamp which is subsequently invalid in a GetMap request.

For the LayerNotDefined issue, this is a case of ncWMS failing and swallowing the error. I'm going to do some more investigation today, but if I can't replicate anything, I'll create a build which prints the stack trace to the Tomcat logs so you can (hopefully) get some information about the real cause of the error.

guygriffiths commented 3 years ago

@mtsales - ~Can you provide a copy of the NDVI data which is failing, and the request string you're using? I've tried with some NDVI data I picked at random, but can't replicate the problem.~ Actually I think I can replicate this (or at least get a similar error) with a bunch of concurrent requests. Let me investigate further

mtsales commented 3 years ago

@guygriffiths do you need the full stack trace?

guygriffiths commented 3 years ago

@mtsales - No, I don't think I need anything for now, thanks. I can trigger similar errors by making enough concurrent requests, which suggests a threading issue. I'll see how far I can get with it and report back.

guygriffiths commented 3 years ago

I've implemented a potential fix and would be grateful if you could try out the version here: https://github.com/Reading-eScience-Centre/ncwms/releases/tag/1.5.1-SNAPSHOT-dynamic-dataset-fix

Let me know if this fixes the issues, and if not, what the Tomcat logs say when datasets fail.

mtsales commented 3 years ago

@guygriffiths this seems to be working now but I need further testing. I'll let you know beginning of next week

I have dynamic datasets cache disabled but still get lots of exceptions (one per GetMap request) related to cache in ncWMS.log

2020-10-16 13:07:22 ERROR DataCatalogue:495 - Problem adding features to cache java.lang.reflect.InaccessibleObjectException: Unable to make field private final java.lang.Object jdk.internal.loader.AbstractClassLoaderValue$Sub.key accessible: module java.base does not "opens jdk.internal.loader" to unnamed module @724945

guygriffiths commented 3 years ago

@mtsales That's a strange one, I've not seen it before. Are you using Java 11? This latest build is expecting Java 11, and most of the results I get from searching that error list it as a specific Java 9 issue.

mtsales commented 3 years ago

@guygriffiths I'm using JDK 11.0.8 (as the only installed java) and Tomcat 9.0.39. I keep getting this exception once per GetMap request although the image renders correctly.

Attached a ncWMS.log (check the time stamp of the exceptions - one per GetMap request) and my config.xml where dynamic datasets cache is disabled.

ncwms.log config.xml.txt

rod-glover commented 3 years ago

@guygriffiths I was off on Friday but will turn my attention to this today.

rod-glover commented 3 years ago

@guygriffiths , your update seems to have solved both error cases I encountered. I haven't worked it extremely hard yet, but test cases that fail for our current installation of 2.4.2 work on your snapshot. I will test it harder this afternoon and report back to you.

rod-glover commented 3 years ago

@guygriffiths , further testing indicates that the updated version works

I'd say this looks like a success. Thank you!

mtsales commented 3 years ago

@guygriffiths I can also confirm it solved the issues I was facing with dynamic datasets

guygriffiths commented 3 years ago

Good to hear it's working.

@mtsales - The error you're seeing is referring to the feature cache, which caches data for individual map tiles, meaning things like changing the colour scale very fast. I haven't been able to replicate it yet though.

rod-glover commented 3 years ago

@guygriffiths , a fairly rigorous testing programme for your fix shows no problems.

rod-glover commented 3 years ago

@guygriffiths , I do not want to hassle you, but I do have a question that may affect some deployment scheduling in my work group: Do you know when we might see an official release incorporating this fix?

guygriffiths commented 3 years ago

I don't have any plans for a release in the near future, since I'm reasonably busy with other projects. Are you able to use this release? A non-snapshot release would not be any different from this one, other than in name.

rod-glover commented 3 years ago

Yes, we can use the snapshot. Thank you.