HDFGroup / vol-rest

HDF5 REST VOL Connector

Internal Server Error on localhost MinIO #27

Closed ron-kuhn closed 7 months ago

ron-kuhn commented 1 year ago

I am running MinIO locally on my Windows 10 laptop. I am also running HSDS on the same laptop (1 SN and 4 DNs). I have a program that creates HDF5 files simulating images coming from an instrument. After about 2 minutes, things slow down and the program eventually throws an exception. When I look at the HSDS log file, the first error is an Internal Server Error (HTTP 500) for the MinIO port. This is very reproducible. Any ideas?

ron-kuhn commented 1 year ago

If I add sleeps to my code to slow it down a bit, then it works.

mattjala commented 1 year ago

This sounds like it might have been caused by issues with how the VOL was sending data to HSDS that were fixed in #26. If it persists in the current version of the VOL, could you share the HSDS log files?

ron-kuhn commented 1 year ago

I was testing the #26 fix when it occurred twice, but it hasn't happened since. I didn't save the hs.log, sorry. My test application simulates our instrument and sends a bunch of data and images to HSDS through the rest-vol. If it creates objects faster than the HSDS services can write them, I see my application slow down for a while. I'm guessing the cache is full and has to wait for objects to be written before continuing. That is when the error appeared to happen but, like I said, it hasn't happened since.

jreadey commented 1 year ago

@ron-kuhn, HSDS uses lazy writing - i.e. a PUT/POST request to the DN will return while another task looks for dirty objects and flushes them after they've been dirty for 10 seconds or more (this logic is there to avoid repeatedly writing an object to storage if it's being frequently updated).
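
To make the lazy-writing idea concrete, here is a minimal toy sketch in Python. It is not HSDS code: the class name `LazyWriteCache`, its methods, and the choice to measure dirty age from the first unflushed update are all assumptions made for illustration, and a plain dict stands in for object storage.

```python
import time


class LazyWriteCache:
    """Toy model of lazy writing (hypothetical, not the HSDS implementation).

    put() only marks an object dirty and returns; a separate flush pass
    writes objects that have been dirty for at least min_dirty_age seconds.
    """

    def __init__(self, store, min_dirty_age=10.0, clock=time.monotonic):
        self.store = store              # dict standing in for object storage
        self.min_dirty_age = min_dirty_age
        self.clock = clock              # injectable so the sketch can be tested
        self.values = {}                # latest in-memory value per object id
        self.dirty_since = {}           # object id -> time it first became dirty

    def put(self, obj_id, value):
        # Returns immediately; nothing is written to storage here.
        self.values[obj_id] = value
        # Assumption: age is measured from the first unflushed update.
        self.dirty_since.setdefault(obj_id, self.clock())

    def flush_once(self):
        # Write every object that has been dirty long enough; return the count.
        now = self.clock()
        ready = [k for k, t in self.dirty_since.items()
                 if now - t >= self.min_dirty_age]
        for obj_id in ready:
            self.store[obj_id] = self.values[obj_id]
            del self.dirty_since[obj_id]
        return len(ready)
```

In this sketch, several updates to the same object within the waiting window result in a single write of the latest value, which is the saving described above. The longer `dirty_since` grows, the more work each flush pass has to do.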

If there are too many objects in the dirty queue, that may delay when they get written out. If you scan through the HSDS logs looking for "s3sync" you might get some clues as to what's going on.
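
One way to do that scan, sketched in Python. The helper names `find_lines` and `scan_log` are made up for this example, and the sketch assumes nothing about the log format beyond it being a text file; it simply reports every line containing the search string.

```python
def find_lines(lines, needle="s3sync"):
    """Return (line_number, text) pairs for lines containing needle.

    The match is case-insensitive; line numbers start at 1.
    """
    needle = needle.lower()
    return [(i, line.rstrip("\n"))
            for i, line in enumerate(lines, start=1)
            if needle in line.lower()]


def scan_log(path, needle="s3sync"):
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, encoding="utf-8", errors="replace") as f:
        return find_lines(f, needle)
```

For example, `for n, text in scan_log("hs.log"): print(n, text)` would list the matching lines, assuming the log was saved as `hs.log` in the current directory.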

mattjala commented 7 months ago

This was likely an issue with insufficient resources being allocated for HSDS. If the issue is identified again, this can be reopened. Closing for now.