LouisK130 / IFCB-Annotate

A web-based interface for classifying IFCB image data

strange cache file pathname causes 500 #42

Closed: joefutrelle closed this issue 6 years ago

joefutrelle commented 6 years ago

Somehow this is happening:

[Mon Jul 02 20:44:22.859526 2018] [wsgi:error] [pid 9476]   File "/var/www/classify/classify/views.py", line 161, in post
[Mon Jul 02 20:44:22.859527 2018] [wsgi:error] [pid 9476]     targets = utils.getTargets(current_bins, request.session['timeseries'])
[Mon Jul 02 20:44:22.859529 2018] [wsgi:error] [pid 9476]   File "/var/www/classify/classify/utils.py", line 96, in getTargets
[Mon Jul 02 20:44:22.859531 2018] [wsgi:error] [pid 9476]     new_targets = parseBinToTargets(bin, timeseries)
[Mon Jul 02 20:44:22.859532 2018] [wsgi:error] [pid 9476]   File "/var/www/classify/classify/utils.py", line 62, in parseBinToTargets
[Mon Jul 02 20:44:22.859534 2018] [wsgi:error] [pid 9476]     f = open(TARGETS_CACHE_PATH + '/' + bin + '_temp', 'w+')
[Mon Jul 02 20:44:22.859537 2018] [wsgi:error] [pid 9476] FileNotFoundError: [Errno 2] No such file or directory: '/var/www/classify/classify/cache/targets/http://ifcb-data.whoi.edu/mvco/D20170611T205930_IFCB010_temp'

I think this might be user error, but there should be a more graceful response than a 500.

LouisK130 commented 6 years ago

I suspect this is because one of you recently updated the timeseries labels from "http" to "https". The pattern matching isn't exactly robust; the one-character difference means the timeseries URL is not stripped from the bin name, and you end up with issues like this.

I try to validate bins by looking for a 404, in which case you would've seen a slightly friendlier error, but the dashboard is awfully lenient with URLs, which makes that tough. In this case, for instance, we called out to:

https://ifcb-data.whoi.edu/mvco/http://ifcb-data.whoi.edu/mvco/D20170611T205930_IFCB010

And because this responds happily, we consider

http://ifcb-data.whoi.edu/mvco/D20170611T205930_IFCB010

an appropriate bin name stripped of its timeseries, when in fact it's not. Can you suggest a better way to validate bin names? This is an issue I'm experiencing in v2 development as well.
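One way to make the prefix stripping tolerant of the http/https difference is to compare parsed URL components rather than raw strings. This is only a minimal sketch, not the code in `utils.py`: the function name `strip_timeseries` and its error handling are hypothetical, and it assumes the timeseries is configured as a full URL.

```python
from urllib.parse import urlsplit


def strip_timeseries(bin_id, timeseries_url):
    """Return bin_id without its leading timeseries URL.

    Hypothetical helper: the scheme (http vs. https) is ignored, so a
    one-character difference in the label no longer defeats the match.
    """
    bin_parts = urlsplit(bin_id)
    if not bin_parts.scheme:
        # No scheme at all: assume this is already a bare bin name.
        return bin_id

    ts_parts = urlsplit(timeseries_url)
    prefix = ts_parts.path.rstrip('/') + '/'

    same_host = bin_parts.netloc.lower() == ts_parts.netloc.lower()
    if not same_host or not bin_parts.path.startswith(prefix):
        raise ValueError('bin %r does not belong to timeseries %r'
                         % (bin_id, timeseries_url))

    # Keep only what follows the timeseries path.
    return bin_parts.path[len(prefix):]
```

With this approach, a bin from a different host or namespace raises `ValueError` instead of being passed through unstripped, which the caller could turn into a friendlier error than a 500.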

LouisK130 commented 6 years ago

This particular instance is actually because the bad bin has non-filename characters. I've added a bit of error checking that should resolve this, but a stricter bin verification method would still be very nice.
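The error checking that was actually added is not shown in this thread. As a rough illustration of the idea, a bin name could be checked against a conservative whitelist before it is used to build a cache path. The pattern and the name `is_safe_bin_name` below are assumptions, chosen only because the bin in the traceback consists of letters, digits, and underscores.

```python
import re

# Hypothetical whitelist: letters, digits, and underscores only.
# Slashes, colons, and dots are rejected, so a full URL can never
# end up inside a cache file path.
SAFE_BIN_NAME = re.compile(r'[A-Za-z0-9_]+')


def is_safe_bin_name(name):
    """Return True if name is safe to use as a cache file name."""
    return SAFE_BIN_NAME.fullmatch(name) is not None
```

A view could call this before opening the cache file and return a validation message to the user when it fails, rather than letting `open()` raise `FileNotFoundError`.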

joefutrelle commented 6 years ago

A bin ID is either a "pid", which is the full URL, or a "lid", which doesn't have the leading timeseries namespace. Getting the lid is pretty trivial because you can just chop off everything up to and including the last slash.
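The pid-to-lid rule described above can be sketched in a couple of lines. The function name `lid_from_pid` is hypothetical; the only logic is "keep what follows the last slash".

```python
def lid_from_pid(bin_id):
    """Return the lid for a bin ID given as either a pid or a lid.

    Everything up to and including the last slash is dropped; an ID
    with no slash is assumed to be a lid already and is returned as-is.
    """
    return bin_id.rstrip('/').rsplit('/', 1)[-1]
```

For example, `lid_from_pid('http://ifcb-data.whoi.edu/mvco/D20170611T205930_IFCB010')` yields `'D20170611T205930_IFCB010'`, independent of whether the pid uses http or https.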

The JSON returned has a "pid" field containing the pid.

For various reasons it's important that the dashboard accept pids and not just lids. A compromise might be to return the lid as a property in the JSON response. That would also fix the case where the URL ends with an extension or something like "_short" followed by an extension.
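On the client side, the compromise could look roughly like the sketch below. It assumes the dashboard's JSON has already been fetched and parsed into a dict; the "lid" key is the proposed property and does not exist yet, so the code falls back to deriving the lid from the existing "pid" field. The function name `bin_lid` is hypothetical.

```python
def bin_lid(response_json):
    """Return the lid for a bin from the dashboard's parsed JSON.

    Prefers a "lid" property (proposed in this thread, not yet part of
    the response) and otherwise derives it from the "pid" field.
    """
    lid = response_json.get('lid')
    if lid:
        return lid
    # Fallback: strip everything up to and including the last slash.
    return response_json['pid'].rsplit('/', 1)[-1]
```

Reading the identifier from the response instead of from the URL the user typed means a requested URL ending in an extension, or in "_short" plus an extension, would not leak into the bin name, provided the "pid" field itself carries no such suffix.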