astropy / pyvo

An Astropy affiliated package providing access to remote data and services of the Virtual Observatory (VO) using Python.
https://pyvo.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
74 stars 50 forks source link

Refactor the code that produces the name of the files returned by services #573

Open andamian opened 5 days ago

andamian commented 5 days ago

Refactor suggest_dataset_basename (which has fairly parallel implementations in ssap and sia on top) to produce correct file names across platforms. Discuss whether for security reasons we'd want to limit the charset to ASCII printables. See related #557

msdemlei commented 5 days ago

On Tue, Jul 02, 2024 at 06:18:23PM +0000, Adrian wrote:

Refactor suggest_dataset_basename (which has fairly parallel implementations in ssap and sia on top) to produce correct file names across platforms. Discuss whether for security reasons we'd want to limit the charset to ASCII printables. See related #557

When someone touches this code, I'd also like it if we re-considered the design where you can pass in directories; this can lead to surprising error message if the directory does not exist, and if we have no strong reasons to automatically write files outside the current directory, I'd prefer if we moved away from the corresponding functionality.

As to limiting things to ASCII printables... well, I suppose in pyVO we have to assume that the service operators are well-meaning; fending off possible attacks from services (of which there are many) is beyond our means. For instance, I would not forbid file names with leading dots (that will produce hidden files on unix-like machines) and not even files called "-r *" (which may have disastrous consequences for unwary users).

Still, allowing arbitrary characters in opens up a large trove of problems. For instance, you may produce file names that our users cannot type, or that will mess up directory displays, in particular when the names don't happen to be good utf-8. Also, different names may look the same, depending on the choices of glyphs, but even worse when the terminal font is missing glyphs and all you see is a series of non-informational fallbacks.

In short: I'm sure we should map the titles we get to ASCII. The trouble is to decide how. Just replacing all non-ASCII with an underscore is fast and foolproof but will lead to many collisions (which we already handle, so it's annoyance more than a problem, but still).

What worked reasonably well for me in other settings: Replacing all non-ASCII c with

html.entities.codepoint2name.get(c, "_")[0]

For instance, an ä will become an a, and so will å or á or even α.