astropy / astroquery

Functions and classes to access online data resources. Maintainers: @keflavich and @bsipocz and @ceb8
http://astroquery.readthedocs.org/en/latest/
BSD 3-Clause "New" or "Revised" License
705 stars 397 forks source link

Some astroquery modules can lead to a Remote Code Execution vulnerability #2777

Open Pharisaeus opened 1 year ago

Pharisaeus commented 1 year ago

In case you want to try to find and exploit the vulnerability yourself see the CTF challenge before reading rest of this ticket: https://hack.cert.pl/challenge/astrology The application to downloads some files with ALMA astroquery module, and it's enough to give attacker a Remote Code Execution under certain conditions.

Warning: spoilers ahead!

The exploitation chain described here is specific to ALMA astroquery module, however few other modules contain similar code and might also be vulnerable. ALMA was picked because it provides a simple way to fulfil some of the attack scenario prerequisites.

Stage 1 The attack requires convincing user to use attacker-controller URL. This is especially interesting for ALMA since there are multiple mirrors and it's not suspicious to see different valid URLs. There is no validation of the dataarchive_url to make sure it's a legitimate one, nor there is any kind of certificate-pinning to avoid MITM, so it is possible to use astroquery ALMA module with attacker-controlled URL.

Stage 2 Once astroquery is connected to attacker-controller URL all requests will go there, including TAP queries and file download requests. The vulnerability is located in the download_files function:

            try:
                filename = re.search("filename=(.*)",
                                     check_filename.headers['Content-Disposition']).groups()[0]
            except KeyError:
                log.info(f"Unable to find filename for {file_link}  "
                         "(missing Content-Disposition in header).  "
                         "Skipping to next file.")
                continue

            if savedir is not None:
                filename = os.path.join(savedir,
                                        filename)

Notice that it's implicitly assumed that filename is really just a filename, but if this header contains a relative path with some ../ components, then the result of os.path.join can jump outside of savedir and point to any path. This means that downloading a file from attacker-controlled URL can lead to arbitrary file write.

Stage 3 Arbitrary file write in itself can lead to data loss, but without root privileges can't be immediately escalated into code execution. Fortunately (or not) astroquery can help us with that. There is a feature of astroquery, enabled by default (at least for ALMA), which allows to turn arbitrary file write into remote code execution -> cache=True Astroquery caches HEAD requests into a pickle files in predictable locations. Pickle is not simple serialized data format (like json or xml) but rather a bytecode for a simple stack-based virtual machine. Loading a pickle allows to execute any python code we want. We can use the arbitrary file write from the content-disposition parsing bug to create (or overwrite) a pickle for some specific file with malicious content, and once user attempts to download this file the pickle will be loaded and code executed. This step requires attacker to know the astroquery cache directory location, which by default requires knowing the username to find the home dir location.

Suggestions:

  1. It might be worth to consider at least some verification of the URLs or even certificate pinning.
  2. Parsing Content-Disposition header clearly can be tricky, so it might be better to use some standard (and bug-free) implementation, instead of having each module have a its own.
  3. Pickle is a very dangerous choice for serialization mechanism and should be avoided unless absolutely necessary, and I'm not sure if this should be enabled by default at all.
keflavich commented 1 year ago

Given that a user must explicitly set a malicious URL in order for this exploit to occur, is it really a threat at all? One could just as easily use requests with an arbitrary URL and end up at the same problem.

The concern about pickling as the choice of caching format seems reasonable, though. Perhaps we'd be better off using a different module to dump & load caches. Only a few lines would need to be changed:

https://github.com/astropy/astroquery/blob/016f6b4da62495fef4314ad6f74ef0f0b56ad87f/astroquery/query.py#L38 https://github.com/astropy/astroquery/blob/016f6b4da62495fef4314ad6f74ef0f0b56ad87f/astroquery/query.py#L106 https://github.com/astropy/astroquery/blob/016f6b4da62495fef4314ad6f74ef0f0b56ad87f/astroquery/query.py#L124

Pharisaeus commented 1 year ago

Given that a user must explicitly set a malicious URL in order for this exploit to occur, is it really a threat at all?

Since there is no URL validation or certificate pinning, user might not realise this is the case (eg. a MITM scenario or malicious DNS, which are not difficult to achieve if someone is using an open wifi network for example). I'm not saying astroquery has to address this, however certificate pinning is a very common practice nowadays to avoid such issues in mobile applications. But I understand that this comes at added maintenance cost to keep track of expiration dates. But a simple check if the URL matches a well-known ALMA node and is pointing to https could already help.

One could just as easily use requests with an arbitrary URL and end up at the same problem.

No, this is not true. The invalid handling of Content-Disposition is still on astroquery side. In normal circumstances just downloading file from some URL does not result in arbitrary file write. If you download such file with curl or wget they will take only the filename stripping any path elements. According to the relevant RFC https://www.rfc-editor.org/rfc/rfc6266#section-4.3 it's on the recipient side (in this case the "client" which is astroquery) to make sure this is handled properly.

keflavich commented 1 year ago

For the content-disposition handling, at least, wrapping that in os.path.basename() would solve the relative path issue, no? Or do you have a recommendation on how to fix that particular problem?

I think you have identified two independent issues in different locations:

  1. possible arbitrary file write, which we protect against in some cases because _download_file checks for existing files first: https://github.com/astropy/astroquery/blob/016f6b4da62495fef4314ad6f74ef0f0b56ad87f/astroquery/query.py#L358 but indeed this can result in overwriting existing files.
  2. pickling could result in arbitrary code execution

I think these are both worth addressing. URL and certificate validation could also be useful, but I think is a lower priority.

Pharisaeus commented 1 year ago

Yes, I believe doing basename before join on the paths is enough to solve the path traversal issue and protect against arbitrary file write. Keep in mind that searching 'Content-Disposition' in the astroquery code shows few more hits which also look very similar, so this is not necessarily only ALMA module problem. Checking for overwriting existing files is not enough, since you can just as well create a pickle for some "future" file as long as you know the name ;)

I think you have identified two independent issues in different locations

Yes, those are unrelated issues, but they could be nicely chained into a single exploit and resulted in an interesting CTF problem :)

keflavich commented 1 year ago

Just to clarify, couldn't pickling result in arbitrary code execution without the content-disposition issue? I'm not seeing a reason the file write location affects whether or not the pickles contain malicious code. You would need to have a MITM or malicious host URL to get the bad pickle in the first place, but I think you don't need all three steps to produce arbitrary code execution, right?

Also for my own sake, I had to google CTF - I was unaware of cybersecurity capture-the-flag competitions before today.

Pharisaeus commented 1 year ago

Just to clarify, couldn't pickling result in arbitrary code execution without the content-disposition issue?

Of course it could, however attacker has to somehow place malicious pickle in the right location first! So you do need this multi-stage approach, because the path traversal and arbitrary file write vulnerability allows attacker to place the pickle payload in the cache. Astroquery does not load pickles from random places, only those from the cache.

Without the ability to inject a malicious URL you would not be able to do anything really (unless you could get malicious file into ALMA archive) Without the content-disposition issue you would have no way to write anything to the cache, at best you could just download some files to savedir. Without the pickle issue you would have trouble escalating arbitrary file write into code execution (with enough permissions maybe you could try to overwrite ~/.ssh/authorized_keys, but in general it's a non-trivial thing)

So you do need all those steps :)

Anyway, they challenge is still up (there is also a dockerfile if you want to run locally) so you can try to exploit it and see what is needed and what is not.

keflavich commented 1 year ago

Without the content-disposition issue you would have no way to write anything to the cache, at best you could just download some files to savedir.

This is the part I'm not following. You could trigger writing files to the default cache directory as pickles, then when you re-execute the same query a second time, the cache file is loaded. Why does it matter where the file is written if it contains executable code?

Pharisaeus commented 1 year ago

I really encourage you to try to write this exploit and it will all be clear ;)

You could trigger writing files to the default cache directory as pickles

Astroquery saves pickled HEAD responses into the cache directory. You can't place arbitrary or malicious contents this way, because this response has to be a valid HTTP response. The downloaded files normally go into user-selected savedir which is not the cache location, and as attacker you don't control this. As a result in normal circumstances you can't write malicious pickles into the cache directory. Cache directory contains pickled HEAD and savedir contains downloaded files.

Astroquery loads pickles from the cache location, not from any other location. It does not load pickles from the savedir where files are saved. As a result in normal circumstances you can't load a downloaded malicious pickle, because the downloaded file is in a wrong directory.

This is where the content-disposition vulnerability comes into play - it allows you to save the downloaded file in any location you want. We're no longer constrained by savedir and specifically we can now save the file into cache directory, from where pickles can be loaded.

Why does it matter where the file is written if it contains executable code?

It matters because astroquery will only load pickles from the cache directory. File written in a different location will not get executed.

edit: Maybe I understand the confusion if you're not familiar with how pickles work. If you create the pickle with some pickle.dumps(...) then you know what's inside and it's safe to load it. So the pickles saved by astroquery are not "dangerous" because the only "executable code" they contain is the code to reconstruct the HTTP Response object. The issue starts when you're loading pickles created by someone else - those can contain malicious code and are not safe to load. In this attack we send such malicious pickles as "downloaded files" and use the arbitrary file write vulnerability to place them in the cache directory, so they can be loaded by astroquery.

keflavich commented 1 year ago

Yes, I get the point now. I hadn't realized both that the dumps'd things were safe, though that makes sense, and that astroquery only loads pickles that have been dumped. We do cache other file types, but we never try to open them as pickles, so that's safe. Thanks.

Pharisaeus commented 1 year ago

and that astroquery only loads pickles that have been dumped

To be precise astroquery loads any pickles which happen to be in the cache directory and have specific name matching sha224. In normal circumstances indeed they are safe pickles dumped by astroquery, but as shown in this ticket, it's not always the case.

but we never try to open them as pickles, so that's safe.

Are those other cached files stored with arbitrary filenames? Or does astroquery mangle the names? Because if those other cached files can have any filename, then it's the same issue again -> someone could trick astroquery to cache a file with specially crafted name to match some archive file sha224, and in such case astroquery would open this file thinking it's a pickled HEAD.

keflavich commented 1 year ago

I can handle some of this

Alright, a checklist for the filename exploit:

PR in progress: https://github.com/astropy/astroquery/pull/2779

Pharisaeus commented 1 year ago

esasky: already checks for file extension, so there should be no easy exploit.

It's true that _extract_file_name_from_response_header only allows for specific extensions, however it still allows for path traversal. So while it's not possible to drop a pickle, it's still possible to overwrite some files, assuming they have the right extension. Definitely not critical, but it might be nice to make a basepath there anyway.

ceb8 commented 1 year ago

@keflavich @bsipocz You want me to change the caching mechanism from pickle to something safer yes?

keflavich commented 1 year ago

Yeah, I think that's best if we can find an appropriate alternative solution.