IAU-ADES / ADES-Master

ADES implementation based on a master XML file

Support stdin/stdout for Python command line conversion tools #48

Closed stevenstetzler closed 5 months ago

stevenstetzler commented 5 months ago

This PR allows Python conversion tools (files named *to*.py) to read from stdin and write to stdout. Additionally, it enforces input/output text encodings where they were previously not enforced. Tests are added in test_piping.py. Addresses issue #44.

stevenstetzler commented 5 months ago

@stevechesley I've added some comments to aid your review of this code change. The main change is to generalize how the conversion utilities (*to*.py) are defined and executed from the command line (i.e. argument parsing), since they all behave roughly like *to*.py <input> <output>. Since the existing Python functions that do the conversion operate on filenames, this code also adds a utility that automatically wraps stdin/stdout in temporary files and calls one of the conversion utilities. You'll see that each *to*.py file, when executed as a script (under if __name__ == '__main__'), now does something like the following:

1. Create an input/output argument parser.
2. Wrap its internal *to* Python function into a function with the signature f(input, output).
3. Call convertutility.call_with_files to execute this wrapped function, automatically converting stdin/stdout to temporary files/filenames.
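
To make that concrete, here is a minimal sketch of the pattern, written as a standalone example; the argument-parser details, the import of the conversion function, and the exact signature of convertutility.call_with_files are illustrative assumptions on my part, not the literal code in this PR:

import argparse
import convertutility                    # repo module providing call_with_files
from mpc80coltoxml import mpc80coltoxml  # the script's internal conversion function

if __name__ == '__main__':
    # 1) input/output argument parser; omitted arguments mean stdin/stdout
    parser = argparse.ArgumentParser()
    parser.add_argument('input', nargs='?', default=None)
    parser.add_argument('output', nargs='?', default=None)
    args = parser.parse_args()

    # 2) wrap the internal conversion function into f(input, output)
    def convert(input_file, output_file):
        mpc80coltoxml(input_file, output_file)

    # 3) call_with_files runs the wrapped function, mapping stdin/stdout onto
    #    temporary files/filenames when no filenames were given (assumed signature)
    convertutility.call_with_files(convert, args.input, args.output)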

I think this is close to the best we can do if we don't want to change the internals of any of the conversion scripts. There are two extensions we could take from here:

1. Instead of using temporary files on disk, create temporary I/O text buffers in memory. This can be done with io.StringIO objects.
2. Modify the internals of each of the *to* Python conversion functions to operate on both seekable file streams and filenames. I would add a new function to convertutility.py that either creates a new file stream by opening a filename or rewinds and reuses an existing file stream (a rough sketch of such a helper follows the two examples below). I think we would still have to use in-memory I/O buffers or temporary files to handle stdin, since we may need to read the input stream twice and stdin is a non-seekable stream.

In essence, this second extension would allow the user to do either

import mpc80coltoxml
mpc80coltoxml.mpc80coltoxml("input.obs80", "output.xml")

or

import mpc80coltoxml
with open("input.obs80", "r") as i, open("output.xml", "w") as o:
    mpc80coltoxml.mpc80coltoxml(i, o)

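For reference, here is a rough sketch of the kind of helper I would add to convertutility.py for this second extension; the name as_text_stream and its exact behavior are hypothetical:

import contextlib

@contextlib.contextmanager
def as_text_stream(source, mode="r", encoding="utf-8"):
    # accept either a filename or an already-open text stream
    if isinstance(source, str):
        # filename: open a new text stream and close it when done
        with open(source, mode, encoding=encoding) as stream:
            yield stream
    else:
        # existing stream: rewind it if possible and hand it back unchanged
        if source.seekable():
            source.seek(0)
        yield source

Each conversion function could then wrap its input and output arguments in as_text_stream and work purely with streams internally.
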
I'm not sure how much utility this would actually have, except perhaps to define conversion pipelines in pure Python using in-memory operations. For example, a script like this becomes possible:

import io
import mpc80coltoxml
import xmltojson

with open("input.obs80", "r") as obs80, open("output.json", "w") as json_out:
    xml = io.StringIO()  # in-memory text buffer
    mpc80coltoxml.mpc80coltoxml(obs80, xml)
    xml.seek(0)  # seek to the beginning of the in-memory XML text
    xmltojson.xmltojson(xml, json_out)  # the intermediate XML is never written to disk

Finally, the existing code only loosely adheres to defining and using an input/output text encoding, so I've added explicit encodings in a few spots where I noticed the code wasn't defining one.
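
For example, where an explicit encoding was missing, the file handling now looks something like this (utf-8 shown purely as an illustration of being explicit, not a claim about which encoding every call site uses):

# illustrative only: explicit text encodings on both sides of a conversion
with open("input.obs80", "r", encoding="utf-8") as i, \
        open("output.xml", "w", encoding="utf-8") as o:
    o.write(i.read())  # stand-in for calling one of the conversion functions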

Let me know what you think of this code change, and whether you think it adds useful functionality or just adds confusion to the tools. Also, as we discussed before, I think we can wait a bit before merging and stew over the desired functionality and implementation.