IAU-ADES / ADES-Master

ADES implementation based on a master XML file

Support stdin/stdout for Python conversion tools #44

Closed stevenstetzler closed 5 months ago

stevenstetzler commented 6 months ago

@stevechesley

The current Python conversion tools support reading and writing from/to files on disk, but it might be advantageous to read/write from/to stdin and stdout to support conversion pipelines. Currently, a pipeline to create an MPC 80 column obs file, convert it to XML and then do some processing on the output might look like:

$ create_mpc > obs.mpc
$ mpc80coltoxml.py obs.mpc obs.xml
$ parse_xml obs.xml

Using stdin/stdout piping, this pipeline can instead be:

$ create_mpc | mpc80coltoxml.py | parse_xml 

I've taken a look already, and it should be feasible to support this functionality for all of the conversion tools. The xmlto*.py and *toxml.py scripts already support

<program> <in> <out>

signatures, so it will be straightforward to also implement a

<in> | <program> 

signature. However, an issue comes up in updating mpc80coltoxml.py. The mpc80coltoxml.py script currently supports two program signatures/modes of operation:

mpc80coltoxml.py <in> <out> # convert <in> to <out>
mpc80coltoxml.py <in>       # check validity of <in>    

I think we can support all of

mpc80coltoxml.py <in> <out> # convert <in> to <out>
mpc80coltoxml.py <in>       # check validity of <in> 
<in> | mpc80coltoxml.py     # convert piped <in>, writing XML to STDOUT

while retaining backwards compatibility, although the new functionality introduces one ambiguity. This signature

<in> | mpc80coltoxml.py <out>

will check the validity of <out> and ignore <in>, which might not be what the user expects. This behavior arises because the signature can't be disambiguated (by number of arguments) from

mpc80coltoxml.py <in>

Should the functionality of

mpc80coltoxml.py <in>       # check validity of <in> 

be retained in this case?

Additionally, should reading/writing from/to stdin/stdout be supported in the Python API? If so, I will need to modify/extend the Python API to handle open files/streams instead of just filenames. Does backwards compatibility (passing strings) need to be maintained here as well?
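
For illustration, one way the API could accept either a file name or an already-open stream without breaking string-based callers might look like the following minimal sketch (the helper name is hypothetical, not part of the current API):

import contextlib

@contextlib.contextmanager
def open_or_passthrough(source, mode="r", encoding="utf-8"):
    # Hypothetical helper: accept either a path (str) or an open stream.
    # Strings keep the current file-name behavior; streams are yielded
    # unchanged and left open for the caller to manage.
    if isinstance(source, str):
        with open(source, mode, encoding=encoding) as f:
            yield f
    else:
        yield source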

stevechesley commented 6 months ago

@stevenstetzler I'm not sure there is a serious use case for the validation-only mode for mpc80coltoxml.py, and I'm doubtful that it is actually being used. So let's proceed with the following approach, which preserves the functionality and seems rather more intuitive:

mpc80coltoxml.py <in> <out>      # convert <in> to <out>
mpc80coltoxml.py <in>            # convert <in> to xml on STDOUT
mpc80coltoxml.py                 # convert obs80 on STDIN to xml on STDOUT
mpc80coltoxml.py --val-only      # check validity of obs80 on STDIN but do not create XML file
mpc80coltoxml.py --val-only <in> # check validity of <in> but do not create XML file
mpc80coltoxml.py --val-only <in> <out> # Error: cannot request output file for --val-only.

Here, I'm assuming that the input validation happens in all variants.
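
For what it's worth, a rough argparse sketch of that argument layout might look like this (the layout follows the signatures above; the argparse details are illustrative only):

import argparse

parser = argparse.ArgumentParser(prog="mpc80coltoxml.py")
parser.add_argument("infile", nargs="?", default=None,
                    help="obs80 input file; defaults to STDIN")
parser.add_argument("outfile", nargs="?", default=None,
                    help="XML output file; defaults to STDOUT")
parser.add_argument("--val-only", action="store_true",
                    help="validate the input but do not write XML")
args = parser.parse_args()

if args.val_only and args.outfile is not None:
    parser.error("cannot request output file for --val-only")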

Does this seem workable?

ldenneau commented 6 months ago

Hi,

If you're using Linux, you can compose stdin/stdout pipelines using the existing mpc80coltoxml.py file-based interface:

Files

./mpc80coltoxml.py test.mpc test.xml

Pipes

stdin

cat test.mpc | ./mpc80coltoxml.py /dev/stdin test.xml

stdout

./mpc80coltoxml.py test.mpc /dev/stdout | do_something

both

cat test.mpc | ./mpc80coltoxml.py /dev/stdin /dev/stdout | do_something

Larry

stevenstetzler commented 5 months ago

@stevechesley I've opened a PR (#48) that implements support for stdin/stdout reading and writing.

Because the existing Python interface supports file names, I essentially wrote a wrapper around the input/output streams that utilizes temporary files at the point of command line argument parsing. If a conversion tool reads from stdin, it will write the stream to a temporary file and then pass the name of that temporary file to the internal Python conversion function. If it writes to stdout, then it will write first to a temporary file, pass the temporary file name to the conversion function, and then write the contents of that file to stdout.
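
In rough outline, the stdin side of that wrapper does something like the following (a simplified sketch, not the exact code in the PR):

import shutil
import sys
import tempfile

def spool_stdin_to_tempfile(encoding="utf-8"):
    # Copy everything arriving on stdin into a named temporary file and
    # return its path, so the existing file-name-based conversion
    # functions can be reused unchanged.
    tmp = tempfile.NamedTemporaryFile(mode="w", encoding=encoding,
                                      suffix=".tmp", delete=False)
    with tmp:
        shutil.copyfileobj(sys.stdin, tmp)
    return tmp.name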

I looked into whether the internal Python API could support reading/writing from/to streams in addition to file names. Unfortunately, it became tricky to handle in the general case: we can't assume that a given conversion tool reads from or writes to a stream only once, and we sometimes need to mix and match input/output encodings. When dealing with file names, one simply re-opens the file to read it again, and this is what is done internally at present. When dealing with file streams, one instead needs to seek back to the beginning of the open file. I was able to implement a wrapper around the open builtin that the internal API can use and that handles this difference in behavior, but I got stuck when dealing with stdin, which is a non-seekable stream. To read from stdin twice, I either needed to implement a stream with in-memory history or fall back on a file stream pointing to a file on disk. The latter seemed easier, and since it relies on a temporary file on disk, it is equivalent to passing file names to the Python conversion tools, so I ended up taking that approach.

There are two places I noticed we read from an input file twice:

1. In validation, to check for the existence of the XML preamble: https://github.com/IAU-ADES/ADES-Master/blob/python_pipes/Python/bin/valall.py#L52-L58
2. In mpc80coltoxml.py, which parses the input a second time when an error occurs: https://github.com/IAU-ADES/ADES-Master/blob/python_pipes/Python/bin/mpc80coltoxml.py#L1087-L1093

I think we can work around (2) by re-implementing it: the second parse is only there to perform syntax checking on the input when an error occurs, and this can be done more elegantly.

For (1), there may be no clean solution, since checking for the XML preamble consumes that line from the input. If we read the XML into memory with lxml before this check, no preamble will ever be found; if we check before reading with lxml, the XML parser loses access to the contents of the preamble. I implemented the check this way because the in-memory parsed XML tree does not provide information about the existence or contents of the preamble.
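
To make the file-name case concrete, the check is essentially a peek at the first line (a hypothetical sketch, not the code in valall.py):

def has_xml_preamble(path, encoding="utf-8"):
    # With a file name we can open, read the first line, and close again;
    # the later lxml parse re-opens the same file and still sees everything.
    with open(path, encoding=encoding) as f:
        return f.readline().lstrip().startswith("<?xml")

# With a non-seekable stream such as sys.stdin, the same readline() would
# remove the preamble from what the XML parser subsequently receives.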

Unless it's really desirable to handle stdin/stdout and/or open file streams in the Python API, and not just at the command line, I think it's better to leave the API as is. If it is desirable, I can keep looking into how to implement this well, and we can try to work around the specific cases above where we need to read the stream twice.

stevenstetzler commented 5 months ago

@ldenneau thanks for the suggestion and pointer. You're right that on Unix systems we can reference /dev/stdin and /dev/stdout as files to read from stdin and write to stdout. Unfortunately, this doesn't port to other operating systems like Windows, so I looked into a Python-based option. It also doesn't give control over the encoding Python uses to decode the byte stream from stdin. In practice it will work fine on most Unix systems, since UTF-8 seems to be the typical default when Python opens files there and appears to be the standard text encoding throughout the ADES tools and elsewhere.
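
For reference, the encoding can be pinned down on the Python side by re-wrapping the underlying byte stream, e.g.:

import io
import sys

# Decode stdin explicitly as UTF-8 instead of relying on the
# platform/locale default.
stdin_utf8 = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")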

stevechesley commented 5 months ago

Resolved with PR #48