MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.
https://paleomix.readthedocs.io/en/stable/
MIT License
43 stars 19 forks source link

Please try to drop code copy of pyyaml #11

Closed tillea closed 4 years ago

tillea commented 7 years ago

Hi, I'm working on Debian packages for paleomix. The packaging is done in the Debian Med team which is packaging free software in life sciences for official Debian. When doing the packaging I was stumbling upon a code copy of pyyaml which has some slight changes. If these changes are relevant for paleomix it would be better to file an issue to official pyyaml to merge these changes rather than maintaining an code copy. Kind regards, Andreas.

MikkelSchubert commented 7 years ago

Hi Andreas, The PyYAML code is included in order to address two issues, namely mis-parsing of numbers in the format 5e7 (scientific notation without decimals) and the silent overwriting of duplicate keys. Both of these issues have been reported in the PyYAML bug-tracker, 5 and 8 years ago, respectively: http://pyyaml.org/ticket/246 http://pyyaml.org/ticket/128

It looks like the first issue is fixed in the ruamel.yaml fork of PyYAML, and the second issue can be caught using a custom loader. However, ruamel.yaml also drops support for some implicit types used in paleomix YAML files, so switching to that will require some non-trivial work to handle. I'll try to take a stab at that at some point.

Cheers

tillea commented 7 years ago

Hi Mikkel, thanks for your prompt response.

On Thu, Feb 23, 2017 at 12:21:24AM -0800, Mikkel Schubert wrote:

The PyYAML code is included in order to address two issues, namely mis-parsing of numbers in the format 5e7 (scientific notation without decimals) and the silent overwriting of duplicate keys. Both of these issues have been reported in the PyYAML bug-tracker, 5 and 8 years ago, respectively: http://pyyaml.org/ticket/246 http://pyyaml.org/ticket/128 OK. It looks like the first issue is fixed in the ruamel.yaml fork of PyYAML, and the second issue can be caught using a custom loader. However, ruamel.yaml also drops support for some implicit types used in paleomix YAML files, so switching to that will require some non-trivial work to handle. I'll try to take a stab at that at some point. I admit I have no idea about the goals of ruamel.yaml. I can confirm that it is available as Debian package as well. May be it might make sense to contact its authors to keep the support of the implicit types you need? Kind regards, Andreas.

MikkelSchubert commented 7 years ago

Hi Andreas, The "problem" is that ruamel is based on YAML 1.2, which no longer accepts values 'yes', 'no', 'on', 'off', as boolean values (and possible other types). This is then something that I'll have to handle myself, since standard paleomix YAML file do make use of such values, and I do not consider breaking most existing paleomix YAML files (for no real benefit) to be acceptable .

Cheers

tillea commented 7 years ago

hh, thanks for the explanation.

AvdN commented 6 years ago

Yes and No are no longer booleans in 1.2 and scientific numbers without mantissa are not part of 1.1. So you would have a problem with ruamel.yaml if it supported only YAML 1.2, but since it also supports 1.1 (even without needing the %YAML 1.1 tag, and only issues a (surpressable) warning for the scientific numbers you can do:

import warnings
import ruamel.yaml

warnings.simplefilter('ignore', ruamel.yaml.error.MantissaNoDotYAML1_1Warning)

yaml = ruamel.yaml.YAML(typ='safe', pure=True)
yaml.version = (1, 1)

data = yaml.load("""\
- Do you want booleans?: 
   - Yes
   - Off
- scientific numbers like: 5e7
""")

print(data)

to get [{'Do you want booleans?': [True, False]}, {'scientific numbers like': 50000000.0}]

If you reuse a key in a dict, you get a DuplicateKey warning (suppressable to give your users time to correct their files).

As for the goals of ruamel.yaml: I only wanted to make a comment preserving Loader/Dumper, but my intitial PR for remerging the Python2 and Python3 copies of PyYAML got no response and the code base is still 99% duplicated. So I moved forward merging Python2 and 3 code, going over all the bugs reported on pyyaml.org and all the bugs and PRs against the repository on Bitbucket and fixing as many as I could. Then I incorporated the C source for libyaml and fixed the most notorious bugs reported on that.

My main focus has been on the round-trip support (preserving comments, aliases etc), but I fixed many issues in the traditional (Safe)Loader, that are still are part of the PyYAML code base. (Note that the move of pyyaml to github did only take the issues on bitbucket along, not the few hundred issues that were still on pyyaml.org)

tillea commented 6 years ago

Hi Anthon. thanks for that detailed explanation. Without understanding all the details I read it like: You tried hard to get rid of the code copy but it is not yet possible to do this. Is this correct? Kind regards, Andreas.

AvdN commented 6 years ago

@tillea As far as I understand it the "code copy" refers to the yaml subdirectory, which you originally asked to get rid off. I did not try to get rid of that code. I only tried to explain that you could use ruamel.yaml as an alternative for that code copy, and how to deal with the two problem that Mikkel did foresee. It is up to him to get rid of the code, and using ruamel.yaml and I think it will work given the details I provided (but he is the one to test that).

(IIRC ruamel.yaml is available as a debian package, so that should make life more easy).

I am of course available to help out if there are questions/problems.

tillea commented 6 years ago

Thanks a lot for the clarification, Andreas.