EmilStenstrom / conllu

A CoNLL-U parser that takes a CoNLL-U formatted string and turns it into a nested python dictionary.
MIT License
311 stars 50 forks source link

Added the ability to read variables from a comment #13

Closed vadim-isakov closed 6 years ago

vadim-isakov commented 6 years ago

Thank you for the wonderful library.

Files can contain variables as comments. I propose a function for reading variables.

EmilStenstrom commented 6 years ago

Hi! Thanks for making a pull request!

I have a couple of questions before deciding on this:

  1. I don’t see any variables in the example you added to README.md. Did you add the wrong example?
  2. Is there a specification somewhere of how variables work? I’d like to check that we don’t miss any corner cases.
EmilStenstrom commented 6 years ago

Ah, I see the second commit now. That answers the first question. Do you have an answer for the second?

vadim-isakov commented 6 years ago

Hi! There is a specification on the site http://universaldependencies.org/format.html in the following paragraphs: "Untokenized Text", "Sentence Boundaries and Comments", "Paragraph and Document Boundaries".

Important points:

It seems as comments can be without a variable so I made the new commit where program works, if there are comments without variables (metadata) but we don't extract them.

P.S. I often find conllu files with different variables.

Update: Made a new commit in which I replaced word 'variables' to 'metadata' it is so mentioned in the specification and corrected the example of the answer in the README for clarity.

EmilStenstrom commented 6 years ago

Thanks for the references! I do want this in the library. I’m a little sure about adding this as a separate entry point, instead of it being part of the parse and parse_tree functions. I’m currently away on vacation, but I’ll have a look at this in a couple of days when I’m back again.

EmilStenstrom commented 6 years ago

Hi! After thinking about this code a bit more, I don't like the fact that there is now two different entry points, parse and parse_with_comments. I'd like parse to include comments as well. But the list returned by parse isn't easily extended to include comments without breaking backwards compatibility, so that's the catch.

I've played around with the code in a separate branch a bit, and I think I have an idea, but I'll do that separately.

So, the plan is to include this PR first, and then separately work on a nicer API.

To get this merged, I wanted the tests in Travis to pass first. That required proper formatting according to flake8, so I fixed that, and 100% test coverage, so I added that too. As far as I'm concerned, this is ready to be merged.

EmilStenstrom commented 6 years ago

Conllu 0.11 is now released to PyPi: https://pypi.org/project/conllu/

EmilStenstrom commented 6 years ago

Just a heads up. I just released conllu 1.0, which implements the changes I described above. Now comments are always parsed when using the normal parse function, and you can get them by accessing the .metadata attribute on the returned TokenList or TokenTree. Hope I didn't mess too much with your current code :)