Closed vadim-isakov closed 6 years ago
Hi! Thanks for making a pull request!
I have a couple of questions before deciding on this:
Ah, I see the second commit now. That answers the first question. Do you have an answer for the second?
Hi! There is a specification on the site http://universaldependencies.org/format.html in the following paragraphs: "Untokenized Text", "Sentence Boundaries and Comments", "Paragraph and Document Boundaries".
Important points:
It seems as comments can be without a variable so I made the new commit where program works, if there are comments without variables (metadata) but we don't extract them.
P.S. I often find conllu files with different variables.
Update: Made a new commit in which I replaced word 'variables' to 'metadata' it is so mentioned in the specification and corrected the example of the answer in the README for clarity.
Thanks for the references! I do want this in the library. I’m a little sure about adding this as a separate entry point, instead of it being part of the parse and parse_tree functions. I’m currently away on vacation, but I’ll have a look at this in a couple of days when I’m back again.
Hi! After thinking about this code a bit more, I don't like the fact that there is now two different entry points, parse and parse_with_comments. I'd like parse to include comments as well. But the list returned by parse isn't easily extended to include comments without breaking backwards compatibility, so that's the catch.
I've played around with the code in a separate branch a bit, and I think I have an idea, but I'll do that separately.
So, the plan is to include this PR first, and then separately work on a nicer API.
To get this merged, I wanted the tests in Travis to pass first. That required proper formatting according to flake8, so I fixed that, and 100% test coverage, so I added that too. As far as I'm concerned, this is ready to be merged.
Conllu 0.11 is now released to PyPi: https://pypi.org/project/conllu/
Just a heads up. I just released conllu 1.0, which implements the changes I described above. Now comments are always parsed when using the normal parse function, and you can get them by accessing the .metadata attribute on the returned TokenList or TokenTree. Hope I didn't mess too much with your current code :)
Thank you for the wonderful library.
Files can contain variables as comments. I propose a function for reading variables.