arademaker / hs-conllu

CoNLL-U/UD library
GNU Lesser General Public License v3.0
3 stars 3 forks source link

Full validation feature #34

Open funarog opened 2 years ago

funarog commented 2 years ago

Is there a plan to incorporate the same level of validation performed by the Universal Dependency tool, validate.py, found here https://github.com/UniversalDependencies/tools? Also Is there a plan to do a performance comparison between the official Universal Dependency validation python script and hs-conllu?

The universal dependency organization's validation software provides 5 levels of validation. Because it is written in python, I suspect it would be slower than written in Haskell or C++.

odanoburu commented 2 years ago

Hi @funarog!

There is no such plan, no :/ I'm afraid that having a competing implementation without a formal specification of UD constraints would lead to countless mismatches between them, not the mention the duplication of effort!

I do have some related plans; my biggest problem with UD validation is not slowness, but that it is not declarative. Adding new rules is likely hard to newcomers, and I'd wager the validation script could be made easier to maintain. But I'm curious: is slowness such a great problem for you? My plans might help with that too, and if you have a concrete problem that might motivate me to actually work on it :P (and perhaps you are even willing to collaborate on it!)

funarog commented 2 years ago

Bruno Cuconato

Thank you for responding. I am interested in collaboration. My primary interest is in creating a mapping between syntactic structure and the underlying semantic structure; linguistic mappings in general.

I agree there is not a formal specification of UD constraints. I am currently reciewing the UD tool validate.py and UD web page ( https://universaldependencies.org/validation-rules.html ) to determine if the rules can be more declarative. That would allow a parser generator to be built. I plan is to contact the UD developers and find out if that can be done.I believe it would be beneficial to the computational linguistic community.

I agree with your statement: Adding new rules is likely hard to newcomers, and I'd wager the validation script could be made easier to maintain. It would also make it easier for the computational linguistic community to have easier access to the consul formats.

Performance was just a curiousity. If a parser generator could easily be built one could easily use c++ Spirit and determine the benefits of different libraries/ languages. Performance probably matters to the UD contributors, since it appears they validate every language treebank on a regular bases.

I am slowly getting familiar with Haskell, though I have been programming for 45 years.

Keep in touch and we can determine where are interests intersect.

By the way, I find hs-conllu very useful. I had started writing my own parser when I ran across hs-conllu. Which I have been reviewing the code. If I run across some suggested improvementsI will let you know.

Thanks again.

Greg Funaro

On Oct 21, 2021, at 7:41 AM, bc² @.***> wrote:

Hi @funarog!

There is no such plan, no :/ I'm afraid that having a competing implementation without a formal specification of UD constraints would lead to countless mismatches between them, not the mention the duplication of effort!

I do have some related plans; my biggest problem with UD validation is not slowness, but that it is not declarative. Adding new rules is likely hard to newcomers, and I'd wager the validation script could be made easier to maintain. But I'm curious: is slowness such a great problem for you? My plans might help with that too, and if you have a concrete problem that might motivate me to actually work on it :P (and perhaps you are even willing to collaborate on it!)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

funarog commented 2 years ago

Bruno Cuconato I made some modifications to hs-conllu. Specifically, I converted String to Text and some design changes. I got about 137% improvement in speed and about 20% reduction in memory. Tried to push the changes into a new branch stringToText but ran into problems

odanoburu commented 2 years ago

hi @funarog , that's great! there are certainly design changes to be made, and it's cool you got a performance improvement :)

I don't think you can push to my version of the repository directly, so you'd have to push to your mirror/fork and make a pull request, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

funarog commented 2 years ago

Okay

I forked and pushed the changes to here: https://github.com/funarog/hs-conllu/blob/stringToText/src/Conllu/Print.hs https://github.com/funarog/hs-conllu/blob/stringToText/src/Conllu/Print.hs

It looks to me there is a pull request here: https://github.com/odanoburu/hs-conllu/pulls https://github.com/odanoburu/hs-conllu/pulls

There were a few design changes easily seen with diff.

On Jun 27, 2022, at 7:01 AM, bc² @.***> wrote:

hi @funarog https://github.com/funarog , that's great! there are certainly design changes to be made, and it's cool you got a performance improvement :)

I don't think you can push to my version of the repository directly, so you'd have to push to your mirror/fork and make a pull request, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request — Reply to this email directly, view it on GitHub https://github.com/odanoburu/hs-conllu/issues/34#issuecomment-1167264517, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFB7CS7PT7ANWIE77TLJWQLVRGJ2NANCNFSM5GMEG4UA. You are receiving this because you were mentioned.

odanoburu commented 2 years ago

Hi @funarog , I don't think you've managed to create the pull request, although the fork was successful and the changes have been committed. I could pull the changes myself, but if you do open the PR I can then comment the changes on a line-by-line basis, because some things I disagree with (e.g. removing the Enum instance from EP), and others I don't really understand (and would like to ask about)

funarog commented 2 years ago

Bruno

I will attempt the pull request again; probably later this evening.

Regarding changes. My ultimate goal is to write a validating parser with the capabilities of the python validating parser used by the ConLLU formatting group found here: https://github.com/UniversalDependencies/tools/blob/master/validate.py https://github.com/UniversalDependencies/tools/blob/master/validate.py and with improved speed. As you already know simply converting the hs-connlu code from string to text will not guarantee a speed up. I was pleased to see a speed up. But I did not do an in depth performance analysis to see from where it came.

Any changes I made 1) were done so that I could understand better what the hs-conllu code was doing 2) remove warnings or errors 3) try to follow recommended haskell practices.

I would not change code simple for the sake of changing code. The changes I made, made it easier for me to follow and understand the code. I am not a Haskell expert by any means. (though I have been programming in the large for about 50 years in various languages). I removed the items like enum because it did not seem to impact the code, And, for the case of EP, I followed LYAHFGG: Enum members are sequentially ordered types — they can be enumerated. The main advantage of the Enum typeclass is that we can use its types in list ranges. They also have defined successors and predecesors, which you can get with the succ and pred functions.

Since this did not seem the case, I removed the Enum derivation.

If there is something I missed or a benefit that I am probably not aware, I would appreciate any insights you may have.

Thanks

Greg

On Jun 29, 2022, at 7:55 AM, bc² @.***> wrote:

Hi @funarog https://github.com/funarog , I don't think you've managed to create the pull request, although the fork was successful and the changes have been committed. I could pull the changes myself, but if you do open the PR I can then comment the changes on a line-by-line basis, because some things I disagree with (e.g. removing the Enum instance from EP), and others I don't really understand (and would like to ask about)

— Reply to this email directly, view it on GitHub https://github.com/odanoburu/hs-conllu/issues/34#issuecomment-1169945741, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFB7CS4OBSR44UXDD7RJIO3VRRBV5ANCNFSM5GMEG4UA. You are receiving this because you were mentioned.

odanoburu commented 2 years ago

My ultimate goal is to write a validating parser with the capabilities of the python validating parser used by the ConLLU formatting group found here

I see. That's an endearing goal; is it really no up to speed? I would be happy to merge any speed improvements on the parser and better design changes, but I don't think I'd want to extend the scope of the library to this level of validation. (You can of course use this library as a dependency of a new one.)

I would not change code simple for the sake of changing code.

Of course not :) But if we disagree we'll have to discuss before merging anything, and that's when the pull request interface shines! Sometimes that's too much trouble, feel free to use your fork as a dependency instead, and if after some time your library is clearly better and with a similar scope I'd be glad to forgo the hs-conllu name on hackage.

funarog commented 2 years ago

Bruno

I think this is a pull request: https://github.com/odanoburu/hs-conllu/pull/35 https://github.com/odanoburu/hs-conllu/pull/35

As far as accepting changes or modifications, feel free to accept or reject anything you wish.

Greg

On Jun 27, 2022, at 7:01 AM, bc² @.***> wrote:

hi @funarog https://github.com/funarog , that's great! there are certainly design changes to be made, and it's cool you got a performance improvement :)

I don't think you can push to my version of the repository directly, so you'd have to push to your mirror/fork and make a pull request, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request — Reply to this email directly, view it on GitHub https://github.com/odanoburu/hs-conllu/issues/34#issuecomment-1167264517, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFB7CS7PT7ANWIE77TLJWQLVRGJ2NANCNFSM5GMEG4UA. You are receiving this because you were mentioned.

arademaker commented 2 years ago

@funarog , would you be interested in implementing the support to read/write conllup

https://universaldependencies.org/ext-format.html

??

funarog commented 2 years ago

Yes. I am interested in implementing conllup. My current goal is to provide the parsing capabilities found in the python script: https://github.com/UniversalDependencies/tools/blob/master/validate.py .Then add conllup. IS there an urgent need for conllup format?

arademaker commented 2 years ago

Yes, in the https://universalpropositions.github.io/ we are using mainly conllup format. I do have many details to fix and I would love work with haskell .. ;-)

arademaker commented 2 years ago

For validation , as @odanoburu said. We need more thinking and probably an declarative approach .. I have some ideas…