JuliaText / CorpusLoaders.jl

A variety of loaders for various NLP corpora.

SenseEval support #9

Closed oxinabox closed 5 years ago

oxinabox commented 6 years ago

Rada Mihalcea provides the SenseEval 2 and 3 corpora in SemCor format: http://web.eecs.umich.edu/~mihalcea/downloads.html#sensevalsemcor

Because we already have a SemCor parser, we basically already support them. It is more a matter of writing the data deps registration than any real parsing.

This would be a good and easy PR to make.

ksteimel commented 6 years ago

I can help out with this after next week. However, should I wait until your 'fresh' branch is merged?

oxinabox commented 6 years ago

Yes, I'll try to merge it before next week then. It's vaguely waiting for me to port more stuff from the old version, though I can look up the old version from its tag.

oxinabox commented 6 years ago

@ksteimel that took longer than expected, but the fresh branch is now merged.

Ayushk4 commented 5 years ago

I would like to work on this issue.

oxinabox commented 5 years ago

Feel free. I will review any PRs.

Ayushk4 commented 5 years ago

I am receiving the following error while parsing one of the files in the senseval2 corpus: Error parsing "<wf cmd=done id=d00.s09.t01 pos=NNS lemma=other wnsn=0 lexsn=U>others</wf>". ErrorException("type Void has no field captures")

This error has been traced back to a function similar to the one here: https://github.com/JuliaText/CorpusLoaders.jl/blob/58c824dbff95cbb3c3107377750a54d909944932/src/SemCor.jl#L33

The error occurs because the lexsn pattern fails to match (the value here is U), so the match comes back empty.

What might be the best way around this? One way could be to add exception handling and change the match expression for it. Is there a better way to do this?

oxinabox commented 5 years ago

I would not add exception handling. Julia code prefers to be written to avoid exceptions rather than handle them. (Unlike Python, Julia exception handling is pretty slow.)

I think the regex can probably be relaxed a little in this bit: lexsn=(\d.*:\d*)

So that U is also acceptable. Some of the logic after that will also want adjusting. But in the SenseAnnotatedWord the type of the lexsn field is still String.
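
A minimal sketch of the relaxation suggested above, assuming a simplified pattern rather than the actual regex in SemCor.jl. The function name `parse_wf`, the returned NamedTuple, and the exact pattern are hypothetical and only illustrate two points: accepting `U` as a lexsn value, and branching on a failed match instead of catching an exception.

```julia
# Hypothetical helper, not the parser in SemCor.jl.
# Returns a NamedTuple of the captured fields, or `nothing` if the line
# is not a `<wf cmd=done ...>` element of the expected shape.
function parse_wf(line::AbstractString)
    # lexsn accepts either a digit-led sense key (e.g. 1:03:00::) or a bare U.
    pattern = r"<wf cmd=done\b.*? pos=(\S+) lemma=(\S+) wnsn=(\S+) lexsn=(\d[^\s>]*|U)>(.*)</wf>"
    m = match(pattern, line)
    # `match` returns `nothing` when the pattern does not match, so check
    # before touching `.captures` (the source of the reported error).
    m === nothing && return nothing
    pos, lemma, wnsn, lexsn, word = m.captures
    # lexsn stays a String whether it is a sense key or "U".
    return (pos=String(pos), lemma=String(lemma), wnsn=String(wnsn),
            lexsn=String(lexsn), word=String(word))
end

unassigned = parse_wf("<wf cmd=done id=d00.s09.t01 pos=NNS lemma=other wnsn=0 lexsn=U>others</wf>")
tagged = parse_wf("<wf cmd=done pos=NN lemma=group wnsn=1 lexsn=1:03:00::>group</wf>")
```

With this shape, a caller can test the result against `nothing` and decide what to do with unmatched lines, and any later logic that splits the sense key would need to treat `"U"` as a special case.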