JuliaText / CorpusLoaders.jl

A variety of loaders for various NLP corpora.

CoNLL support #10

Open ksteimel opened 6 years ago

ksteimel commented 6 years ago

It'd be good to support the CoNLL format in a generic sense (and then perhaps some of the more specific CoNLL formats as an offshoot). I'd be happy to work on this if you think it would be worthwhile.
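For context, "generic" CoNLL here can be read as the layout the shared-task files have in common: one token per line, whitespace-separated columns, and a blank line between sentences. A minimal sketch of a reader for that layout, assuming nothing about the eventual CorpusLoaders.jl API (`read_conll` is a hypothetical name used only for illustration):

```julia
# Hypothetical helper, not part of CorpusLoaders.jl or MLDatasets.jl.
# Reads CoNLL-style text: one token per line, whitespace-separated columns,
# blank lines separating sentences.
function read_conll(io::IO)
    sentences = Vector{Vector{Vector{String}}}()
    current = Vector{Vector{String}}()
    for line in eachline(io)
        stripped = strip(line)
        if isempty(stripped)
            # A blank line closes the current sentence, if there is one.
            isempty(current) || push!(sentences, current)
            current = Vector{Vector{String}}()
        else
            # Each token line becomes a vector of its column values.
            push!(current, String.(split(stripped)))
        end
    end
    # Flush the last sentence when the input does not end with a blank line.
    isempty(current) || push!(sentences, current)
    return sentences
end

sample = """
EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
"""

sentences = read_conll(IOBuffer(sample))
```

Leaving the columns uninterpreted keeps the reader generic; a format-specific loader (for example CoNLL 2003) could then assign meaning to each column on top of it.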

oxinabox commented 6 years ago

Yes, I think it is worth it.

@evizero has support for it in MLDatasets.jl https://github.com/JuliaML/MLDatasets.jl/blob/master/src/CoNLL.jl which would be a starting point.

If that is ported across, enhanced to match the CorpusLoaders style, and working well, perhaps we can talk about deprecating it out of MLDatasets.jl. Though there may be advantages to having two loaders, since MLDatasets.jl's is probably much simpler.

Ayushk4 commented 5 years ago

I am starting by adding the CoNLL 2003 corpus. The original files from the shared task are freely available.

To extract the required files from them, one needs the Reuters Corpus file rcv1.tar.xz and has to build the original files with it. This is available from the Harvard Dataverse or the NIST website. However, obtaining the Reuters corpus requires a user agreement, and it may take some time for the request to be approved.

Alternatively, there are prebuilt CoNLL 2003 files that are openly available.

I feel it will be very difficult to handle the download step with the former method, so I think I should go with the latter approach. What do you suggest in this case?

Edit: I feel the latter approach will be simpler overall, and also easier to extend to other CoNLL datasets.

oxinabox commented 5 years ago

The latter sounds legit.