Docs for Accessing Corpora

SigmaX commented 6 years ago

Either I'm missing something obvious (which is likely), or CLTK offers no documentation on how to use the various corpora the project provides.

After importing the greek_text_perseus corpus, for example, its README.md tells me

This repository holds the Greek files made available by the Perseus Project. See the CLTK's docs for instructions on how to use these files.

The docs, however, only cover how to download corpora and how to process raw text stored in a Python variable, respectively, omitting the intermediate steps. There is no mention of how one might import a corpus after downloading it (which, I see from this external blog, seems to be a thing?), or how one might otherwise get ahold of a CorpusReader object (assuming such a thing exists, which is not clear from the docs).

From all this I infer that it seems we are intended to

Use CLTK to conveniently download corpora, but not to load them.
Use NLTK or some other 3rd-party tool to load the corpora directly from the resulting text or XML files.
Proceed as usual with NLP analysis, turning to CLTK only when we need language-specific processing capabilities at a low level.

Am I correct in piecing together this puzzle? If so, I haven't seen such a scheme spelled out anywhere in the docs. Perhaps I am blind to something?

clemsciences commented 6 years ago

I totally agree. I had the same thoughts when I used for the first time cltk. My idea to explain that is to make a new tutorial with that intermediate part explained.

diyclassics commented 6 years ago

@SigmaX—thanks for starting this discussion. There is a lot of work that could be (and should be) done with making corpora easier to work with. This is why I wrote a PlaintextCorpusReader wrapper for the CLTK Latin Library corpus (basically, # 2 from your list); cf. https://disiectamembra.wordpress.com/2016/08/11/working-with-the-latin-library-corpus-in-cltk/. I'll be sure to add this functionality to the docs.

There has been some discussion here of adding more wrappers like this, esp. XMLCorpusReader wrappers for the Perseus texts (cf. https://github.com/cltk/cltk/issues/554). If there is interest, I can revisit this. I'd be happy to hear which other corpora you would like better access to as well.

Also, my guess is that # 3 from your list is the way CLTK is used for the most part. But in the interest of a self-contained NLP workflow, I think a better defined pipeline from corpus/data to analysis would be worth pursuing.

SigmaX commented 6 years ago

@diyclassics The first thing I did is hack together my own (probably flawed) XML reader for Perseus, so I agree that providing that would be a useful feature!

IMO, though, 80% of the problem can be solved just by pointing out in the docs that new users will have to find a way to parse the corpora themselves. In my case, it took me longer to figure out that I needed to manually parse the corpus I was interested in than it did to actually parse it!

Thanks so much for providing these tools! It's an exciting time to be alive.

kylepjohnson commented 6 years ago

Thank you @SigmaX for raising this valid issue. The challenge with the "toolkit" idea is that things can get messy with more and more contributions come in (a good problem to have, I suppose :)

For tutorials, we have a special repo, however it's never quite been polished enough that I wanted to push it in the docs: https://github.com/cltk/tutorials. Someone could make some notebooks illustrating how to put the pieces together.

Our Greek corpus reader uses a 3rd party tool (MyCapytain) which currently only works for Greek: http://docs.cltk.org/en/latest/greek.html#tei-xml.

The first thing I did is hack together my own (probably flawed) XML reader for Perseus

We'd love to see it. Could you drop it in a gist, with an example of how to run it, so we can take a look?

SigmaX commented 6 years ago

@kylepjohnson : well, all I did was write a brittle loop that pulled out the text inside every <p> or <q> element, which seemed adequate for the specific text I was working with (Perseus' Meditations). The TEI DTD is very intricate, however, so it would take some work to tell exactly what is needed generalize accurately to arbitrary corpora.

I also took a stab at getting NLTK's XMLCorpusReader to work.

The first issue I encountered is that it it pulls ElementTree's text attribute from every tag (which seemed reasonable), but in ElementTree's (somewhat strange) DOM interpretation, text turns up empty or incomplete if empty tags occur inside the text. So I modified XMLCorpusReader to also pull ElementTree's tail attribute—this way it really does extract all text from every node.
The second is that XMLCorpusReader croaks on entity references defined in the TEI DTD. Trying to load ~/cltk_data/greek/text/greek_text_perseus/Epictetus/opensource/epictetus_gk.xml, for instance, yields ParseError: undefined entity &responsibility;: line 13, column 0.

Of course, you've already noted (#554) that the XMLCorpusReader strategy, when it does work, pulls unwanted metadata anyway. And now that I realize Captains/myCaptain exists (#560), I'll probably try going that route next.

kylepjohnson commented 6 years ago

And now that I realize Captains/myCaptain exists (#560), I'll probably try going that route next.

Because of the complexity of TEI, I think this probably the best thing to do.

Since you're clearly an able coder, we'd be interested in seeing how you solve this one, even if you don't think the code is production-ready. And of course we're happy to help with any issues you want an extra pair of eyes on.

Should we close this?

SigmaX commented 6 years ago

Should we close this?

Personally I'd leave it open until there is at least a sentence in the docs either pointing to a tutorial or saying "you need to figure out how to import the corpora yourself" or such.

kylepjohnson commented 6 years ago

Sure. I'll take care of this and post back here for people to comment on :)

jtauber commented 6 years ago

The helper code that Eldarion is developing on top of MyCaptains for the new Perseus will likely help with this. It will hopefully be open source in the next month or so.

markomanninen commented 6 years ago

Could someone point to the tutorial where greek perseus corpora is used to read any text say homer iliad as a plain readable format?

diyclassics commented 6 years ago

@markomanninen—Perseus reader is still an open issue (#361, e.g.).

There are some ways to go about it outside of CLTK though: 1. MyCapytain (cc: @PonteIneptique) http://mycapytain.readthedocs.io/en/latest/ is one option; and 2. I show in this tutorial/notebook how to get plaintext perseus texts using requests/lxml: https://github.com/diyclassics/perseus-experiments/blob/master/Perseus%20Plaintext%20Poetry.ipynb Let me know if how these options work for you and I'll be sure to move up XML readers in my CLTK work queue.

jtauber commented 6 years ago

Let me help a little by adding an issue to Scaife to provide a plain text render of a passage directly on Perseus.

https://github.com/scaife-viewer/scaife-viewer/issues/213

markomanninen commented 6 years ago

Thanks for quick feedback. Ill try that notebook. Looks good to me. But could I use already imported corpora from my local machine? I mean cltk import corpora does it fine and I can spot files from my home directory. Problem is that xml or json format should be parsed and some interface would be helpful to retrieve chapters and verses. Anyway let me try.

PonteIneptique commented 6 years ago

Hi @markomanninen , some context for local corpora use : Capitains.org guidelines are used by Perseus and the OpenGreekAndLatin project to encode their text. The requirements for the xml TEI is quite small (Guidelines) and so if you are working with xml files from other providers, you could easily "convert" them for being read by Capitains tools.

Once you have those files, there is few way to deal with them :

Open them one by one : you can read and require specific passage from the files and export them to PlainText ( http://mycapytain.readthedocs.io/en/latest/MyCapytain.local.html for an example)
If you have compatible repositories (with metadata), you can make queries to them locally through the resolver ( Resolver API Documentation ) . I have unfortunately no good example of the use of a local resolver except for the tests ( Tests for local resolver ). I can point you to the rational behind them though ( Generic documentation about resolvers )

Finally, there has been a course using it in SunoikisisDC

markomanninen commented 6 years ago

Thanks @PonteIneptique It looks like MyCapytain requires lxml.etree version lower than 3.8.0 which are not compatible with my windows systems. I could try to install earlier version of mycapytain, or try other ways of parsing xml data...

PonteIneptique commented 6 years ago

@markomanninen I think it should work with <3.8.0 . Maybe we could move the conversation elsewhere but it seems there is a Windows wheel for 3.8.0 : https://pypi.python.org/pypi/lxml/3.8.0 ?

markomanninen commented 6 years ago

Yeah, this is another issue. My sys.version is:

3.5.4 |Continuum Analytics, Inc.| (default, Aug 14 2017, 13:41:13) [MSC v.1900 64 bit (AMD64)]

but from lxml import etree gives error. Only this one works:

import xml.etree.cElementTree as etree

which should be valid import for 2.5+ (ref: http://lxml.de/tutorial.html)

So to get mycapitain work in my system, I would need to fork and modify that module first...

markomanninen commented 6 years ago

I made a gist for parsing local file, this is very raw version, not using any xml parser, thus might have some issues:

https://gist.github.com/markomanninen/a68f200b4e98f018d7618dab0365ffe5#file-perseus_local_file_parser-py

swasheck commented 6 years ago

This is still an issue. Once the corpus has been downloaded, what can we do? It seems like primary value is to use all of the sets as training sets and then ... uhh ... analyze something ... somehow. Additionally, "latest" documentation is broken ("Concordance" doesn't work as there's no method call write_concordance_from_file() anymore, I guess?).

cltk / cltk

Docs for Accessing Corpora #615