Closed SigmaX closed 3 years ago
I totally agree. I had the same thoughts when I first used CLTK. My suggestion is to write a new tutorial that explains that intermediate part.
@SigmaX, thanks for starting this discussion. There is a lot of work that could be (and should be) done to make corpora easier to work with. This is why I wrote a PlaintextCorpusReader wrapper for the CLTK Latin Library corpus (basically, #2 from your list); cf. https://disiectamembra.wordpress.com/2016/08/11/working-with-the-latin-library-corpus-in-cltk/. I'll be sure to add this functionality to the docs.
There has been some discussion here of adding more wrappers like this, esp. XMLCorpusReader wrappers for the Perseus texts (cf. https://github.com/cltk/cltk/issues/554). If there is interest, I can revisit this. I'd be happy to hear which other corpora you would like better access to as well.
Also, my guess is that #3 from your list is the way CLTK is used for the most part. But in the interest of a self-contained NLP workflow, I think a better-defined pipeline from corpus/data to analysis would be worth pursuing.
@diyclassics The first thing I did is hack together my own (probably flawed) XML reader for Perseus, so I agree that providing that would be a useful feature!
IMO, though, 80% of the problem can be solved just by pointing out in the docs that new users will have to find a way to parse the corpora themselves. In my case, it took me longer to figure out that I needed to manually parse the corpus I was interested in than it did to actually parse it!
Thanks so much for providing these tools! It's an exciting time to be alive.
Thank you @SigmaX for raising this valid issue. The challenge with the "toolkit" idea is that things can get messy as more and more contributions come in (a good problem to have, I suppose :)).
For tutorials, we have a special repo, however it's never quite been polished enough that I wanted to push it in the docs: https://github.com/cltk/tutorials. Someone could make some notebooks illustrating how to put the pieces together.
Our Greek corpus reader uses a 3rd party tool (MyCapytain), which currently only works for Greek: http://docs.cltk.org/en/latest/greek.html#tei-xml.
> The first thing I did is hack together my own (probably flawed) XML reader for Perseus
We'd love to see it. Could you drop it in a gist, with an example of how to run it, so we can take a look?
@kylepjohnson: well, all I did was write a brittle loop that pulled out the text inside every `<p>` or `<q>` element, which seemed adequate for the specific text I was working with (Perseus' Meditations). The TEI DTD is very intricate, however, so it would take some work to tell exactly what is needed to generalize accurately to arbitrary corpora.
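For readers following along, a loop of the kind described above might look roughly like the sketch below. This is not the code from the comment; the sample document, `local_name`, and `paragraphs_and_quotes` are hypothetical names invented for illustration, and the sample is far simpler than a real Perseus TEI file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a TEI file.
SAMPLE = """<TEI.2>
  <teiHeader><title>Sample</title></teiHeader>
  <text><body>
    <p>First paragraph.</p>
    <p>He said: <q>a quoted remark</q> and went on.</p>
  </body></text>
</TEI.2>"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def paragraphs_and_quotes(root, wanted=("p", "q")):
    """Yield the whitespace-normalized text of every <p> or <q> element."""
    for elem in root.iter():
        if local_name(elem.tag) in wanted:
            text = " ".join("".join(elem.itertext()).split())
            if text:
                yield text


for chunk in paragraphs_and_quotes(ET.fromstring(SAMPLE)):
    print(chunk)
```

Note that the nested `<q>` is emitted twice here, once as part of its parent `<p>` and once on its own, which is one concrete way such a loop is brittle.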
I also took a stab at getting NLTK's `XMLCorpusReader` to work.
It reads the `text` attribute from every tag (which seemed reasonable), but in ElementTree's (somewhat strange) DOM interpretation, `text` turns up empty or incomplete if empty tags occur inside the text. So I modified `XMLCorpusReader` to also pull ElementTree's `tail` attribute; this way it really does extract all text from every node.
`XMLCorpusReader` croaks on entity references defined in the TEI DTD. Trying to load `~/cltk_data/greek/text/greek_text_perseus/Epictetus/opensource/epictetus_gk.xml`, for instance, yields `ParseError: undefined entity &responsibility;: line 13, column 0`.
Of course, you've already noted (#554) that the `XMLCorpusReader` strategy, when it does work, pulls unwanted metadata anyway. And now that I realize Capitains/MyCapytain exists (#560), I'll probably try going that route next.
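The two ElementTree behaviours mentioned in this comment can be reproduced with the standard library alone. The sketch below is only an illustration under assumptions: the sample strings are made up, and `drop_unknown_entities` is a hypothetical, lossy workaround (it simply deletes named entities that XML does not predefine, discarding whatever the TEI DTD would have substituted), not something CLTK or NLTK provides.

```python
import re
import xml.etree.ElementTree as ET

# 1. text vs. tail: text that follows an empty child tag lives in the
#    child's .tail, not in the parent's .text.
elem = ET.fromstring("<p>before <pb/>after the page break</p>")
print(repr(elem.text))           # 'before '
print(repr(elem[0].tail))        # 'after the page break'
print("".join(elem.itertext()))  # before after the page break

# 2. Undefined entities: named references that only the DTD defines make
#    the parser raise ParseError when the DTD is not loaded.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def drop_unknown_entities(xml_text):
    """Delete named entity references that XML does not predefine (lossy)."""
    def repl(match):
        return match.group(0) if match.group(1) in PREDEFINED else ""
    return re.sub(r"&([A-Za-z_][\w.-]*);", repl, xml_text)


broken = "<p>&responsibility; Text &amp; more</p>"
try:
    ET.fromstring(broken)
except ET.ParseError as exc:
    print("parse failed:", exc)

cleaned = ET.fromstring(drop_unknown_entities(broken))
print(repr(cleaned.text))
```

`itertext()` walks both `text` and `tail` in document order, so it covers the case that motivated the `tail` modification described above.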
> And now that I realize Capitains/MyCapytain exists (#560), I'll probably try going that route next.
Because of the complexity of TEI, I think this is probably the best thing to do.
Since you're clearly an able coder, we'd be interested in seeing how you solve this one, even if you don't think the code is production-ready. And of course we're happy to help with any issues you want an extra pair of eyes on.
Should we close this?
> Should we close this?
Personally I'd leave it open until there is at least a sentence in the docs either pointing to a tutorial or saying "you need to figure out how to import the corpora yourself" or such.
Sure. I'll take care of this and post back here for people to comment on :)
The helper code that Eldarion is developing on top of MyCapytain for the new Perseus will likely help with this. It will hopefully be open source in the next month or so.
Could someone point to a tutorial where the Greek Perseus corpus is used to read a text, say Homer's Iliad, in a plain readable format?
@markomanninen—Perseus reader is still an open issue (#361, e.g.).
There are some ways to go about it outside of CLTK though: 1. MyCapytain (cc: @PonteIneptique) http://mycapytain.readthedocs.io/en/latest/ is one option; and 2. I show in this tutorial/notebook how to get plaintext Perseus texts using requests/lxml: https://github.com/diyclassics/perseus-experiments/blob/master/Perseus%20Plaintext%20Poetry.ipynb Let me know how these options work for you and I'll be sure to move up XML readers in my CLTK work queue.
Let me help a little by adding an issue to Scaife to provide a plain text render of a passage directly on Perseus.
Thanks for the quick feedback. I'll try that notebook; it looks good to me. But could I use the corpora already imported on my local machine? I mean, the CLTK corpus import works fine and I can spot the files in my home directory. The problem is that the XML or JSON format has to be parsed, and some interface would be helpful to retrieve chapters and verses. Anyway, let me try.
Hi @markomanninen, some context for local corpora use: Capitains.org guidelines are used by Perseus and the OpenGreekAndLatin project to encode their texts. The requirements for the TEI XML are quite small (Guidelines), so if you are working with XML files from other providers, you could easily "convert" them to be read by Capitains tools.
Once you have those files, there are a few ways to deal with them:
Finally, there has been a course using it in SunoikisisDC.
Thanks @PonteIneptique. It looks like MyCapytain requires an lxml.etree version lower than 3.8.0, which is not compatible with my Windows system. I could try to install an earlier version of MyCapytain, or try other ways of parsing the XML data...
@markomanninen I think it should work with <3.8.0. Maybe we could move the conversation elsewhere, but it seems there is a Windows wheel for 3.8.0: https://pypi.python.org/pypi/lxml/3.8.0 ?
Yeah, this is another issue. My `sys.version` is:
3.5.4 |Continuum Analytics, Inc.| (default, Aug 14 2017, 13:41:13) [MSC v.1900 64 bit (AMD64)]
but `from lxml import etree` gives an error. Only this one works: `import xml.etree.cElementTree as etree`, which should be a valid import for Python 2.5+ (ref: http://lxml.de/tutorial.html).
So to get MyCapytain to work on my system, I would need to fork and modify that module first...
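The fallback described here is usually written as a guarded import. The snippet below is a minimal sketch of that pattern only; it does not make MyCapytain itself work without lxml. It uses `xml.etree.ElementTree` rather than `cElementTree`, since the latter was deprecated in Python 3.3 and removed in 3.9. The sample element is made up.

```python
try:
    from lxml import etree  # preferred when a working build is installed
except ImportError:
    # Standard-library fallback; covers only the API subset shared with lxml.
    import xml.etree.ElementTree as etree

root = etree.fromstring("<l n='1'>sample line</l>")
print(root.get("n"), root.text)
```

Code written against this alias has to stay within the API common to both libraries (for example `fromstring`, `iter`, `get`, `text`, `tail`); lxml-only features such as full XPath are not available in the fallback.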
I made a gist for parsing a local file. This is a very raw version that does not use any XML parser, so it might have some issues:
This is still an issue. Once the corpus has been downloaded, what can we do? It seems like the primary value is to use all of the sets as training sets and then ... uhh ... analyze something ... somehow. Additionally, the "latest" documentation is broken ("Concordance" doesn't work, as there's no `write_concordance_from_file()` method anymore, I guess?).
Either I'm missing something obvious (which is likely), or CLTK offers no documentation on how to use the various corpora the project provides.
After importing the `greek_text_perseus` corpus, for example, its `README.md` tells me
The docs, however, only cover how to download corpora and how to process raw text stored in a Python variable, respectively, omitting the intermediate steps. There is no mention of how one might `import` a corpus after downloading it (which, I see from this external blog, seems to be a thing?), or how one might otherwise get ahold of a `CorpusReader` object (assuming such a thing exists, which is not clear from the docs).
From all this I infer that it seems we are intended to
Am I correct in piecing together this puzzle? If so, I haven't seen such a scheme spelled out anywhere in the docs. Perhaps I am blind to something?
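As a starting point for the missing intermediate step, the downloaded files can at least be located with the standard library. The sketch below assumes the corpus sits under `~/cltk_data/greek/text/greek_text_perseus`, the location implied by the Epictetus path quoted elsewhere in this thread; `corpus_xml_files` is a hypothetical helper, not a CLTK function.

```python
from pathlib import Path


def corpus_xml_files(corpus_dir):
    """Return every .xml file below corpus_dir, sorted, or [] if it is missing."""
    corpus_dir = Path(corpus_dir).expanduser()
    if not corpus_dir.is_dir():
        return []
    return sorted(corpus_dir.rglob("*.xml"))


# Assumed location, taken from the path quoted in this thread; adjust if yours differs.
PERSEUS_GREEK = "~/cltk_data/greek/text/greek_text_perseus"

for path in corpus_xml_files(PERSEUS_GREEK)[:5]:
    print(path)
```

Each path can then be handed to whichever parser you settle on (ElementTree, lxml, or MyCapytain), which is the part the docs currently leave to the user.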