PerseusDL / canonical

This will be the base repo for all text and annotation data published in the PDL

Timeline question #82

Closed kylepjohnson closed 9 years ago

kylepjohnson commented 9 years ago

Hi,

Some of us working on the CLTK (namely @smargh) have been refactoring our corpus import API and are very interested in changing out the old Perseus corpora for the new ones available here in canonical. We are especially excited about your mapping of authors to their TLG equivalents and, of course, the Unicode texts.

What you have so far looks great, but we are wondering about their timeline. When do you plan on reaching near parity with the website?

If you have any feedback for the CLTK, which is keen on bringing NLP support to the Perseus texts, please let us know. Thanks!

lcerrato commented 9 years ago

Hi, sorry for the delay in replying. The short answer is that the timeline is not yet determined.

This phase of work concerns the less complex texts (public domain, single-work files, straightforward XML tagset). We are working on conversion to EpiDoc/CTS compatibility and fixing issues with the Unicode conversion, etc. We're also assigning canonical names to the works, as you noticed.

Many works currently in the Perseus corpus are composite works (multiple works in one text file) that will need to be split. There are also copyrighted works (although this mainly pertains to English translations) and more complex works, such as the lexica and grammars, that will require more attention.

We're working on more detailed documentation to outline the current work plan and update the user community on what's happening.

kylepjohnson commented 9 years ago

Thank you, Lisa. We'll keep an eye on how things develop over here. In the meantime, if there is any way that the CLTK can help, your team should feel free to get in touch.

Kyle
