CambridgeSemiticsLab / nena_corpus

The NENA corpus in plain-text markup

Creative Commons Attribution 4.0 International

Simplify Sly Parser #6

Closed codykingham closed 4 years ago

codykingham commented 4 years ago

Currently we use sly to parse .nena texts. But sly seems prone to superfluous error messages and sensitive to inconsistencies in the input. Would it be better to write a class that can ingest and validate .nena texts without sly? That would give us more control over what the parser should and should not choke on.

codykingham commented 4 years ago

This would also eliminate another dependency, which we should try to do wherever possible.

codykingham commented 4 years ago

We shouldn't get rid of Sly. It's a good solution. Instead, the Sly parser code should be updated and simplified.

codykingham commented 4 years ago

We have made good progress on the parser; see https://github.com/CambridgeSemiticsLab/nena_corpus/blob/master/parse_nena/NenaParser2.ipynb

Next steps to implement the parser include:

- Rewrite the steps in the ms doc converter so that all NENA texts are in the new standard markup.
- Test NenaParser2 on all documents and implement any necessary changes to the parser.
- Run NenaParser2 on all documents and store the resulting data in JSON format for use with Text-Fabric or other conversions.
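
The last two steps could be sketched roughly as follows. This is only an illustration under assumptions: `convert_corpus`, `placeholder_parse`, the flat directory of `.nena` files, and the one-JSON-file-per-text layout are all hypothetical, and `placeholder_parse` is a stand-in that does not reflect NenaParser2's actual interface or output structure.

```python
import json
from pathlib import Path
from typing import Callable


def convert_corpus(source_dir: Path, output_dir: Path,
                   parse: Callable[[str], dict]) -> dict:
    """Parse every .nena file in source_dir and write one JSON file per text.

    Returns a mapping of file name -> error message for files that failed,
    so problem documents can be reviewed and the parser adjusted.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    failures = {}
    for path in sorted(source_dir.glob("*.nena")):
        try:
            data = parse(path.read_text(encoding="utf-8"))
        except Exception as error:
            # Record the failure and keep going with the remaining texts.
            failures[path.name] = str(error)
            continue
        out_path = output_dir / (path.stem + ".json")
        # ensure_ascii=False keeps transcription characters readable in the JSON.
        out_path.write_text(
            json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
        )
    return failures


def placeholder_parse(text: str) -> dict:
    """Hypothetical stand-in for the real parser (NOT NenaParser2's API)."""
    if not text.strip():
        raise ValueError("empty text")
    return {"lines": text.splitlines()}
```

Collecting failures instead of stopping at the first one means a single pass over the corpus shows which documents still need markup fixes or parser changes.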

codykingham commented 4 years ago

The new parser is complete in 84453e577c9a63a22c61580ed4cf71f00307a47a. There may be some edge cases to account for in the future. We'll keep an eye on that and update the parser as needed. For now, the bulk of the code is in place.