comphist / cora

A web-based, token-level annotation tool for non-standard language data
http://www.linguistics.rub.de/comphist/resources/cora/
MIT License

Refactor CoraDocument #40

Open mbollmann opened 9 years ago

mbollmann commented 9 years ago

Originally reported by: Marcel Bollmann (Bitbucket: mbollmann, GitHub: mbollmann)


The CoraDocument class is supposed to be an internal (PHP) representation of a CorA document. It's also currently implemented in a completely insane way.

How it's currently implemented and used

CoraDocument is used for import and export. For import, a CoraDocument instance is created (e.g. by XMLHandler) and fed into DocumentCreator, which performs the actual database import. For export, CoraDocument has a static method "fromDB" to create an instance based on a document in the DB, which is then used by Exporter to generate a file.

The insanity lies in the representation of structure. Elements in CoraDocument point to each other either via "db_id" or via "xml_id", depending on where the data came from. On top of that, the "db_id -> xml_id" conversion logic is provided by XMLHandler, while the "xml_id -> db_id" logic is implemented in CoraDocument itself.

Why it should be changed

By using "db_id"/"xml_id", the structural representation is not independent of the storage format. CoraDocument would be most useful if it were independent of any storage format. In particular, it shouldn't even have to know about such a thing as XML IDs.

The current implementation prevents us, for example, from easily cloning a document in the database. Ideally, you would instantiate a CoraDocument from the database and feed it back to DocumentCreator to get a new copy of it in the database. This is not possible, though: the CoraDocument contains lots of references to database IDs of the old document, which of course mustn't be used when inserting it as a new document, and there is no way to recreate the database IDs without using "xml_id" or destroying the structural information.
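
To make the cloning problem concrete, here is a minimal sketch in Python (the real class is PHP; Python is used only to keep the illustration short). Everything in it is an assumption made for illustration: the field names "db_id" and "line_db_id", the two-level line/token layout, and the helpers load_document and insert_copy are hypothetical and do not reflect CorA's actual schema or API.

```python
from itertools import count

# Hypothetical in-memory "database": new rows always receive fresh IDs.
_next_id = count(start=100)

def load_document():
    """Pretend export: elements carry the database IDs of the stored document."""
    lines = [{"db_id": 1}, {"db_id": 2}]
    tokens = [
        {"db_id": 10, "line_db_id": 1, "text": "a"},
        {"db_id": 11, "line_db_id": 2, "text": "b"},
    ]
    return {"lines": lines, "tokens": tokens}

def insert_copy(doc):
    """Naive re-import: assigns fresh IDs but copies cross-references unchanged."""
    new_lines = [{"db_id": next(_next_id)} for _ in doc["lines"]]
    new_tokens = [
        {"db_id": next(_next_id), "line_db_id": t["line_db_id"], "text": t["text"]}
        for t in doc["tokens"]
    ]
    return {"lines": new_lines, "tokens": new_tokens}

copy = insert_copy(load_document())

# Tokens in the copy still point at line IDs of the *old* document.
new_line_ids = {line["db_id"] for line in copy["lines"]}
dangling = [t for t in copy["tokens"] if t["line_db_id"] not in new_line_ids]
```

In this toy setup every token of the copy ends up in "dangling", because the references were expressed as IDs of the original rows rather than as something that survives re-insertion.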

What could be done

Find a way to represent a CoraDocument without any references to database IDs or XML IDs. Elements should refer to each other via neutral IDs, or simply via array indices. Mappings from these elements to IDs in the database or XML should be kept separate from the actual data. Conversion logic should be moved completely into XMLHandler, if possible.
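
One possible shape for this, sketched in Python rather than PHP purely for brevity: elements refer to each other by array index, and storage-specific IDs live in a separate mapping object. The names Token, Document, IdMapping and store are hypothetical, as is the two-level line/token layout; this is an illustration of the idea, not a proposal for the actual class layout.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Token:
    text: str
    line: int  # index into Document.lines, not a database or XML ID

@dataclass
class Document:
    lines: list = field(default_factory=list)   # e.g. line labels
    tokens: list = field(default_factory=list)  # Token instances

@dataclass
class IdMapping:
    """Storage-specific IDs, kept apart from the document data."""
    line_ids: dict = field(default_factory=dict)   # line index -> external ID
    token_ids: dict = field(default_factory=dict)  # token index -> external ID

def store(doc, next_id):
    """Hypothetical importer: assigns fresh IDs and returns them as a mapping."""
    mapping = IdMapping()
    for i in range(len(doc.lines)):
        mapping.line_ids[i] = next_id()
    for i in range(len(doc.tokens)):
        mapping.token_ids[i] = next_id()
    return mapping

# The same document can be stored twice; only the mappings differ.
doc = Document(lines=["1", "2"], tokens=[Token("a", 0), Token("b", 1)])
ids = count(start=1)
first = store(doc, lambda: next(ids))
second = store(doc, lambda: next(ids))

# Resolving a token's line in the second copy goes through the mapping.
line_id_of_last_token = second.line_ids[doc.tokens[1].line]
```

Because the document itself never holds a database or XML ID, cloning reduces to storing the same object again, and an XML handler could keep its own IdMapping for XML IDs without the document knowing about it.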