The big merge - Githubissues

KCMertens commented 2 years ago

There are a lot of changes in this commit, but most are just noise due to some added type checking.

The database has had a refactor.
- Remove the needs_refac, needs_resave and needs_refresh columns, added needs_update as a more generic column
- Add flag column. Defaults to empty string. The flag is still kept in the XML as well (because of interop with old dictionaries and because this way we can ectract existing flags when users upload a new dictionary - assuming they conform to our way of doing things)
Entry creation/update This is completely rewritten. The core entry point is now ops.createEntry(), this is always used, whenever an entry is created or updated. With the help of some util functions the entry is completely validated and resaved every time (because the user can do whatever they want to the xml, even completely change it). This also makes it so that whenever any config has changed we just re-run this function for every entry and everything out of date property fixes itself.
Subentries The subentry system has had some changes:
- Previously:
- The subentry would be a complete instance, meaning all the subentry's xml would be contained in every parent entry.
- On load by the client, the subentry was surrounded with , which the client detected.
- On save: the was removed again and the subentry copied and saved in the database. Additionally all other parents of the subentry were updated with the subentry's new xml. This had some limitations, most importantly all xml extraction functions (for title, sortkey, flag, etc) would also match inside subentries, because they were part of the same xml tree.

The new situation:
- XML of subentries is now removed from the parentEntry, and instead a single reference/link placeholder element is inserted before it is saved to the database. I've chosen to re-use for this. The id attributes points to a regular entry in the database that contains the subentry. There is no difference in how a subentry is treated from a normal entry otherwise.
- Subentries can be edited separately, and the parents will never need an update unless the subentry is completely deleted.
- The client needs an update to retrieve the subentry's xml separately
- It is also possible to specify that the element needs a certain attribute to be considered a subentry. This was required to support newer formats such as TEI-lex0, where a subentry is just an <entry type="relatedEntry>, which we otherwise can't differentiate from a normal . This can be configured in the subbing page. It's possible to require just that a certain attribute exists, or require that the attribute exist and also has a specific value.

Flagging: Flags are now also put in a database column. For the rest, there are no changes, flags are still kept in sync with the entry's XML, but this removes the need to parse every entry when loading the entry list.
Migration: There is a migration script website/migrate.py which will migrate every dictionary's database schema. Dictionaries that use one of the rewritten features (flagging, subentries) will have every entry marked as needs_update, which will run when they are first loaded. NOTE: This is unfortunate, but the first load of the dictionary will be tremendously slow as all entries might need to be updated before the entrylist can be shown.
Other notes: I've attempted to consolidate most of the entry functions. ops.searchEntries() can search entries by their headword/searchables. It only uses and returns the IDs and Sortkeys, so results can be further used as needed. ops.sortEntries() can sort the results of searchEntries based on collator and reverse settings ops.readEntries() actually reads and processes the entries. It has some flags to retrieve xml, parse xml, extract plaintext title, or run xslt/html conversion (because those are heavy operations and you might not always need their result). The returned results are ready to send to the frontend. It also retrieves the entry's child- and parentEntry IDs, so returns their subentries, or - when it is a subentry itself - where it is used.

XML parsing is a headache - especially when cutting up larger documents, and namespace issues crop up because the declaration is lost. To that end there is now ops.parse() which parses xml in a straightforward manner - it pretty much ignores namespaces, and just treats a namespace as if the element or attribute was literally called "someNamespace:someElement". This means that to match inside the result, you need to specify the namespace prefix inside the attribute/element name.

Importing a dictionary is also mostly rewritten because it could not handle elements being nested inside themselved, as the regex would match up wrong. Ex: ` ...

` would previously match up the opening tag with the subentry closing tag and horribly crash because of unbalanced tags.

I've also attempted to include type information as much as possible, and codified most of the configs used in the back-end. This requires Python 3.8, but I very strongly think the developer quality-of-life we get back was worth it.

mjakubicek commented 2 years ago

I'm afraid this is very, very far from being ready to be merged. If this should ever get merged, it needs to be dismantled into individual atomic meaningful commits. There are lot of changes that you do not mention in the PR description, such as those in the import script (which by the way overwrite some recent changes in an incompatible way), each independent change must be a separate commit.

KCMertens commented 2 years ago

Alright, with some caveats, that can happen.

It definitely did not start its life as a big bulk commit, but the hole grew deep indeed as the main branch diverged from my work, and I should have been much more on the ball there.

I'm going to take some time to think about the best approach to split out individual parts without having to patch the surrounding lexonomy code too much during intermediate commits.

On Tue, May 24, 2022, 21:08 Milos Jakubicek @.***> wrote:

I'm afraid this is very, very far from being ready to be merged. If this should ever get merged, it needs to be dismantled into individual atomic meaningful commits. There are lot of changes that you do not mention in the PR description, such as those in the import script (which by the way overwrite some recent changes in an incompatible way), each independent change must be a separate commit.

— Reply to this email directly, view it on GitHub https://github.com/elexis-eu/lexonomy/pull/280#issuecomment-1136276906, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTR3N3G4SIXCOWFS3QYYCTVLULI7ANCNFSM5WZ5VRDQ . You are receiving this because you authored the thread.Message ID: @.***>

iztokkosem commented 2 years ago

which by the way overwrite some recent changes in an incompatible way

Miloš, can I just ask about these recent changes? Made by whom? Adam, Koen and Matic are now coordinating their efforts, so I would assume Koen would be aware of such changes.

rambousek commented 2 years ago

which by the way overwrite some recent changes in an incompatible way

Miloš, can I just ask about these recent changes? Made by whom? Adam, Koen and Matic are now coordinating their efforts, so I would assume Koen would be aware of such changes.

I was working with Tomas on bug fixes and new features in the new GUI and the frontend code was rewritten a lot in past month. Matic and Koen were working with month or two version as a base. So this update is overwriting changes done in recent weeks. That is the reason why separate commit is needed for each change - to combine everything without losing work done.

If we want to include this big update fast, it should be created as a new branch.

mjakubicek commented 2 years ago

which by the way overwrite some recent changes in an incompatible way

Miloš, can I just ask about these recent changes? Made by whom? Adam, Koen and Matic are now coordinating their efforts, so I would assume Koen would be aware of such changes.

I'm sure he is/was aware of them (receiving the commits being pushed to the main branch) since the PR was rebased against those and overwrites them. (What I checked was just what I did: a few bugfixes from the last month, such as fixing the import script where history was never purged, so reuploading data caused the dictionary size to explode after a few times even when users chose to purge -- this e.g. was 8412622)

This is a minor issue though and easy to be fixed, certainly compared to the fact that what should be likely some 300 commits are just 6 commits changing almost 4,000 lines of code.

KCMertens commented 2 years ago

Adam, because it needed to be rebased, I had to perform a force-push. There is a backup of the branch's previous state here in INL/lexonomy/

elexis-eu / lexonomy

The big merge #280