elifesciences / elife-tools

Python library for parsing eLife article XML data.
MIT License
15 stars 7 forks source link

laziness #435

Closed lsh-0 closed 10 months ago

lsh-0 commented 10 months ago

introduces 'lazy' versions of existing functions that yield values as needed instead of all at once.

bot-lax PR using this branch: https://github.com/elifesciences/bot-lax-adaptor/pull/1009

lsh-0 commented 10 months ago

some patterns emerging:

gnott commented 10 months ago

Efficiency through the roof!

lsh-0 commented 10 months ago

yeah, I'm pretty pleased ;)

Screenshot at 2023-11-24 13-38-29

lsh-0 commented 10 months ago

the elifetools tests are all passing, the bot-lax tests are all passing, the md5 guinea pig tests are passing and the fact that it's taking the same amount of time to validate the results gives me confidence that the article-json emitted hasn't changed ... buuuut I think I want to figure out some way of ensuring that the data is still correct. Not sure how though. A hash check by revision? Some revisions would cause a change in the output, but most article-json would stay the same most of the time.

gnott commented 10 months ago

Good to be cautious, although the unitests for this library are very comprehensive, and it would be hard to lose data while staying green.

lsh-0 commented 10 months ago

Definitely comprehensive. bot-lax tests are less comprehensive but I know there are full VOR article comparisons in it which would be hard to trick.

I'm thinking a separate ad-hoc Jenkins test where we could specify two revisions and it would generate a set of article-json for both and then compare them in different ways. I'll think about this some more.

lsh-0 commented 10 months ago

... and then compare them in different ways. I'll think about this some more.

I think start off with a very narrow focus. In this case we expect the output from both revisions to be perfectly equal. We can do fuzzier comparisons later if needed.

lsh-0 commented 10 months ago

ok @gnott , could I get you to look at the (massive number of) changes and let me know what you think?

just a guide to the types of changes made:

If a function in rawJATS needed a lazy version because something else was accessing just a bit of the results, then it was rewritten and the old function simply 'realises' the whole lazy sequence by calling list(lazy_version_of_fn(soup, ...)). This means no logic is duplicated and all of the tests you've written continue to cover the same code. It also means I don't have to write new tests.

Some functions were switched without a lazy version because they were always accessing just the first element, or returning the first of a filtered list etc. Functions like rawJATS.article_title and rawJATS.meta_article_id.

There are a couple of places where we may still benefit from going lazy, but these are large functions with a lot of logic and should probably happen in their own separate PR. Things like json_rewrite.py and parseJATS.render_raw_body and parseJATS.body_block_render_content.

lsh-0 commented 10 months ago

Oh! and I had to rename parseJATS.list to parseJATS.get_list. It's a breaking change to the interface but I couldn't find anything using it.

Perhaps get_list isn't so good, either. list_ or something else may be more obvious

lsh-0 commented 10 months ago

ha! - on the contrary - I started mucking about with elifetools because I was procrastinating about other work ;) I have to rein myself in from having too much fun.

I'm fixing up one small improvement with remove_doi_paragraph now.

Should we ever hit a major version number like 1.0 or 2.0, we might get rid of the separation between lazy and eager functions and make them all lazy. For now I think I've hit a good compromise preserving the library interface by just calling list(...) on their lazy implementation but something to think about.

lsh-0 commented 10 months ago

thanks for your review, @gnott , it's appreciated.

fingers crossed, merging in now.