Closed lsh-0 closed 10 months ago
some patterns emerging:
rawJATS
should probably be entirely lazy.extract_nodes
should probably be lazy everywhereif extract(thing); then for each in extract(thing): ...
Efficiency through the roof!
yeah, I'm pretty pleased ;)
the elifetools tests are all passing, the bot-lax tests are all passing, the md5 guinea pig tests are passing and the fact that it's taking the same amount of time to validate the results gives me confidence that the article-json emitted hasn't changed ... buuuut I think I want to figure out some way of ensuring that the data is still correct. Not sure how though. A hash check by revision? Some revisions would cause a change in the output, but most article-json would stay the same most of the time.
Good to be cautious, although the unitests for this library are very comprehensive, and it would be hard to lose data while staying green.
Definitely comprehensive. bot-lax tests are less comprehensive but I know there are full VOR article comparisons in it which would be hard to trick.
I'm thinking a separate ad-hoc Jenkins test where we could specify two revisions and it would generate a set of article-json for both and then compare them in different ways. I'll think about this some more.
... and then compare them in different ways. I'll think about this some more.
I think start off with a very narrow focus. In this case we expect the output from both revisions to be perfectly equal. We can do fuzzier comparisons later if needed.
ok @gnott , could I get you to look at the (massive number of) changes and let me know what you think?
just a guide to the types of changes made:
If a function in rawJATS needed a lazy version because something else was accessing just a bit of the results, then it was rewritten and the old function simply 'realises' the whole lazy sequence by calling list(lazy_version_of_fn(soup, ...))
. This means no logic is duplicated and all of the tests you've written continue to cover the same code. It also means I don't have to write new tests.
Some functions were switched without a lazy version because they were always accessing just the first element, or returning the first of a filtered list etc. Functions like rawJATS.article_title
and rawJATS.meta_article_id
.
There are a couple of places where we may still benefit from going lazy, but these are large functions with a lot of logic and should probably happen in their own separate PR. Things like json_rewrite.py
and parseJATS.render_raw_body
and parseJATS.body_block_render_content
.
Oh! and I had to rename parseJATS.list
to parseJATS.get_list
. It's a breaking change to the interface but I couldn't find anything using it.
Perhaps get_list
isn't so good, either. list_
or something else may be more obvious
ha! - on the contrary - I started mucking about with elifetools because I was procrastinating about other work ;) I have to rein myself in from having too much fun.
I'm fixing up one small improvement with remove_doi_paragraph
now.
Should we ever hit a major version number like 1.0 or 2.0, we might get rid of the separation between lazy and eager functions and make them all lazy. For now I think I've hit a good compromise preserving the library interface by just calling list(...)
on their lazy implementation but something to think about.
thanks for your review, @gnott , it's appreciated.
fingers crossed, merging in now.
introduces 'lazy' versions of existing functions that yield values as needed instead of all at once.
bot-lax PR using this branch: https://github.com/elifesciences/bot-lax-adaptor/pull/1009