Closed wissen-ist-acht closed 2 years ago
Nice approach, although I am pretty sceptical of the implementation of unit testing (separate post on that below). Really like the code to append tags in in append_tags, I will rewrite some of the fraternal_twin detection code with that in mind ;)
Some observation/problems:
in apend_tags, you do not need to return the collection. As an environment, it is changed regardless if you return it or not. Try it ;)
If you append the tag to a record that had no tags before ("NA"), resulting taglist will look like
"tags": ["NA", "St. Peter"]
Wrapping unique(c(vOldTags,vNewTags) in na.omit() likely does the trick (that kills all NAs even if no tag is appended, but collection should still work as intended, and in writing it, NAs are added again, anyway.
identify_tags will throw an error if the collection has been read with meta data only (as it requires the corpus). I suggest to insert some code like in select_by_text() in filtering.R, which also stops the function if "collection" is not a collection after all. That code also ads the functionality of applying the approach only to a subset of ids, not the whole collection, in the same way we do with the select_by family (i.e., you give a list of ids, or "all" if you want the whole collection). Basically that bit here:
select_by_text <- function(ids = null, coll, searchlist = ""){ stopifnot(inherits(coll, "Collection")) stopifnot(inherits(coll, "R6")) if(length(ids)==1){if(ids=="all"){ids <- coll$meta$id}} if(is.null(coll$corpus)){ stop("Collection has been read with meta info only. Use just_meta = FALSE in read_collections/gather_collections to be able to search in texts") } else{
Do you think the append_tags() function might be used outside the tag_by_regex wrapper? If no, all fine. If yes, maybe add some code to append_tags that throws an error if taglist is not in a proper format? More of a bonus, really ;)
On that unit testing... points that bug me with that implementation here:
The way you create your test collection takes an extraordinary amount of time, and does so while loading the package. There is no need to create a test collection of four records by loading all 930,000 (which takes several minutes!) and subset it.
The test assumes that the four test records are stable. But if we rebuild the collections, maybe from changed page xml or with changes in OCR corrections, text may change, or worst case, any of your four records might be deleted. In my view, the test cases have to be independent form the current yearly avis-data collection.
At least for code on the productive branch, if there is unit testing, it has to be in a coordinated way. Likely starting with a fixed, seperate folder for any test collections / cases (and I think on the avisblatt repo, not in the data repos which change independently), a proper way of naming them, and some agreement how the testing code is executed. Maybe even a joint script for all unit testing? -> If each script in avisblatt.R loads/creates some collections and runs stuff on it, the package gets cluttered with test collections, and package loading might take forever. Here it took me 4-5 minutes, and memory usage in the end had went up to 1.81 GB...
I would suggest that before implementing anything like this, we discuss unit testing in general, preferably also with @mbannert. In other words, on my behalf no prob to merge the new functions to the productive branch once all quirks are removed, but for now, it should be without the unit testing code.
on the unit testing: yes, the very long creation process (and the other pitfalls you mentioned) bugged me as well, and I agree with the proposals - and yes, let's bug @mbannert with this :)
in any case i'd suggest a unit testing framework like testthat
or tinytest
. Check out the {usethis} boilerplate package for this.
tag_by_regex should work now, with relative paths.