JATS4R / JATS4R-Participant-Hub

The hub for all JATS4R meeting notes, examples, draft recommendations, documents, and issues.
http://jats4r.org

Automating the collection of examples #94

Closed Daniel-Mietchen closed 6 years ago

Daniel-Mietchen commented 9 years ago

So far, most of the examples we have discussed have been identified manually. I am thinking about a systematic approach to collecting examples for sets of tags that we consider.

One way to go about that would be to mine PMC's OA Subset (which can be downloaded in bulk) for uses of specific tags and to condense that (perhaps along with any manually provided examples from outside PMC's OA Subset) into some basic usage patterns (think tag-level dialects) that we could use as a basis for discussing best practices and distilling recommendations.
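The "tag-level dialect" idea could be sketched roughly like this: for each element in a document, record its name together with the set of attributes it carries, and tally how often each combination occurs across the corpus. This is only an illustrative sketch, not an agreed approach; the function name and the use of Python's standard-library `xml.etree.ElementTree` are my own assumptions.

```python
# Sketch: condense a JATS XML document into per-element "dialects",
# i.e. (element name, sorted attribute names) pairs with counts.
# Aggregating these Counters across the OA Subset would give the
# usage-pattern stats discussed above. Hypothetical code, not part
# of any JATS4R tooling.
import xml.etree.ElementTree as ET
from collections import Counter

def element_dialects(xml_source):
    """Count (tag, attribute-set) combinations in one document."""
    dialects = Counter()
    # iterparse streams the file, so large documents stay cheap
    for _, elem in ET.iterparse(xml_source, events=("start",)):
        dialects[(elem.tag, tuple(sorted(elem.attrib)))] += 1
    return dialects
```

Summing the resulting `Counter` objects over all files (e.g. `sum(maps, Counter())`) would yield corpus-wide frequencies that could be re-run on each new OA Subset release to track change over time.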

I can think of a number of effects that this may have:

  1. If we compile these stats on a regular basis, we can track the evolution of tagging patterns and use that to
    • monitor uptake of JATS4R recommendations
    • identify cases where JATS4R recommendations may be useful or in need of revision
  2. It would be possible to create mappings between certain dialects and the JATS4R recommendations for the corresponding tags, such that non-compliant articles could be more easily rendered, analyzed or otherwise used than they can now. Some of these mappings could possibly be crowdsourced (e.g. image-only formulas might be transcribable using CAPTCHA-like mechanisms in places frequented by TeX-savvy users).
  3. The error messages in our schematrons could then point to those tag-level dialects and our accompanying annotations as to why they are compliant with our recommendations or not. This would help inform and educate about tagging standards in general and JATS4R in particular.

I have started to explore this, but the tools I know are not well suited to this kind of analysis on such a corpus (I am running a grep overnight!), so I would welcome your ideas in this regard.
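As an XML-aware alternative to grep, something like the following could pull well-formed example snippets of a given element out of a directory of downloaded files. This is a hypothetical sketch using Python's standard library; the `collect_examples` name, the `contrib` target tag, and the file layout are all placeholders of mine, not anything the corpus prescribes.

```python
# Sketch: harvest example snippets of a target element from JATS
# files, staying aware of XML structure (unlike a line-based grep).
# Hypothetical helper, not an existing JATS4R tool.
import glob
import xml.etree.ElementTree as ET

def collect_examples(paths, tag, limit=5):
    """Return up to `limit` serialized occurrences of `tag`."""
    examples = []
    for path in paths:
        root = ET.parse(path).getroot()
        for elem in root.iter(tag):
            examples.append(ET.tostring(elem, encoding="unicode"))
            if len(examples) >= limit:
                return examples
    return examples

# e.g.: collect_examples(glob.glob("oa_subset/**/*.nxml", recursive=True),
#                        "contrib")
```

For a bulk run over the whole OA Subset, a streaming parse (`iterparse`) would keep memory flat, but the per-file version above is easier to follow for a sketch.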

Melissa37 commented 9 years ago

That sounds brilliant

jats-laura commented 9 years ago

> 1. If we compile these stats on a regular basis, we can track the evolution of tagging patterns and use that to
>   • monitor uptake of JATS4R recommendations
>   • identify cases where JATS4R recommendations may be useful or in need of revision

I don't think this is quite accurate. What you get from the PMC OA subset is what PMC normalized. Nothing in that subset is as it was delivered by the publisher...nothing. PMC converts every single XML document it receives to comply with PMC style...even those submitted to us in the JATS DTD.

If the recommendations are for a part of the document that PMC doesn't need to standardize for archiving or display purposes, then sure, we'll pass it through unchanged and you can monitor the uptake. But so far, I haven't seen much that PMC wouldn't make some effort to standardize in our output, so I think all you'll really be monitoring is the degree of PMC's uptake.

Daniel-Mietchen commented 9 years ago

Good points, Laura. What about making more of those normalization steps (and the accompanying tools) public? At least for XML supplied by CC BY publishers, this would seem possible.

hubgit commented 9 years ago

If you can get it to run (it's 9 years old), Stefano Mazzocchi's Gadget is a nice tool for analysing elements/attributes and their contents in large quantities of XML (e.g. everything in the PMC OA Subset).