Reading data produced using a11ytables

jack-davison commented 1 year ago

Hi Matt,

Thanks very much for this package. I saw your talk at EARL2022 on {a11ytables} but didn't get the opportunity to ask the following.

We often use UK government statistics in the format described and, while the format is accessible in a spreadsheet, it can be painful to read it into R for analysis (data doesn't always start on the same line, there can be multiple independent data tables per sheet, multiple sheets per file, etc.). Often a more code-friendly/tidy format (e.g., a simple csv) just isn't available, so we've had to DIY solutions to iterate over different sheets, detect where the data is when there are multiple tables per sheet, and so on.

Is it on the roadmap to write a read_allytable() function to do the inverse of the current package functionality, i.e., take an accessible spreadsheet saved locally and turn it back into a list of tidy tibbles in R?

Cheers, Jack

matt-dray commented 1 year ago

Thanks for getting in touch, Jack. This is a great question and I like the idea.

We'd need to detect whether a given spreadsheet was created:

With {a11ytables} or gptables (the Python analogue package) or meets best-practice guidance without these packages (definitely in scope)
With these packages, but the output has been adjusted slightly (might be tricky, depending on how off-piste the changes are)
Without these packages, and doesn't meet the guidance (out of scope)

At simplest, we could detect the tabs containing tables—or provide an argument to users to specify them—and then extract the table(s) alone. however, it might be useful to output a list object where each element represents a tab and we have elements for the title, presence of notes, the table(s), etc.

Related: in case you haven't seen them before, our colleague Duncan (@nacnudus) has written some great packages, {tidyxl} and {unpivotr}, for general-purpose spreadsheet parsing and wrangling. There's an associated online book too.

matt-dray commented 6 months ago

Another approach: use Fran's package {odsTableReadr} to identify each of the tables in such a workbook and then extract meta-information from cell A1 to the row just above the start of a table.

co-analysis / a11ytables

Reading data produced using a11ytables #90