baumer-lab / wikitablr

Simple Reader for Wikipedia Tables

consider not-exporting `_single()` functions #11

Closed beanumber closed 4 years ago

beanumber commented 4 years ago

Instead, make functions like clean_wiki_names() S3 generics, with methods for both class list and class data.frame. That way, the user only has one set of function names to remember, and doesn't have to worry about whether they have one or more tables.

Hmm....not sure about this, but it's worth thinking about...

rporta23 commented 4 years ago

I don't think users can currently use the _single() functions. I tried to use them to write the tests and it told me the functions couldn't be found.

beanumber commented 4 years ago

So I think the S3 generics make sense then. That way, the users won't have to choose a different function or think about whether they have a list or a single item.
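
A minimal sketch of what the S3 approach could look like, using base R only. The helper name `clean_names_single()` and its cleaning rules are assumptions made for illustration; the package's real logic may differ.

```r
# Hypothetical internal (non-exported) helper: cleans the names of ONE table.
# The rules here (lower-case, underscores) are placeholders, not the package's.
clean_names_single <- function(x) {
  nm <- tolower(names(x))
  nm <- gsub("[^a-z0-9]+", "_", nm)  # collapse runs of other characters
  nm <- gsub("^_+|_+$", "", nm)      # trim leading/trailing underscores
  names(x) <- nm
  x
}

# One generic for the user to remember ...
clean_wiki_names <- function(x, ...) {
  UseMethod("clean_wiki_names")
}

# ... with a method for a single table ...
clean_wiki_names.data.frame <- function(x, ...) {
  clean_names_single(x)
}

# ... and a method for a list of tables, which re-dispatches on each element.
clean_wiki_names.list <- function(x, ...) {
  lapply(x, clean_wiki_names, ...)
}

# Made-up example data
one_table <- data.frame(`Team Name` = "A", `Win %` = 0.5, check.names = FALSE)
many_tables <- list(one_table, one_table)

names(clean_wiki_names(one_table))
names(clean_wiki_names(many_tables)[[2]])
```

Because a tibble also inherits from `data.frame`, the same method would handle it; only the generic would need to be exported.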

beanumber commented 4 years ago

Another option would be to think about using nest() to put all of the table information in a list-column.

beanumber commented 4 years ago

The philosophical question is: what do we want the fundamental unit to be here? Is it a table? Or is it a Wikipedia page (i.e., a URL)? The former is just a data.frame, but the latter is a list of data frames.

If we do go the list-column route, it would be more straightforward: you would start with a URL and always get back a tibble with one row per table. Another advantage of this approach is that the other columns could hold metadata: for example, the URL, the timestamp when the data was retrieved, and the table number on the page, followed by a list-column of data.frames with the actual data.
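
To make the shape concrete, here is a rough sketch of such a tibble built by hand. The column names (`url`, `retrieved`, `table_number`, `table`), the URL, and the data are all assumptions for illustration, not what `read_wikitables()` actually returns.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Stand-ins for two tables scraped from one page (made-up data).
tables <- list(
  data.frame(team = c("A", "B"), wins = c(10, 7)),
  data.frame(year = 2019:2021)
)

# One row per table: metadata in ordinary columns, data in a list-column.
wiki_tbl <- tibble(
  url = "https://en.wikipedia.org/wiki/Example",  # placeholder URL
  retrieved = Sys.time(),
  table_number = seq_along(tables),
  table = tables
)

# Pull out a single table as a plain data frame.
picked <- wiki_tbl %>%
  filter(table_number == 2) %>%
  pull(table)
second <- picked[[1]]

# Or keep the metadata alongside the rows of one table.
first_long <- wiki_tbl %>%
  filter(table_number == 1) %>%
  unnest(table)
```

Note that `pull()` on a list-column returns a list, so one more `[[1]]` is needed to get at the data frame itself.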

If someone wants just one table, they can either

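# `table_number` is an illustrative name for the column with the table's position;
# `table` is the list-column of data frames
read_wikitables(url) %>%
  filter(table_number == 3) %>%
  pull(table)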

or

# `table_number` is an illustrative name for the column with the table's position
read_wikitables(url) %>%
  unnest(table) %>%
  filter(table_number == 3)

niannucci commented 4 years ago

It's been a while, so I don't remember my exact thinking, but I believe I wrote a single-table function for each operation that was not meant to be exported, and then mapped it over every data frame from the Wikipedia page. So the _single() functions are for internal use only, and the user never has to choose between them. Does it make sense to leave it this way, or are you saying we should structure this differently?

beanumber commented 4 years ago

Take a look at what I did in read_wikitables2(). Maybe it should be called read_wikitables_dfr(), by analogy with map_dfr().
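
For readers unfamiliar with the naming convention, this sketch shows what the `_dfr` suffix means in purrr: `map()` returns a list with one element per input, while `map_dfr()` row-binds the results into a single data frame. It illustrates the analogy only and makes no claim about what `read_wikitables2()` does; the data is made up.

```r
library(purrr)  # map_dfr() also needs dplyr installed

# Stand-ins for tables from one page (made-up data).
tables <- list(
  data.frame(team = c("A", "B"), wins = c(10, 7)),
  data.frame(team = "C", wins = 3)
)

# map(): one list element per table.
as_list <- map(tables, identity)

# map_dfr(): a single data frame, with `.id` recording which input
# each row came from. The column name "table_number" is an arbitrary choice.
as_df <- map_dfr(tables, identity, .id = "table_number")
```

`map_dfr()` is marked as superseded in purrr 1.0.0 in favor of `map()` followed by `list_rbind()`, but it is still available.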