Closed: ndoyle-1 closed this issue 3 years ago
Take a look in the R folder: lots of the functions that aren't working in the README are actually in there under different names. What exactly are you looking for?
Thanks. I'm looking for a method to pull CH filing history updates into a data table that can be sorted by the time and date of each update and, separately, to produce a network graph from the directors' details of several companies.
Funnily enough, I made a pull request a few days ago on a `download_files` fork I wrote that does a lot of the groundwork for getting filing history updates. I could probably alter it to get what you need without too much trouble, as it's not in the main repo yet.
Could you help me understand what you mean by updates exactly? There are two places we can get dates for filings, and it's more ambiguous than you might think.
One is in the metadata for each individual filing history item. Here's what the API can give you. However, what I found was that even in filing history items with descriptions like "confirmation statement made on X with updates", the two `updated_at` fields didn't always contain any data, though `created_at` usually did.
The other is in the filing history list. Here's what that bit of the API can give you. However, there are quite a few different dates in there and I'm not sure which you'd want. The line I quoted above is found in the `annotations` part, and there's a whole list of them here.
On the network graph, I'm afraid that's not a side of the library I've used. Network stuff seems more in the maintainer's wheelhouse, but it looks like there are some `DirectorNetwork` functions in the R folder I mentioned.
Thanks for this. Would `items[].date` under `filingHistoryList` do the trick? I wanted to aggregate the latest filings from several companies in a DT data table or similar, where they could be searched and sorted by name or date, with the column names: date, company, description, view/download (clickable). I'm not too worried about the network graph currently.
`items.date` is actually something the library already fetches! Use the `doc_link_extract()` function on a company number; it's the `date` column.
Description is a bit trickier. I think that's likely `items[].annotations[]` from the API, but maybe `doc_link_extract()` has what you need in the `doc_type` column.
Getting a download link is basically why I wrote the extra functions in the pull request, as it's convoluted. Because each document can be stored in several formats, you need to tell the API which version you want (PDF, XHTML, etc.) before it'll give you a download link, and getting that information needs an extra API call. In the end I wrote `doc_meta_extract()` to get the formats a particular document is available in, and `doc_download()` to request the location and download it. I could split `doc_download()` into two functions so there's a single function that generates a download link. I'm not 100% sure it'll work because of the way you specify the document format in the header, but I can give it a go.
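For reference, the two-step flow described above could be sketched roughly like this. This is not the PR code: the endpoint URL, the `Accept` header behaviour, and the `doc_link_generate` name are my assumptions about how the Companies House Document API is used, and `doc_id`/`api_key` are placeholders.

```r
library(httr)

# Hypothetical helper: ask the Document API where a given format of a
# document lives, without downloading it. The format is chosen via the
# Accept header; the content endpoint replies with a redirect whose
# Location header is the actual download URL.
doc_link_generate <- function(doc_id, api_key,
                              accept = "application/pdf") {
  resp <- GET(
    paste0("https://document-api.companieshouse.gov.uk/document/",
           doc_id, "/content"),
    authenticate(api_key, ""),        # API key as username, blank password
    add_headers(Accept = accept),     # requested document format
    config(followlocation = FALSE)    # keep the redirect rather than follow it
  )
  headers(resp)$location
}
```

Splitting `doc_download()` this way would let one function return the link for a table and another actually fetch the file.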
Sounds good.
Turns out you can actually generate a download link with the existing library! `doc_link_extract()` returns a `links` column, and you can concatenate it into a download link like this:

```r
paste0('https://beta.companieshouse.gov.uk', link, '/document?download=1')
```
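Putting that together with the table described earlier: assuming `doc_link_extract()` returns a data frame with `date`, `doc_type`, and `links` columns (the column names here are guesses from this thread, not checked against the package), a sketch of the aggregated, sortable table could look like this:

```r
library(DT)

# Company numbers to aggregate (placeholders)
companies <- c('00000001', '00000002')

filings <- do.call(rbind, lapply(companies, function(no) {
  df <- doc_link_extract(no)   # assumed to return date/doc_type/links
  df$company <- no
  # Build a clickable download link from the links column
  df$download <- sprintf(
    '<a href="https://beta.companieshouse.gov.uk%s/document?download=1">download</a>',
    df$links
  )
  df[, c('date', 'company', 'doc_type', 'download')]
}))

# escape = FALSE keeps the download links clickable
datatable(filings, escape = FALSE,
          colnames = c('Date', 'Company', 'Description', 'View/download'))
```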
That usually returns a PDF, but some filings have multiple file formats. If you wanted to handle those, you could use `doc_meta_extract()` from my pull request and then format the `resource_types` column to return a filetype, which can then be concatenated into a `format` argument, like this:
```r
# Keep only the part after the last '/' (e.g. 'application/pdf' -> 'pdf'),
# then drop any '+' suffix (e.g. 'xhtml+xml' -> 'xhtml')
file_ext <- sub('^(.*[\\/])', '', doc_type, perl = TRUE)
file_ext <- sub('(\\+.*)$', '', file_ext, perl = TRUE)
paste0('https://beta.companieshouse.gov.uk', link, '/document?download=1', '&format=', file_ext)
```
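As a quick sanity check, those two substitutions can be wrapped in a small helper (the function name is mine, not from the library) and run over a couple of MIME types:

```r
# Hypothetical helper wrapping the two substitutions above:
# strip everything up to the last '/' and any '+suffix' from a MIME type.
content_type_to_ext <- function(doc_type) {
  file_ext <- sub('^(.*[\\/])', '', doc_type, perl = TRUE)
  sub('(\\+.*)$', '', file_ext, perl = TRUE)
}

content_type_to_ext(c('application/pdf', 'application/xhtml+xml'))
# returns c('pdf', 'xhtml')
```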
Sounds promising...
The README has been updated with the new function names.
Is there a plan to update this package? Many of the functions don't seem to work and it would be very useful for a project I'm working on.