MatthewSmith430 / CompaniesHouse

This package extracts data from the Companies House API and creates interlocking directorate networks

Functions not working #5

Closed ndoyle-1 closed 3 years ago

ndoyle-1 commented 4 years ago

Is there a plan to update this package? Many of the functions don't seem to work and it would be very useful for a project I'm working on.

1lliter8 commented 4 years ago

Take a look in the R folder - lots of the functions that aren't working in the README are actually under a different name in there. What exactly are you looking for?

ndoyle-1 commented 4 years ago

Thanks. I'm looking for a method to pull CH filing history updates into a data table which can be sorted by time and date of update and, separately, to produce a network graph from the directors' details of several companies.

1lliter8 commented 4 years ago

Funnily enough, a few days ago I made a pull request from a download_files fork that does a lot of the groundwork for getting filing history updates. I could probably alter it to get what you need without too much trouble, as it's not in the main repo yet.

Could you help me understand what you mean by updates exactly? There are two places we can get dates for filings, and it's not as unambiguous as you might think.

One is in the metadata for each individual filing history item. Here's what the API can give you. However, what I found was that even in filing history items with descriptions like "confirmation statement made on X with updates" the two updated_at fields didn't always have any data in, though created_at usually did.

The other is in the filing history list. Here's what that bit of the API can give you. However, in there there's quite a few different dates and I'm not sure which you'd want. That line I quoted above is found in the annotations part, and there's a whole list of them here.

On the network graph I'm afraid it's not a side of the library I've used. Network stuff seems more in the maintainer's wheelhouse, but it looks like there's some DirectorNetwork functions in the R folder I mentioned.

ndoyle-1 commented 4 years ago

Thanks for this. Would items[].date under filingHistoryList do the trick? I wanted to aggregate the latest filings from several companies in a DT data table or similar where they could be searched and sorted by name or date, with column names: date, company, description, view/download (clickable). I’m not too worried about the network graph currently.

1lliter8 commented 4 years ago

items[].date is actually something the library already fetches! Use the doc_link_extract() function on a company number; it's the date column.

Description is a bit trickier, I think that's likely items[].annotations[] from the API but maybe doc_link_extract() has what you need in the doc_type column.
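A minimal sketch of the aggregation step, assuming doc_link_extract() returns a data frame with date, doc_type and links columns as described above. The stand-in data frames, the company numbers, the ISO date format and the combine_filings() helper are all made up for illustration and are not part of the library:

```r
# Stand-in data shaped like the columns described above (hypothetical values).
# In practice each data frame would come from doc_link_extract(<company number>).
filings_a <- data.frame(
  date = c("2020-03-01", "2019-11-15"),
  doc_type = c("CS01", "AA"),
  links = c("/document/aaa", "/document/bbb"),
  stringsAsFactors = FALSE
)
filings_b <- data.frame(
  date = "2020-05-20",
  doc_type = "AP01",
  links = "/document/ccc",
  stringsAsFactors = FALSE
)

# Hypothetical helper: tag each data frame with its company number,
# stack them, and sort with the most recent filing first.
combine_filings <- function(filings_by_company) {
  tagged <- Map(function(df, company) {
    df$company <- company
    df
  }, filings_by_company, names(filings_by_company))
  combined <- do.call(rbind, tagged)
  combined$date <- as.Date(combined$date)  # assumes ISO-formatted date strings
  combined <- combined[order(combined$date, decreasing = TRUE), ]
  rownames(combined) <- NULL
  combined[, c("date", "company", "doc_type", "links")]
}

latest <- combine_filings(list("00000001" = filings_a, "00000002" = filings_b))
print(latest)
```

The resulting data frame could then be handed to something like DT::datatable(latest) to get the searchable, sortable table.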

Getting a download link is basically why I wrote the extra functions in the pull request as it's convoluted. Because each document can be stored in several formats you need to tell the API what version you want (PDF, XHTML etc) before it'll give you a download link, and getting that info needs an extra API call. In the end I wrote doc_meta_extract() to get the formats a particular document is available in, and doc_download() to request the location and download it. I could split doc_download() into two functions so there's a single function to generate a download link. I'm not 100% sure it'll work because of the way you specify document format in the header, but I can give it a go.

ndoyle-1 commented 4 years ago

Sounds good.

1lliter8 commented 4 years ago

Turns out you can actually generate a download link with the existing library! doc_link_extract() returns a links column and you can concatenate it into a download link like this:

paste0('https://beta.companieshouse.gov.uk', link, '/document?download=1')
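As a rough sketch, the same concatenation can be applied to a whole links column at once, since paste0() is vectorised. The example paths and the two helper names below are hypothetical; only the base URL and the '/document?download=1' suffix come from the line above:

```r
base_url <- "https://beta.companieshouse.gov.uk"

# Hypothetical helper: turn relative links into download URLs.
build_download_link <- function(link) {
  paste0(base_url, link, "/document?download=1")
}

# Hypothetical helper: wrap a URL in an HTML anchor so it can be clickable.
make_anchor <- function(url, label = "download") {
  sprintf('<a href="%s">%s</a>', url, label)
}

# Made-up paths standing in for values from the links column.
example_links <- c("/example/path/one", "/example/path/two")

urls <- build_download_link(example_links)
anchors <- make_anchor(urls)
print(urls)
print(anchors)
```

If the anchors end up in a DT table, datatable() would need escape = FALSE for that column so the HTML is rendered rather than shown as text.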

That usually returns a PDF, but some filings have multiple file formats. If you wanted to handle those you could use doc_meta_extract() from my pull request, then format the resource_types column to return a filetype. This can then be concatenated to a format argument, like this:

# Strip everything up to and including the last slash
file_ext <- sub('^(.*[\\/])', '', doc_type, perl = TRUE)
# Drop a trailing '+suffix' if there is one
file_ext <- sub('(\\+.*)$', '', file_ext, perl = TRUE)
paste0('https://beta.companieshouse.gov.uk', link, '/document?download=1', '&format=', file_ext)
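
To show what the two sub() calls do, here is a small worked example. It assumes the resource_types values are MIME-style strings such as 'application/pdf'; that is an assumption about the data, and mime_to_ext() is a hypothetical wrapper, not a library function:

```r
# Hypothetical wrapper around the two sub() calls above.
mime_to_ext <- function(mime) {
  ext <- sub("^(.*[\\/])", "", mime, perl = TRUE)  # remove the "type/" prefix
  sub("(\\+.*)$", "", ext, perl = TRUE)            # remove any "+suffix"
}

# Assumed example inputs.
mime_to_ext(c("application/pdf", "application/xhtml+xml"))
# returns "pdf" "xhtml"
```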

ndoyle-1 commented 4 years ago

Sounds promising...

MatthewSmith430 commented 3 years ago

The readme has been updated, with the new function names.