GSS-Cogs / metadata-indexer

The metadata indexer will provide services to catalogue statistical publications from across government and surface them in a centralised database.
Apache License 2.0

beis scraper required #2

Open mikeAdamss opened 3 years ago

mikeAdamss commented 3 years ago

Need to be able to scrape datasets from https://naei.beis.gov.uk/

mikeAdamss commented 3 years ago

@CharlesRendle - is this the one we chatted about where you're waiting on BEIS for something? If so, can you comment this up with whatever the deal is please? Cheers.

CharlesRendle commented 3 years ago

The one dataset to transform from a naei.beis landing page is: https://github.com/GSS-Cogs/family-climate-change/tree/master/datasets/BEIS-GHG-activities

This dataset's landing page has conflicting metadata. E.g. the dataset title could be taken as either "UK emissions data selector - Defra, UK" or "UK emissions data selector - NAEI, UK". This has implications for the publisher, and there are 2 conflicting meta tags, Publisher and Creator, which yield different departments.

Also, the landing page is missing an issued date but contains a modified date.

BEIS has said that because the dataset began being populated in 1990, its continued updates have fallen under both DEFRA and BEIS since then.
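To make the conflict concrete, here is a minimal sketch (not the actual scraper) of how the landing page's metadata could be inspected for these problems. The meta tag names (`publisher`, `creator`, `issued`, `modified`), the sample HTML and its values are assumptions for illustration only; the real page's markup may differ.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the <title> text and every <meta name=... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # lower-cased meta name -> list of content values
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            name = attrs["name"].lower()
            self.meta.setdefault(name, []).append(attrs["content"].strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def inspect_landing_page(html):
    """Return the metadata found, flagging publisher conflicts and missing dates."""
    parser = MetaTagCollector()
    parser.feed(html)
    meta = parser.meta
    # Assumption: both tags are candidate sources for the publisher.
    publishers = set(meta.get("publisher", [])) | set(meta.get("creator", []))
    return {
        "title": "".join(parser.title_parts).strip(),
        "publisher_candidates": sorted(publishers),
        "publisher_conflict": len(publishers) > 1,
        "issued": (meta.get("issued") or [None])[0],
        "modified": (meta.get("modified") or [None])[0],
    }


# Hypothetical markup standing in for the real landing page.
SAMPLE_HTML = """
<html><head>
<title>UK emissions data selector - NAEI, UK</title>
<meta name="Publisher" content="BEIS">
<meta name="Creator" content="Defra">
<meta name="modified" content="2021-01-01">
</head><body></body></html>
"""

report = inspect_landing_page(SAMPLE_HTML)
```

With this sample, `report` flags a publisher conflict and reports `issued` as `None`, which is the situation described above.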

Finally, we are also aware that the data contained in this dataset has already been represented by a separate, previously transformed dataset in the EDV family. However, we have been informed by BEIS that the NAEI dataset (for which this scraper will be needed) is due for an update towards the end of July 2021, at which point we shall reassess the value added by the update.

Our position until this update is to keep the scraper/transform on hold. I have written the scraper to retrieve all the metadata I could, with permutations for whichever publisher we decide to use, but it lacks validation for the moment.
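For reference, a rough sketch of what the publisher permutations plus a basic validation step might look like. This is not the scraper's real code: `DatasetMetadata`, `build_candidates` and `validate` are hypothetical names, and the date value is a placeholder.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DatasetMetadata:
    title: str
    publisher: str
    issued: Optional[str]    # ISO date string, or None if the page has none
    modified: Optional[str]


def build_candidates(title, publishers, issued, modified) -> List[DatasetMetadata]:
    """One metadata record per candidate publisher, to be chosen between later."""
    return [
        DatasetMetadata(title=title, publisher=p, issued=issued, modified=modified)
        for p in publishers
    ]


def validate(md: DatasetMetadata) -> List[str]:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    if not md.title.strip():
        problems.append("missing title")
    if not md.publisher.strip():
        problems.append("missing publisher")
    for field_name in ("issued", "modified"):
        value = getattr(md, field_name)
        if value is None:
            problems.append(f"missing {field_name} date")
            continue
        try:
            date.fromisoformat(value)
        except ValueError:
            problems.append(f"{field_name} is not an ISO date: {value!r}")
    return problems


# Placeholder inputs: the modified date here is illustrative, not scraped.
candidates = build_candidates(
    title="UK emissions data selector - NAEI, UK",
    publishers=["BEIS", "Defra"],
    issued=None,
    modified="2021-01-01",
)
problems_by_publisher = {c.publisher: validate(c) for c in candidates}
```

Each candidate here would be reported with a missing issued date, so a decision is still needed on whether to fall back to the modified date or leave the field empty.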