CUB-Libraries-CTA / counter-data-loader

Loads COUNTER database from JR1 report spreadsheets
1 stars 2 forks source link

Add controlled vocabulary for platform names #10

Closed ghost closed 3 years ago

ghost commented 3 years ago

There are variations in platform names used by different aggregators. For example, one supplier will use ACM as the platform name while another will have it as ACM Digital Library. A controlled vocabulary will provide consistency in platform name usage, which should make it easier and improve overall clarity for end users.

The controlled vocabulary will be provided by the product team.

ghost commented 3 years ago

Final controlled vocabulary is available here.

To load platform data from the user-supplied spreadsheet requires some preprocessing to enable proper mapping and importing of the data into the database. The platform reference table (platform_ref) has five columns:

  1. id -- unique row identifier
  2. alt_id -- cross reference to the unique id in the R4 database
  3. name -- name as provided by the platform
  4. alias -- any other name(s) that may be used in lieu of the name in the name column
  5. preferred_name -- controlled vocabulary name
  6. has_faq -- flag indicating whether platform has explanatory text in the FAQ

NOTE: The alt_id is only used to migrate existing data from the R4 database to R5. Once data migration is complete, this column can be dropped.

Given the above, the original spreadsheet needs to be modified accordingly before the data can be exported to a tab delimited file (tab is preferred as the field delimiter as some platform names contain a comma). Once the data file has been prepared it can be imported into the database using the mysqlimport program. Note that the import program can be configured (in my.cnf or my.ini) to accept tabs as field terminators and a CRLF as the line terminator (for Windows-generated files only). These options can also be specified as options on the command line, e.g.,

$mysqlimport --delete --fields-terminated-by='\t' --lines-terminated-by='\r\n' --ignore-lines=1 counter5 sql/platform_ref.tsv

This command will load reference data from the specified TSV file into the platform_ref table, emptying the table before it does so. It assumes the user and password options are also specified in my.cnf/my.ini; otherwise, include -u and -p options in the command line.

platform_ref.txt