CUB-Libraries-CTA / counter-data-loader

Loads COUNTER database from JR1 report spreadsheets

Run COUNTER Data Workflow in Development Environment (5) #46

Closed ericnienhouse closed 1 year ago

ericnienhouse commented 2 years ago

Estimate: 5

Acceptance Criteria:

Run the end-to-end COUNTER loader workflow in a local development environment, as the appropriate user and with the appropriate MySQL configuration. Consider both happy-path and "bad" input data examples. Create new stories for any problems encountered.

Note: Workflow docs: https://cu-boulder-libraries.readthedocs.io/en/latest/pages/counter.html

bonnland commented 2 years ago

I've had some problems with database permissions.

Unfortunately, the MySQL permissions model is completely new to me. Through trial and error, I learned:

  1. The "mysqlimport" command used within the workflow requires admin permissions. I spent a long time gradually elevating permissions for my non-privileged user to see when this command would succeed. Even GRANT SUPER ON *.* TO 'brbo3303'@'localhost' was not enough for a mysqlimport command to succeed; I had to switch to the "root" user to get past this permission issue.
  2. By default, MySQL does not allow importing and exporting table data to and from the filesystem. I had to create a server configuration file to allow CSV file creation and table import commands to succeed. After a lot of research, I discovered that the server configuration file must reside at /etc/my.cnf (on macOS at least) and must contain this configuration text:
    [mysqld]
    secure_file_priv = ""
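
A quick sanity check after editing the file is to confirm the setting actually made it into the [mysqld] section. This is a hedged sketch: MySQL option files are close enough to INI format for Python's configparser to handle this simple case, but this is not a full option-file parser, and the helper name is mine.

```python
import configparser
import os
import tempfile

def secure_file_priv_disabled(cnf_path):
    """Return True if [mysqld] sets secure_file_priv = "" in cnf_path."""
    cp = configparser.ConfigParser(allow_no_value=True)
    cp.read(cnf_path)
    # The raw INI value here is the two-character string '""'.
    return cp.get("mysqld", "secure_file_priv", fallback=None) == '""'

# Demo against a temporary copy of the config shown above.
with tempfile.NamedTemporaryFile("w", suffix=".cnf", delete=False) as f:
    f.write('[mysqld]\nsecure_file_priv = ""\n')
    cnf = f.name

result = secure_file_priv_disabled(cnf)
os.remove(cnf)
print(result)  # True when the setting is present
```

(You can also ask the running server directly with `SELECT @@secure_file_priv;` from a mysql client.)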
bonnland commented 2 years ago

Now that the database permission problems are hopefully in the rearview mirror, I'm still not getting the workflow to run cleanly on my local machine.

I'm stepping through the code, getting a better understanding of what Fred was trying to do with his Python code. The code has a somewhat non-standard feel to it, but I don't foresee a problem eventually understanding what he was trying to do.

For example, he created his own custom parsing method for building CSV files and mapping them to database table columns. In principle this can work, but custom parsing is generally less preferable than Object-Relational Mapping tools for translating between database objects and the corresponding data structures in software: the code can be harder to follow, and problems harder to debug.
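
To illustrate the trade-off, here is a hedged sketch (hypothetical class and field names, not the loader's actual schema) of how a declarative record type keeps the column order and None-handling in one place, rather than implied by the order of append() calls in a loop:

```python
from dataclasses import astuple, dataclass

# Hypothetical record mirroring a few title_report_temp columns;
# the real table has more fields. The column order is stated once,
# in the class definition.
@dataclass
class TitleReportRow:
    title: str
    title_type: str
    publisher: str

def to_csv_row(record):
    # Normalize every column the same way: None becomes an empty
    # string, everything else is stringified and stripped.
    return ["" if v is None else str(v).strip() for v in astuple(record)]

row = TitleReportRow("ACM Computing Surveys (CSUR)", "J", None)
print(to_csv_row(row))  # ['ACM Computing Surveys (CSUR)', 'J', '']
```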

bonnland commented 2 years ago

This is the code that extracts Excel row values and exports them to a file called title_report_temp:

            for row in self._worksheet.iter_rows(min_row=min(datarows), min_col=1, max_row=max(datarows), max_col=11):
                # Start building a list of field values. The actual fields and their sequence
                # must correspond to the title_report_temp table. See schema for details.
                datarow = list()
                datarow.append('null') # id
                i = 0
                while i < len(row):
                    if i == 1:
                        datarow.append(self.title_type) # title_type
                    if i == 6 and self.title_type == 'J':
                        datarow.append('') # isbn
                    if i == 9 and self.title_type == 'J':
                        datarow.append('') # yop
                        break
                    if row[i].value is None:
                        datarow.append('')
                    else:
                        datarow.append(str(row[i].value).strip())
                    i += 1
                datarow.append(self._filename) # excel_name
                datarow.append(row_num) # row_num
                datarow.append(0) # title_report_id
                csvwriter.writerow(datarow)
                row_num += 1

When this code is run on the file tr-j3-acm-2021-0112.xlsx, the CSV file title_report_temp has rows that look like this:

null    ACM Communications in Computer Algebra  J   Association for Computing Machinery acm:PUB27   acm 10.1145/SIGSAM-CCA  acm:sigsam-cca      1932-2240               tr-j3-acm-2021-0112.xlsx    15  0
null    ACM Communications in Computer Algebra  J   Association for Computing Machinery acm:PUB27   acm 10.1145/SIGSAM-CCA  acm:sigsam-cca      1932-2240               tr-j3-acm-2021-0112.xlsx    16  0
null    ACM Communications in Computer Algebra  J   Association for Computing Machinery acm:PUB27   acm 10.1145/SIGSAM-CCA  acm:sigsam-cca      1932-2240               tr-j3-acm-2021-0112.xlsx    17  0
null    ACM Communications in Computer Algebra  J   Association for Computing Machinery acm:PUB27   acm 10.1145/SIGSAM-CCA  acm:sigsam-cca      1932-2240               tr-j3-acm-2021-0112.xlsx    18  0
null    ACM Computing Surveys (CSUR)    J   Association for Computing Machinery acm:PUB27   acm 10.1145/CSUR    acm:csur        0360-0300   1557-7341           tr-j3-acm-2021-0112.xlsx    19  0
null    ACM Computing Surveys (CSUR)    J   Association for Computing Machinery acm:PUB27   acm 10.1145/CSUR    acm:csur        0360-0300   1557-7341           tr-j3-acm-2021-0112.xlsx    20  0
null    ACM Computing Surveys (CSUR)    J   Association for Computing Machinery acm:PUB27   acm 10.1145/CSUR    acm:csur        0360-0300   1557-7341           tr-j3-acm-2021-0112.xlsx    21  0
null    ACM Computing Surveys (CSUR)    J   Association for Computing Machinery acm:PUB27   acm 10.1145/CSUR    acm:csur        0360-0300   1557-7341           tr-j3-acm-2021-0112.xlsx    22  0

But when it comes time to import the CSV into the database table title_report_temp, the following command fails:

mysqlimport --user=root_user --password=password --delete brian_counter /tmp/title_report_temp /tmp/metric_temp > mysqlimport_out.txt

with the error:

mysqlimport: Error: 1366, Incorrect integer value: 'null' for column 'id' at row 1, when using table: title_report_temp
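
One way to sidestep this error at the source (a sketch, not the project's actual fix) would be to emit MySQL's NULL escape sequence \N for the auto_increment id column instead of the literal string 'null'; LOAD DATA INFILE, which mysqlimport wraps, interprets \N as SQL NULL and lets the server assign the id even under strict mode. The helper and field names below are illustrative:

```python
import csv
import io

def write_temp_row(csvwriter, fields, excel_name, row_num):
    # \N (backslash, capital N) is the NULL marker recognized by
    # LOAD DATA INFILE / mysqlimport; the string 'null' is not.
    datarow = ["\\N"] + fields + [excel_name, row_num, 0]
    csvwriter.writerow(datarow)

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
write_temp_row(writer, ["ACM Computing Surveys (CSUR)", "J"],
               "tr-j3-acm-2021-0112.xlsx", 19)
print(buf.getvalue())
```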

bonnland commented 2 years ago

I just figured out that the MySQL server has a "strict mode" setting that is causing the import problem.

The table title_report_temp has auto_increment enabled for the 'id' field, but with strict mode on, the import still fails with the above error until strict mode is turned off.

So to turn off strict mode, the MySQL server settings file /etc/my.cnf should look like this:

[mysqld]
secure_file_priv = ""
sql_mode = ""

Setting sql_mode to an empty string turns off strict mode and allows the mysqlimport command to work without error.

bonnland commented 1 year ago

I was able to reproduce the "Shifted Columns" error for the input file tr-b3-proquest-2021-0112.xlsx.

Updating the database with around 6,000 new rows, against a metric table of 551,000 rows, took 5-10 seconds on my local laptop.

Stack trace:

mysqlimport --user=**** --password=***** --delete brian_counter /Users/Shared/AmazonMirror/workingFiles/title_report_temp /Users/Shared/AmazonMirror/workingFiles/metric_temp > mysqlimport_out.txt
Traceback (most recent call last):
  File "/Users/brbo3303/PycharmProjects/counter-data-loader/loader.py", line 71, in <module>
    report.export()
  File "/Users/brbo3303/PycharmProjects/counter-data-loader/dataloader/tmreport.py", line 242, in export
    period_total = int(float(row[n].value)) # float conversion deals with cases of '0.0'
TypeError: float() argument must be a string or a number, not 'NoneType'
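
A minimal defensive version of the conversion at tmreport.py line 242 might look like the sketch below. Treating an empty Excel cell (None) as 0 is an assumption on my part; skipping or flagging the row may be the better policy.

```python
def to_period_total(value):
    """Convert an Excel cell value to an int metric total."""
    if value is None:
        # Empty cell: assume 0 here, though skipping the row or
        # raising a validation error may be more appropriate.
        return 0
    # float() first, so strings like '0.0' convert cleanly to int.
    return int(float(value))

print(to_period_total(None), to_period_total("0.0"), to_period_total("12"))
```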
bonnland commented 1 year ago

I was able to confirm that, after deleting the problematic ProQuest rows, the loader.py workflow retried and succeeded in loading the rows from tr-b3-proquest-2021-0112.xlsx.

bonnland commented 1 year ago

Interestingly, the ProQuest Excel sheet had some dangling rows that didn't seem to cause a problem with loading from Excel:

(Screenshot, 2022-11-02: dangling rows at the bottom of the ProQuest Excel sheet)

bonnland commented 1 year ago

Actually, the dangling rows did cause an error. Two ProQuest input files with very similar names are being loaded, and when I thought the originally modified ProQuest file had run without errors, it was actually the other ProQuest file that had loaded cleanly.

NOTE: The Python function that gathers all the Excel files returns them in a different order on each run, so the loader will not always restart with the input file that last had errors.
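
A small fix for the ordering problem (a sketch with a hypothetical function name, not the loader's actual code) is to sort the matched files, since glob/iterdir order is filesystem-dependent:

```python
import tempfile
from pathlib import Path

def excel_files(directory):
    # sorted() gives a stable, repeatable processing order across
    # runs; the raw glob order is filesystem-dependent.
    return sorted(Path(directory).glob("*.xlsx"))

# Demo with two similarly named ProQuest-style files in a temp dir.
with tempfile.TemporaryDirectory() as d:
    for name in ("tr-j3-acm-2021-0112.xlsx", "tr-b3-proquest-2021-0112.xlsx"):
        (Path(d) / name).touch()
    names = [p.name for p in excel_files(d)]
print(names)  # ['tr-b3-proquest-2021-0112.xlsx', 'tr-j3-acm-2021-0112.xlsx']
```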