Closed: ericnienhouse closed this issue 1 year ago
I've had some problems with database permissions.
Unfortunately, the MySQL permissions model is completely new to me. Through trial and error, I learned that

GRANT SUPER ON *.* TO 'brbo3303'@'localhost'

was not enough to allow a mysqlimport command to succeed. I had to switch to the "root" user to get past this permission issue. In addition, /etc/my.cnf (on macOS at least) must contain this configuration text:
[mysqld]
secure_file_priv = ""
Now that the database permissions problems are hopefully in the rearview mirror, I'm still not getting the workflow to run cleanly on my local machine.
I'm stepping through the code, getting more understanding of what Fred was trying to do with his python code. The code has a bit of a non-standard feeling to it. But I don't foresee a problem eventually understanding what he was trying to do.
For example, he has created his own custom parsing method for creating CSV files and mapping them to database table columns. In principle, this can work, but generally custom parsing methods are not preferred to Object-Relational Mapping tools for translating between database objects and corresponding data structures in software. It can be harder to follow the code and debug problems when custom parsing methods are used.
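For illustration only (these field names are hypothetical, not the project's actual schema), even without a full ORM a small dataclass can make the column mapping explicit, compared with appending positional values to a bare list:

```python
from dataclasses import dataclass, astuple

@dataclass
class TitleReportRow:
    # Field order must match the target table's column order.
    id: str
    title: str
    title_type: str
    publisher: str

row = TitleReportRow(id='null', title='Example Journal',
                     title_type='J', publisher='Example Publisher')
print(astuple(row))
```

With this shape, a misplaced column becomes a named-field error instead of a silent positional shift.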
This is the code that extracts Excel row values and exports them to a file called title_report_temp:
for row in self._worksheet.iter_rows(min_row=min(datarows), min_col=1, max_row=max(datarows), max_col=11):
    # Start building a list of field values. The actual fields and their sequence
    # must correspond to the title_report_temp table. See schema for details.
    datarow = list()
    datarow.append('null')  # id
    i = 0
    while i < len(row):
        if i == 1:
            datarow.append(self.title_type)  # title_type
        if i == 6 and self.title_type == 'J':
            datarow.append('')  # isbn
        if i == 9 and self.title_type == 'J':
            datarow.append('')  # yop
            break
        if row[i].value is None:
            datarow.append('')
        else:
            datarow.append(str(row[i].value).strip())
        i += 1
    datarow.append(self._filename)  # excel_name
    datarow.append(row_num)  # row_num
    datarow.append(0)  # title_report_id
    csvwriter.writerow(datarow)
    row_num += 1
When this code is run on the file tr-j3-acm-2021-0112.xlsx, the CSV file title_report_temp has rows that look like this:
null ACM Communications in Computer Algebra J Association for Computing Machinery acm:PUB27 acm 10.1145/SIGSAM-CCA acm:sigsam-cca 1932-2240 tr-j3-acm-2021-0112.xlsx 15 0
null ACM Communications in Computer Algebra J Association for Computing Machinery acm:PUB27 acm 10.1145/SIGSAM-CCA acm:sigsam-cca 1932-2240 tr-j3-acm-2021-0112.xlsx 16 0
null ACM Communications in Computer Algebra J Association for Computing Machinery acm:PUB27 acm 10.1145/SIGSAM-CCA acm:sigsam-cca 1932-2240 tr-j3-acm-2021-0112.xlsx 17 0
null ACM Communications in Computer Algebra J Association for Computing Machinery acm:PUB27 acm 10.1145/SIGSAM-CCA acm:sigsam-cca 1932-2240 tr-j3-acm-2021-0112.xlsx 18 0
null ACM Computing Surveys (CSUR) J Association for Computing Machinery acm:PUB27 acm 10.1145/CSUR acm:csur 0360-0300 1557-7341 tr-j3-acm-2021-0112.xlsx 19 0
null ACM Computing Surveys (CSUR) J Association for Computing Machinery acm:PUB27 acm 10.1145/CSUR acm:csur 0360-0300 1557-7341 tr-j3-acm-2021-0112.xlsx 20 0
null ACM Computing Surveys (CSUR) J Association for Computing Machinery acm:PUB27 acm 10.1145/CSUR acm:csur 0360-0300 1557-7341 tr-j3-acm-2021-0112.xlsx 21 0
null ACM Computing Surveys (CSUR) J Association for Computing Machinery acm:PUB27 acm 10.1145/CSUR acm:csur 0360-0300 1557-7341 tr-j3-acm-2021-0112.xlsx 22 0
But when it comes time to import the CSV into the database table title_report_temp, the following command fails:

mysqlimport --user=root_user --password=password --delete brian_counter /tmp/title_report_temp /tmp/metric_temp > mysqlimport_out.txt

with the error:

mysqlimport: Error: 1366, Incorrect integer value: 'null' for column 'id' at row 1, when using table: title_report_temp
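As an aside, mysqlimport derives the target table name from each input file's basename, which is why the files are named title_report_temp and metric_temp. A hypothetical helper for building the command from Python, using the names and paths from the command above:

```python
def build_import_cmd(user, password, database, files):
    # mysqlimport derives the target table from each file's basename,
    # so /tmp/title_report_temp loads into table title_report_temp.
    return ['mysqlimport', f'--user={user}', f'--password={password}',
            '--delete', database, *files]

cmd = build_import_cmd('root_user', 'password', 'brian_counter',
                       ['/tmp/title_report_temp', '/tmp/metric_temp'])
```

The list form is suitable for subprocess.run without shell quoting concerns.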
I just figured out that the MySQL server has a "strict mode" setting that is causing the import problem. The table title_report_temp has AUTO_INCREMENT turned on for the 'id' column, but with strict mode enabled the import still fails with the above error until strict mode is turned off.
So to turn off strict mode, the MySQL server settings file /etc/my.cnf should look like this:
[mysqld]
secure_file_priv = ""
sql_mode= ""
The sql_mode setting turns off strict mode and allows the mysqlimport command to work without error.
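For reference, strict mode corresponds to the STRICT_TRANS_TABLES / STRICT_ALL_TABLES flags inside the server's sql_mode value (as reported by SELECT @@sql_mode). A quick sketch of a check against such a string, which could be used to verify the configuration took effect:

```python
def strict_mode_enabled(sql_mode: str) -> bool:
    # sql_mode is a comma-separated list of mode flags, e.g. the
    # default includes STRICT_TRANS_TABLES on modern MySQL servers.
    flags = {f.strip() for f in sql_mode.split(',') if f.strip()}
    return bool(flags & {'STRICT_TRANS_TABLES', 'STRICT_ALL_TABLES'})

print(strict_mode_enabled('STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION'))  # True
print(strict_mode_enabled(''))  # False
```

Note that sql_mode = "" disables all modes, not just the strict ones; a narrower setting that keeps the other default modes may be preferable in the long run.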
I was able to reproduce the "Shifted Columns" error for the input file tr-b3-proquest-2021-0112.xlsx.
Updating the database with around 6,000 new rows, with a metric table with 551,000 rows, took 5-10 seconds on my local laptop.
Stack trace:
mysqlimport --user=**** --password=***** --delete brian_counter /Users/Shared/AmazonMirror/workingFiles/title_report_temp /Users/Shared/AmazonMirror/workingFiles/metric_temp > mysqlimport_out.txt
Traceback (most recent call last):
  File "/Users/brbo3303/PycharmProjects/counter-data-loader/loader.py", line 71, in <module>
    report.export()
  File "/Users/brbo3303/PycharmProjects/counter-data-loader/dataloader/tmreport.py", line 242, in export
    period_total = int(float(row[n].value))  # float conversion deals with cases of '0.0'
TypeError: float() argument must be a string or a number, not 'NoneType'
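The failing line converts row[n].value without guarding against empty Excel cells (openpyxl returns None for them). A defensive sketch of that conversion; whether 0 is the right default for a blank cell is an assumption worth checking against the COUNTER report semantics:

```python
def to_int(cell_value, default=0):
    # Treat an empty Excel cell (value is None) as the default
    # instead of raising TypeError in float().
    if cell_value is None:
        return default
    return int(float(cell_value))  # float() handles strings like '0.0'

print(to_int('0.0'))  # 0
print(to_int(None))   # 0
print(to_int('42'))   # 42
```

Alternatively, rows with missing metric values could be skipped and logged, which would surface bad input data rather than silently zeroing it.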
I was able to confirm that, after deleting the ProQuest rows with issues, the loader.py workflow re-tried and succeeded in loading the rows from tr-b3-proquest-2021-0112.xlsx.
Interestingly, the ProQuest Excel sheet had some dangling rows that didn't seem to cause a problem with loading from Excel:
Actually, the dangling rows did cause an error. Two ProQuest input files with very similar names were being loaded. When I thought the originally modified ProQuest file had run without errors, it was actually the other ProQuest input file that ran cleanly.
NOTE: The python function that grabs all Excel files returns them in different orders each time. The loader will not always restart with the last input file that had errors.
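A minimal fix sketch for the ordering issue, assuming the loader collects input files with something like Path.glob() (the directory layout here is hypothetical): sorting the results makes the processing order deterministic, since glob and os.listdir make no ordering guarantee.

```python
from pathlib import Path

def excel_files(directory):
    # sorted() gives a stable, repeatable processing order, so a
    # re-run always resumes with the same file sequence.
    return sorted(Path(directory).glob('*.xlsx'))
```

With this in place, a failed run restarts on the same file order every time, which makes the retry behavior described above predictable.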
Estimate: 5
Acceptance Criteria:
Run end-to-end COUNTER loader workflow in local development environment. Run as appropriate user & mysql configuration environment. Consider happy path and "bad" input data examples. Create new stories for problems encountered.
Note: Workflow docs: https://cu-boulder-libraries.readthedocs.io/en/latest/pages/counter.html