CUB-Libraries-CTA / counter-data-loader

Loads COUNTER database from JR1 report spreadsheets
1 stars 2 forks source link

Pre-process excel file to determine if platform references are not yet defined (3) #56

Closed ericnienhouse closed 1 year ago

ericnienhouse commented 1 year ago

Problem: To identify errors in excel input file and entire load workflow is necessary. This is time consuming. It would be helpful to know in advance if a platform reference is missing in advance.

Estimate: 3

Acceptance Criteria:

Improve the pre-processing step to identify missing platform references (and not rename the file thereby preventing further processing). Consider: Use same lookup sql approach as the database loading/insert uses to find associated platform.

bonnland commented 1 year ago

This is potentially a side issue that doesn't involve "Platform" validation directly, but it seemed like a place to discuss what we should consider a "valid" input file.

One of the input files (see log) has many rows with no "Publisher" value. At first, I thought it was a required field because most input files have no missing value for this column. But based on one input file with many missing Publisher values, it makes me wonder if this field is considered a Required field.

I've left a message with Vida in Teams to see if she has a viewpoint on this. The error message looks like this:

TR_B3 2021_EBSCO.xlsx:  is missing one of (Title, Publisher, Platform) on these rows: [ 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 695 696 697 815 816 817 854 855 856 1133 1134 1135 1145 1146 1147 1211 1212 1213 1325 1326 1327 1595 1596 1597 1604 1605 1606 1730 1731 1732 1931 1932 1933 2270 2271 2272 2473 2474 2475 3151 3152 3153 3250 3251 3252 3499 3500 3501 4276 4277 4278 4726 4727 4728 4933 4934 4935 4942 4943 4944 5158 5159 5160 5671 5672 5673 5887 5888 5889 6064 6065 6066 7138 7139 7140 7450 7451 7452 7909 7910 7911 8233 8234 8235 8539 8540 8541 8542 8543 8544 8782 8783 8784 8899 8900 8901 8917 8918 8919 9046 9047 9048 9052 9053 9054 9079 9080 9081 9169 9170 9171 9262 9263 9264 9292 9293 9294 9295 9296 9297 9331 9332 9333 9334 9335 9336 9433 9434 9435 9562 9563 9564 10429 10430 10431 10708 10709 10710 10828 10829 10830 11521 11522 11523 11854 11855 11856 11863 11864 11865 12202 12203 12204 12403 12404 12405 12556 12557 12558 12712 12713 12714 12784 12785 12786 13114 13115 13116 13126 13127 13128 13324 13325 13326 13861 13862 13863 13990 13991 13992 14092 14093 14094 14413 14414 14415 14689 14690 14691 14782 14783 14784 15124 15125 15126 15283 15284 15285 16519 16520 16521 16645 16646 16647 16720 16721 16722 17104 17105 17106 17230 17231 17232 17284 17285 17286 17368 17369 17370 17482 17483 17484 18013 18014 18015 18478 18479 18480 18535 18536 18537 18928 18929 18930 18937 18938 18939 19897 19898 19899 19984 19985 19986 20581 20582 20583 20821 20822 20823 20962 20963 20964 20983 20984 20985 20992 20993 20994 21031 21032 21033 21076 21077 21078 21364 21365 21366 21778 21779 21780 21781 21782 21783 21796 21797 21798 21799 21800 21801 22102 22103 22104 22249 22250 22251 22414 22415 22416 22576 22577 22578 22741 22742 22743 22804 22805 22806 23098 23099 23100 23404 23405 23406 23470 23471 23472 23677 23678 23679 23743 23744 23745 24292 24293 24294 24601 24602 24603 24733 24734 24735 25084 25085 25086 25123 25124 25125 25126 25127 25128 25129 25130 25131 25132 25133 25134 25441 25442 25443 25444 25445 25446 25465 25466 25467 25873 25874 25875 26260 26261 26262 26422 26423 26424 27268 27269 27270 27283 27284 27285 27430 27431 27432 27754 27755 27756 27766 27767 27768 28021 28022 28023 28048 28049 28050 28099 28100 28101 28828 28829 28830 28972 28973 28974 29893 29894 29895 30910 30911 30912 31099 31100 31101 31234 31235 31236 31513 31514 31515 31537 31538 31539 31552 31553 31554 31555 31556 31557 31570 31571 31572 31618 31619 31620 31663 31664 31665 32722 32723 32724 32797 32798 32799 33031 33032 33033 33649 33650 33651 33658 33659 33660 33691 33692 33693 33856 33857 33858 34609 34610 34611 34789 34790 34791 34801 34802 34803 34885 34886 34887 34948 34949 34950 34957 34958 34959 35080 35081 35082 35170 35171 35172 35332 35333 35334 35845 35846 35847 36025 36026 36027 36091 36092 36093 36334 36335 36336 36685 36686 36687 36946 36947 36948 37036 37037 37038 37069 37070 37071 37765 37766 37767 38089 38090 38091 38251 38252 38253 38353 38354 38355 38953 38954 38955 38965 38966 38967 39241 39242 39243 39376 39377 39378 39736 39737 39738 39793 39794 39795 40033 40034 40035 40057 40058 40059 40066 40067 40068 40084 40085 40086 40219 40220 40221 40264 40265 40266 40369 40370 40371 40756 40757 40758 40912 40913 40914 41083 41084 41085 42091 42092 42093 42181 42182 42183 42694 42695 42696 42763 42764 42765 43120 43121 43122 43123 43124 43125 43747 43748 43749 43843 43844 43845 43864 43865 43866 43873 43874 43875 44470 44471 44472 44692 44693 44694 44887 44888 44889 44941 44942 44943 45052 45053 45054 45352 45353 45354 45469 45470 45471 45472 45473 45474 45571 45572 45573 45580 45581 45582 45799 45800 45801 45823 45824 45825 45949 45950 45951 46009 46010 46011 46036 46037 46038 46126 46127 46128 46675 46676 46677 46801 46802 46803 47197 47198 47199 47710 47711 47712 47893 47894 47895 47956 47957 47958 48391 48392 48393 48481 48482 48483 48853 48854 48855 49135 49136 49137 49231 49232 49233 49393 49394 49395 49405 49406 49407 49627 49628 49629 49972 49973 49974 50134 50135 50136 50980 50981 50982 51499 51500 51501 52108 52109 52110 52162 52163 52164 52213 52214 52215 52582 52583 52584 52741 52742 52743 53044 53045 53046 53053 53054 53055 53383 53384 53385 53398 53399 53400 54343 54344 54345 54931 54932 54933 54964 54965 54966 55102 55103 55104 55603 55604 55605 55723 55724 55725 56359 56360 56361 56386 56387 56388 56824 56825 56826 56833 56834 56835 56929 56930 56931 56953 56954 56955 57007 57008 57009 57184 57185 57186 57457 57458 57459 57961 57962 57963 57979 57980 58026 58027 58028 58785 58786 58787 58791 58792 58793 58794 58795 58796 58803 58804 58805 58995 58996 58997 59199 59200 59201 59202 59203 59204 59205 59206 59207 59676 59677 59678 59802 59803 59804 59907 59908 59909 59973 59974 59975 59982 59983 59984 60228 60229 60230 60630 60631 60632 60633 60634 60635 60636 60637 60638 60642 60643 60644 60645 60646 60647 60648 60649 60650 60651 60652 60653 60654 60655 60656 60657 60658 60659 60660 60661 60662 60663 60664 60665 60666 60667 60668 60669 60670 60671 60672 60673 60674 60675 60676 60677 60678 60679 60680 60681 60682 60683 60684 60685 60686 60687 60688 60689 60690 60691 60692 60693 60694 60695 60696 60697 60698 60699 60700 60701 60702 60703 60704 60705 60706 60707 60708 60709 60710 60711 60712 60713 60714 60715 60716 60717 60718 60719 60720 60721 60722 60723 60724 60725 60726 60727 60728 60729 60730 60731 60732 60733 60734 60735 60736 60737 60738 60739 60740 60741 60742 60743 60744 60745 60746 60768 60769 60770 60840 60841 60842 61254 61255 61256 61317 61318 61319 61350 61351 61352 61365 61366 61367 61482 61483 61484 61596 61597 61598 61857 61858 61859 62298 62299 62300 62484 62485 62486 62733 62734 62735 63954 63955 63956 64470 64471 64472 64572 64573 64574 64674 64675 64676 64809 64810 64811 64932 64933 64934 65805 65806 65807 65955 65956 65957 66051 66052 66053 66363 66364 66365 66366 66367 66368 66414 66415 66416 66696 66697 66698 66717 66718 66719 66927 66928 66929 66993 66994 66995 67119 67120 67121 67320 67321 67322 67761 67762 67763 68172 68173 68174 68175 68176 68177 ]
bonnland commented 1 year ago

Attached in errors.log are the preprocessing errors for most of the Excel input files for 2021. Excluded from this run were a handful of files that immediately failed to open because they were not in the correct Excel spreadsheet format, or the relevant Sheet in the spreadsheet was not the first in the file.

Summary of Results:

I will be asking Vida for feedback about whether it's better to report missing values for Publisher, or ignore them.

errors.log