claroty / access_parser

A Python based parser for Microsoft Access database files
Apache License 2.0

Incorrect parsing of IPEDS database #6

Closed by adruzenko03 3 years ago

adruzenko03 commented 3 years ago

Hello, I am working on a JavaScript program that needs to parse Access database files. I used this extension, which wraps this project as a Node package, to try to parse the IPEDS database (Download Link: Here).

The problem code parses the HD2018 table in the IPEDS201819 database.
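For anyone who wants to reproduce this against the Python library directly rather than through the Node wrapper, a minimal sketch might look like the following. The file name `IPEDS201819.accdb` and the helper `parse_hd2018` are assumptions made for illustration; only `AccessParser` and `parse_table` come from the access_parser README.

```python
import os


def parse_hd2018(db_path, table_name="HD2018"):
    """Parse one table from an Access file, or return None if the file is missing.

    `parse_hd2018` is a hypothetical helper name, not part of access_parser.
    """
    if not os.path.isfile(db_path):
        return None
    # Imported lazily so this sketch can be loaded without the package installed.
    from access_parser import AccessParser

    db = AccessParser(db_path)
    # parse_table returns a mapping of column name to a list of column values.
    return db.parse_table(table_name)


if __name__ == "__main__":
    # The path below is an assumed local file name for the IPEDS download.
    table = parse_hd2018("IPEDS201819.accdb")
    if table is None:
        print("Database file not found")
    else:
        for column, values in table.items():
            print(column, values[:1])
```

Printing only the first value of each column keeps the output short enough to compare against the first row shown in this report.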

When I try to parse it, I get a large number of error messages in the following format:

Memo data inline
Parsing memo field ♂���077227858
Memo data inline
Parsing memo field ↔���Concordia University Irvine
Memo data inline
Parsing memo field ♂���076084946
Memo data inline
Parsing memo field ☻�
Memo data inline

As the table is massive, I removed some of the print statements to see whether any other errors were being reported, and the following messages also appeared:

Overflow record flag is not present 2990
LVAL type 1
Overflow record flag is not present 2848
LVAL type 1
Overflow record flag is not present 2776
LVAL type 1
Overflow record flag is not present 2572
LVAL type 1
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes
Failed to parse memo field. Using data as bytes

Parsing yielded the following result:

[
  '100654',
  '1',
...Everything fine between here...
  '119',
  '1',
  'Alabama A & M University',
  'AAMU',
  '4900 Meridian Street',
  'Normal',
  'AL',
  '35762Dr. Andrew Hugine, Jr.President2563725000636001109 \x0B耀\x00\x00\x00\x00㤱㈷㘱㔴〵 ㄀  ㈀    眀眀眀⸀愀愀洀甀⸀攀搀甀⼀眀眀眀⸀愀愀洀甀⸀攀搀甀⼀䄀 
搀洀椀猀猀椀漀渀猀⼀倀愀最攀猀⼀搀攀昀愀甀氀琀⸀愀猀瀀砀眀',
  '',
  'President',
  '2563725000',
  '636001109 ',
  '197216455',
  '00100200  ',
  'www.aamu.edu/',
  'www.aamu.edu/Admissions/Pages/default.aspxwww.aamu.edu/admissions/fincialaid/pages/default.aspxhttps://www.aamu.edu/Admissions/UndergraduateAdmissions/Pages/Apply%20Today',
  '',
  'https://www.aamu.edu/Admissions/UndergraduateAdmissions/Pages/Apply%20Today.aspxhttps://galileo.aamu.edu/NetPriceCalculator/npcalc.htm  www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Services.aspxA-',
  '',
  ' ',
  ' ',
  'www.aamu.edu/administrativeoffices/VADS/Pages/Disability-Services.aspx',
  'A-2        -2                                                                              -2Madison County\x80\x00\x00\x00\x00\x00㘀⣮\x05\x00\x00\x00\x00\x00\x00삈ȒӭӜӋүҫ',
  '',
  '-2                                                                              ',
  '-2',
  'Madison County',
  '�\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x006�(\x05',
  ''
]
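Some of the garbled runs in the output above are consistent with text bytes being decoded as UTF-16LE when they are either plain single-byte text or shifted by one byte. The sketch below only illustrates that symptom with the standard library; it is not the library's code and not a diagnosis of the actual bug, and both function names are made up for this example.

```python
def decode_shifted_utf16(text: str) -> str:
    """Encode text as UTF-16LE, shift it by one byte, and decode it again."""
    raw = text.encode("utf-16-le")
    # Prepend one zero byte and drop the last byte so the length stays even.
    shifted = (b"\x00" + raw)[:-1]
    return shifted.decode("utf-16-le")


def decode_ascii_as_utf16(text: str) -> str:
    """Decode single-byte ASCII text as if it were UTF-16LE."""
    raw = text.encode("ascii")
    # errors="replace" keeps an odd trailing byte from raising an exception.
    return raw.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    # Matches the "眀眀眀⸀愀愀洀甀⸀攀搀甀⼀" run in the result above.
    print(decode_shifted_utf16("www.aamu.edu/"))
    # Matches the "㤱㈷㘱㔴〵" run: two ASCII digits packed into each character.
    print(decode_ascii_as_utf16("1972164550"))
```

Both garbled runs from the first row can be reproduced this way, which suggests the field boundaries or the text encoding flag were being misread rather than the file itself being damaged.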

As mentioned, these are only snippets of the errors, as this error was thrown for what seemed like nearly every line of this multi-thousand-line database. I posted this as an issue on the Node project and was directed here, as this seems to be a logic issue in the parser. Please tell me if any more information is required, or if this is a duplicate issue that I missed. Alex

ur1katz commented 3 years ago

Hello, thanks for opening the issue; it revealed multiple problems in the parsing flow. The following fixes (already in master) should resolve all issues with the IPEDS201819 database: