NYPL / ami-tools

pamidb_to_json period-related data drop #79

Closed bturkus closed 2 years ago

bturkus commented 2 years ago

Reported by Casey Mcnamara: pandas/the pamidb_to_json script appears to be cutting off CSV data in various fields when transforming the CSV into individual JSONs, apparently only when those fields include period/dot/full stop characters. Per Casey:

Periods in text fields, plus any text that follows the periods, are cut off. Example - 632804 (bibliographic title field):

json is

"Christopher Scott: interview (pt"

but the full title in CMS is

"Christopher Scott: interview (pt. 2) (8-1-89) -- John Brockmeyer: interview (pt. 1) (8-2-89)"

Attaching a FileMaker merge/CSV for reference/testing. I attempted to figure this out on my own but couldn't get far. If there's no easy fix, I'd be happy to revisit an alternate version of this script that I've been kicking around for the past few months.

I also think the issue is arising in the flat_dict section of ami_md.json, though I can't really tell why pandas would behave this way. Any help would be much appreciated, as this is causing ongoing JSON problems that we'd like to resolve as quickly as possible.

Thanks,

Ben

cgmcnamara commented 2 years ago

This is really a shot in the dark but I was looking at ami-tools/ami_md/ami_json.py and noticed this on lines 103-105:

  def coerce_strings(self):
    for key, item in self.dict["bibliographic"].items():
      self.dict["bibliographic"][key] = str(item).split('.')[0]

Could this possibly be contributing?
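To check the suspicion, here is a minimal sketch that applies the same `str(item).split('.')[0]` expression to the title from the report and to a float CMS ID. The variable names are only illustrative and are not taken from ami-tools.

```python
# The expression from coerce_strings, applied to two kinds of values.
# Variable names here are hypothetical, for illustration only.

title = ("Christopher Scott: interview (pt. 2) (8-1-89) -- "
         "John Brockmeyer: interview (pt. 1) (8-2-89)")

# Everything from the first period onward is discarded.
truncated_title = str(title).split('.')[0]

# A CMS ID read from a spreadsheet as a float loses its ".0" suffix,
# which appears to be the case the expression was written for.
cms_id = str(123456.0).split('.')[0]

print(repr(truncated_title))
print(repr(cms_id))
```

The truncated title matches the value seen in the JSON, so the split looks like a plausible cause for any text field that contains a period.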

bturkus commented 2 years ago

absolutely! @nkrabben could you remind us why this was added in the first place? It totally escapes me...

nkrabben commented 2 years ago

The reason for this is that fields like CMS ID were sometimes read as numbers from the spreadsheets, and they need to be strings.

# number input
123456.0
123456
# desired output
'123456'

The function I made to fix this was way too aggressive. Patch coming in #81.

Let me know if it works as expected for you. I tested it on the .mer you sent.
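The patch in #81 is not reproduced in this thread. As a rough sketch of what a less aggressive coercion could look like, the version below only rewrites values that are actually whole-number floats and leaves all other text untouched. `coerce_value` is a hypothetical helper, and this standalone `coerce_strings` takes a plain dict rather than operating on `self.dict["bibliographic"]`; neither is claimed to match the real fix.

```python
# A sketch of a narrower coercion; NOT the actual patch from #81.
# coerce_value is a hypothetical helper introduced for illustration.

def coerce_value(value):
    """Return value as a string, dropping '.0' only from whole-number floats."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


def coerce_strings(bibliographic):
    """Return a copy of the dict with every value coerced to a string."""
    return {key: coerce_value(item) for key, item in bibliographic.items()}


record = {
    "cmsItemID": 123456.0,  # hypothetical field name
    "title": "Christopher Scott: interview (pt. 2) (8-1-89)",
}
print(coerce_strings(record))
```

This assumes numeric IDs arrive as numeric types. An ID that was already read as the string "123456.0" would pass through unchanged and would need separate handling.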

bturkus commented 2 years ago

tested and working and very happy. thank you!