bwatson78 closed this 4 years ago
@fnibbit Sure. I'm working from the digital archive standards outlined here: https://guides.lib.purdue.edu/c.php?g=352889&p=2378064 .
So, while checking whether I could work with the author name, I found many examples of names containing extras (dates of birth and death inside parentheses, names already formatted for one particular style, e.g. "Creator, Sample O.", etc.). Because names with these extras would be difficult to format, I added a test that checks for any numbers or special characters and, if it finds any, just returns the names joined by the joiner for the right style. If the test finds no special characters, the name is properly formatted for the associated style (`sanitized_*_auth`). The `auth_no_per` method is there because I found many authors' names that ended with a period, which made some of them come out like "Creator., S. P.", so it scrubs that trailing period.
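For readers following along, here is a minimal sketch of the flow described above. It is not the PR's code: the `EXTRAS` character class, the `sanitized_example_auth` formatter, `format_authors`, and the `"; "` joiner are all hypothetical stand-ins; only the `auth_no_per` name and the `sanitized_*_auth` naming pattern come from the comment.

```ruby
# Hypothetical sketch of the fallback logic described above; not the PR's code.

# Assumption: anything other than ASCII letters, whitespace, or a period counts
# as an "extra" (digits, parentheses, commas, ...).
EXTRAS = /[^A-Za-z\s.]/

# Strip one trailing period so "Sample O. Creator." cannot become "Creator., S. O."
def auth_no_per(name)
  name.strip.sub(/\.\z/, "")
end

# Hypothetical "Family, G. M." formatter standing in for one sanitized_*_auth method.
def sanitized_example_auth(name)
  *given, family = auth_no_per(name).split
  initials = given.map { |part| "#{part[0]}." }.join(" ")
  initials.empty? ? family.to_s : "#{family}, #{initials}"
end

# If any name carries extras, return the raw names joined with the style's
# joiner; otherwise format each name for the style first.
def format_authors(names, joiner: "; ")
  if names.any? { |name| name.match?(EXTRAS) }
    names.join(joiner)
  else
    names.map { |name| sanitized_example_auth(name) }.join(joiner)
  end
end
```

With these assumptions, `format_authors(["Sample O. Creator."])` takes the formatting path, while a name such as `"Creator, Sample O. (1920-1999)"` is passed through untouched.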
Do the examples you've looked at cover what happens with diacritics (François for instance)?
Thanks for pointing that out. I tested "François \/, " with `/[^\p{L}\s]+/` and it matches only what I want. I'll update the regex and run `rspec`. Anything else to change?
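A quick way to reproduce that check in `irb` is sketched below. The constant names are made up for illustration, and the decomposed-form case is an extra assumption about the data that the thread does not mention: if a name is stored as a base letter plus a combining mark, the mark is not in `\p{L}` and would be flagged unless the string is normalized first.

```ruby
# Matches runs of characters that are neither Unicode letters nor whitespace.
NON_LETTER = /[^\p{L}\s]+/

# The string from the comment: "François", a space, backslash, slash, comma, space.
SAMPLE = "François \\/, "

# Only the punctuation run is matched; the precomposed "ç" counts as a letter.
SAMPLE_MATCHES = SAMPLE.scan(NON_LETTER) # => ["\\/,"]

# Assumed edge case: "c" followed by U+0327 COMBINING CEDILLA (decomposed form).
DECOMPOSED = "Franc\u0327ois"
DECOMPOSED_FLAGGED = DECOMPOSED.match?(NON_LETTER)                        # => true
NORMALIZED_FLAGGED = DECOMPOSED.unicode_normalize(:nfc).match?(NON_LETTER) # => false
```

If decomposed names can occur in the source data, calling `unicode_normalize(:nfc)` before the test would keep them on the formatting path.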
Names are slippery. (An old favorite article: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/) Things that aren't letters that are in people's names sometimes: numerals (journalist Jennifer 8. Lee for instance), apostrophes (Peter O'Toole), hyphens (Mountbatten-Windsor). I feel like there might be name-manipulation test suites out there?
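As a rough illustration of how these names interact with a letters-and-whitespace test, the sketch below runs the regex quoted earlier in this thread over a small fixture list. The fixture list and the `flagged_runs` helper are hypothetical; they are meant as a starting point for a spec, not as an existing test suite.

```ruby
# Same pattern as quoted earlier in the thread: runs of non-letter, non-space characters.
NON_LETTER_RUN = /[^\p{L}\s]+/

# Hypothetical fixtures drawn from the names mentioned in this thread.
NAME_FIXTURES = [
  "Jennifer 8. Lee",
  "Peter O'Toole",
  "Mountbatten-Windsor",
  "François",
  "Creator, Sample O. (1920-1999)",
].freeze

# Returns every run of characters the pattern would treat as "extras".
def flagged_runs(name)
  name.scan(NON_LETTER_RUN)
end

NAME_FIXTURES.each do |name|
  puts "#{name.inspect} => #{flagged_runs(name).inspect}"
end
```

Under this pattern, the numeral, the apostrophe, and the hyphen are all flagged, so those legitimate names would be treated the same way as a name carrying a date range.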
I did take a look for name-manipulation test suites in ruby and didn't find much. If you had some in mind, I'd definitely take a look at them. I agree with you that it's hard to discern between "Mountbatten-Windsor" and "[1920-1999]".
My thinking behind this is to take the extra step for the user to deliver a properly formatted author name whenever I have a chance. Since my code is only formatting names when the CiteProc Gem errs, I thought it best to fall back on purely returning the tesim field whenever it doesn't come through perfectly (meaning, the presence of special characters and numbers that would show up in odd places, as well as commas that would imply that the name has already been formatted.)
Also, bear in mind that this is how the CiteProc gem is formatting document 634sj3txf8-cor:
As you can see, the extra steps I'm taking are more than the Gem is doing.
I'm sorry, I'd gotten lost and misinterpreted both your code and your explanation (my misinterpretation was far more interventionist than what I now think it's doing). I'd like to come back to it with fresh eyes tomorrow, maybe pair so you can talk me through it -- or if someone else is ready to review, that's cool too.
@fnibbit The method names have been updated.