New data structure, preparing database move

davemacdo commented 6 years ago

Background

Currently, the data is in a Google Sheet. I use the Sheets API manually (silly to use an API manually, I know) to pull the content as JSON and check it into the repo here from time to time. The spreadsheet is maintained by a fantastic group of researchers and informed by user submissions in a Google Form.

The sheet uses a series of columns that are used as booleans ('X' for true, empty string for false). This is pretty easy for the researchers to handle, but makes for a lot of fields per record.

New format

I would like to convert the data to a different format before moving it elsewhere. Here's my current thinking on what we need to have:

firstname (string)
surname (string)
surnamefirst (boolean), determines if the composer is from a culture where surnames are listed before given names. They would still be alphabetized by surname, like in a library database
deathyear (string)
gender (string)
genre (array of strings)
subgenre (array of strings)
demographic (array of strings)
locationshort (string), city and state if available, abbreviated where appropriate
locationlong (string), city and state if available, may be identical to locationshort if there is no standard abbreviation
country (string)

This moves from 40+ fields per record to 11. Even just making this change and keeping the data where it is could shrink the flat file enough to improve performance. Thoughts?

captbaritone commented 6 years ago

So, is this task just to define a map function which converts the existing format to the new format?

davemacdo commented 6 years ago

That will be next. First, I wanted to see if anybody had a better idea for structuring the data. A map function will come next.

davemacdo commented 6 years ago

I started working on a function to parse the names in the current combined name field into parts and found some edge cases we'll need to make a determination on how to handle.

There are some composers who have multiple names. These could be name changes due to marriage, stage names, gender transitions, etc. To be most useful as a research database, I think it's important to keep this information and make sure it's searchable. The current spreadsheet doesn't treat them consistently, so there will need to be some minimal manual adjustments (there are only a handful of these at the moment) anyway.

My idea would be to add a field for alternate names as an array of strings, which I would prefer to list as firstname surname, since some of these may be stage names that have formatting that should be maintained (for example, it would be silly to write "Z, Pamela"). Does this seem like a practical way to store/retrieve this information? @robdeemer?

robdeemer commented 6 years ago

That looks good to me, David!

On Sat, Jun 9, 2018 at 10:07 PM, David MacDonald notifications@github.com wrote:

I started working on a function to parse the names in the current combined name field into parts and found some edge cases we'll need to make a determination on how to handle.

There are some composers who have multiple names. These could be name changes due to marriage, stage names, gender transitions, etc. To be most useful as a research database, I think it's important to keep this information and make sure it's searchable. The current spreadsheet doesn't treat them consistently, so there will need to be some minimal manual adjustments (there are only a handful of these at the moment) anyway.

My idea would be to add a field for alternate names as an array of strings, which I would prefer to list as firstname surname, since some of these may be stage names that have formatting that should be maintained (for example, it would be silly to write "Z, Pamela"). Does this seem like a practical way to store/retrieve this information? @robdeemer https://github.com/robdeemer?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/davemacdo/composer-diversity/issues/31#issuecomment-396013825, or mute the thread https://github.com/notifications/unsubscribe-auth/AlCxlf0cPVpymFxj6b_Uf28kFUllW6vEks5t7H9GgaJpZM4UgAsn .

--

Dr. Rob Deemer Head of Music Composition & Special Assistant to the President American Council on Education Fellow (2016-2017) School of Music, State University of New York at Fredonia

Chair, NYSSMA Composition & Improvisation Committee Chair-Elect, NAfME Council of Music Composition

Composer-in-Residence, Buffalo Chamber Players http://buffalochamberplayers.org/ Composer-in-Residence, Harmonia Chamber Singers https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CB4QFjAAahUKEwjmv7jhh-3IAhWLdD4KHSr2CNM&url=http%3A%2F%2Fwww.harmoniacs.org%2F&usg=AFQjCNGgWsCqj4leupJVTtIhJNwYXY26nw&sig2=TKTS2XdNhV0m9B3O3b0-jg

3162 Mason Hall • Fredonia, NY 14063 • (716) 673-3133 e-mail: deemer@fredonia.edu • web: www.robdeemer.com

gtwright commented 6 years ago

What about doing a little more grouping and going to something like this? Name and location could both become objects, to keep the top level even cleaner. And perhaps "years" as grouping with birth and death within.

name (object) --firstname (string) --surname (string) --surnamefirst (boolean), determines if the composer is from a culture where surnames are listed before given names. They would still be alphabetized by surname, like in a library database
years (object) -- birthyear (string) -- deathyear (string)
gender (string)
genre (array of strings)
medium (array of strings)
demographic (array of strings)
location (object) -- locationshort (string), city and state if available, abbreviated where appropriate -- locationlong (string), city and state if available, may be identical to locationshort if there is no standard abbreviation -- country (string) -- in USA (boolean)

I'm also keen on the idea of geocoding location with lat/lng coordinates, but that would make the import more difficult. (Might not be a dealbreaker, but it does make it tougher)

davemacdo commented 6 years ago

I like the idea of name and location as objects. We're not currently collecting birth years in the research, but there's no reason it couldn't have a field if that ever changes. I have also been thinking about geocoding locations, as I would love to be able to search for "within x miles of y". There are some tough problems associated with that: some records have no location data, records have different levels of precision (city, state, country), and some records have multiple locations listed. I think it's worth setting up something. Perhaps we could write a function that would update lat/lng any time the location object changes? That way the data is there, but not calculated dynamically for each query. (It also makes it easier to add later once we figure out the best way to generate it from the data.)

Oh, and I'm a dummy and forgot the URL field in my initial post. My changes bolded:

name (object) --firstname (string) --surname (string) --surnamefirst (boolean), determines if the composer is from a culture where surnames are listed before given names. They would still be alphabetized by surname, like in a library database -- aka (array of strings)
url (string)
years (object) -- birthyear (string) -- deathyear (string)
gender (string)
genre (array of strings)
medium (array of strings)
demographic (array of strings)
location (object) -- locationshort (string), city and state if available, abbreviated where appropriate -- locationlong (string), city and state if available, may be identical to locationshort if there is no standard abbreviation -- country (string) -- in USA (boolean) -- coords (string? array of strings? object of floats?)

Let's punt on the geocoding for now, but put it on the roadmap for sure.

ianring commented 6 years ago

misc thoughts: "in USA" is redundant if we are storing country?

Gender, Genre, Medium, Demographic - I get that the data will be stored as a string, but let's make sure they are treated like foreign keys, not free text; ie there should be a canonical list of each, and the keys will all be present in the i18n lexica.

Also, just an observation, this data structure is PERFECT for a Mongo database. https://www.mongodb.com/

gtwright commented 6 years ago

I had a similar thought about "in USA" but I left it in since I noticed at least one person that had two countries listed in their location field, so I wasn't sure if we could count on formatting being consistent. I'm all for eliminating unnecessary data entry, but maybe it could live on as a convenience method that is dynamically generated based on the location.

And on a related note: should location be an array of objects to handle people who are associated with multiple locations?

Totally with you that this screams out Mongo. It sounds like there's some leaning towards Wordpress. If this goes the Wordpress route, I was assuming gender, genre, medium, demographics would all be custom taxonomies. I haven't worked with i18n, so I don't know if that would work.

ianring commented 6 years ago

I haven't been following the discussion about Wordpress, but I would resist forcing this data into WP's schema.

My vote would be for a REST API with an agnostic front-end that will answer search requests from anywhere.

davemacdo commented 6 years ago

I like the idea of ditching "in USA" in favor of a function, and I agree with @ianring on drawing those strings from a limited set. I imagine we could just control this from the application end. Using the keys from the string localization should make lots of things easier.

I also like @gtwright's idea of making location into an array to handle multiples a little more nicely.

The WP discussion was based on making the tool as easy and friendly as possible to the research team who will be maintaining the data. WP gives us user management tools and a possibly familiar UI, but Ian makes a good point about the schema. And I think I mentioned before that we might end up spending as much time tweaking WP as we would making something from scratch(ish). Ian, I assume you're imagining the API in PHP on the server? Reading everything from an API is a good goal whichever way we choose, and a headless WP loses some of the benefits of using WP over other options.

For the database platform, I'm totally ok with either WP or a Mongo database. I'm learning pretty much every piece of this as we go, so I'll defer to you all.

gtwright commented 6 years ago

I'm totally on board for an agnostic front-end. 👍

For the back-end, I'd say it's all about what'll be easy for the research team (who I don't know at all) and how much time and/or money to get there. Given that the data set is relatively straight forward, I think there are probably several CMSs that would do the trick—and that building something from scratch wouldn't be too onerous.

The big reason I still lean towards Wordpress is that it is such a known commodity, both for the research team (yay for less time on training and tech support) and for estimating project complexity on our end. And an extra bonus is Wordpress doesn't scale in cost per user, like many other CMSs.

All that being said, I'm pretty open minded about solutions. I just don't want to sink time into building things like user auth or answering UI questions for a research team if we don't have to :)

gtwright commented 6 years ago

While we're talking alternative back-end options, I'm pretty in love with both Prisma and GraphCMS these days, since I'm newly obsessed with GraphQL. The Team plan for Prisma is $15/month and has unlimited users, so that's not bad at all.

davemacdo commented 6 years ago

Prisma and GraphCMS look pretty darn cool.

I mentioned this in an earlier Issue, but want to put it in this thread: One option I was looking at early on is Google Firestore, which is another NoSQL document database that also comes with a handful of useful bells and whistles like Firebase Auth that we could use for user management and Cloud Functions, which could help with things like updating coordinates automatically when location fields change. The free plan looks like it could support us for a while, and it even includes some hosting if needed.

gtwright commented 6 years ago

Just put together some quick custom post type/taxonomies/fields in WP to see what this might look at. I've got a little proof of concept up here. @davemacdo Sending you a log in now. If anyone else wants a login, just send me your email.

I went ahead and swapped the Algolia demo over to use the index from this WP site. It should work that any change to a post in WP gets re-indexed right away.

This is all pretty rough, but I just wanted to see how well the indexing would work with this setup. Seems a reasonable proof of concept if we want to go the WP route.

davemacdo commented 6 years ago

I'm pretty impressed with the WordPress editing side, and I think the user management and permissions make it a good choice. Here's a screenshot of the editing UI for those interested:

robdeemer commented 6 years ago

This looks great!

———————— Dr. Rob Deemer Head of Music Composition & Assistant to the President American Council on Education Fellow (2016-2017) School of Music, State University of New York at Fredonia

Chair, NYSSMA Composition & Improvisation Committee Chair-Elect, NAfME Council of Music Composition

Composer-in-Residence, Buffalo Chamber Players Composer-in-Residence, Harmonia Chamber Singers

3162 Mason Hall • Fredonia, NY 14063 • (716) 673-3133 e-mail: deemer@fredonia.edu • web: www.robdeemer.com

On Jun 13, 2018, at 11:30 AM, David MacDonald notifications@github.com wrote:

I'm pretty impressed with the WordPress editing side, and I think the user management and permissions make it a good choice. Here's a screenshot of the editing UI for those interested:

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

davemacdo commented 6 years ago

Graham, do you think a WP backend would work doing our own realtime search, or would it require an indexing service? Querying with each keystroke seems costly, but I haven't tried it.

gtwright commented 6 years ago

@davemacdo Don't know if I would try the realtime/autocomplete approach directly through WP queries (i.e. without indexing through Algolia, Elasticsearch, or something similar). But I don't know if realtime would be totally necessary either.

The WP/Algolia plugin is pretty good at re-indexing every time a change is made. I put a couple of test records in there just (with "[TESTING]" in the name) just for playing around with. The changes in the WP back end should show up in that Algolia mock-up within a few seconds (in theory, it shouldn't even need a refresh). Feel free to edit those like crazy and see what pops up in searches. There's also a console log of all of the search hits so you can see what's coming out there.

I did a quick and dirty import of the old db from the JSON file in this repo (perhaps a few days old). I did a couple of spot checks for accuracy, but not a deep audit. So the composer list in WP should be in decent shape.

I'm still pretty open to the idea of alternative backends. I mainly wanted to see what a WP proof of concept would look like. I'm fairly satisfied that this would do the trick.

Oh, and @davemacdo, I made you an admin in WP, so you can add anyone else you'd like.

Anyone else can get in with username demo, pwd demo with contributor access. Or let me know your email and I'll set you up with your own account to dig around.

davemacdo commented 6 years ago

It's good to know how fast the updates are run. I think we should go for it. I'd lean toward Algolia only because I have nightmares about the AWS console. I agree realtime search isn't necessary, but it is pretty cool.

@gtwright Should we work out the WP side with the content you have now and then grab a fresh set of data right before the switch? I can start a repo for the server-side stuff and give you push access. I can also set up a subdomain of composerdiversity.com for it (admin? research? data?).

gtwright commented 6 years ago

Sorry to be awol...jury duty this week. Fun times. (Looks like I'll be on a case through Tuesday...)

I'd say it's a good plan to work out any WP schema/structure issues, then clear out the db and re-import. I have a fairly jury rigged php template set to import into WP from a JSON. It's masquerading as a page template here.

I've gone ahead and update the admin repo with the Wordpress install, so we can move any Wordpress discussion over there.

davemacdo commented 6 years ago

No worries! I'm crashing a music deadline and planning some travel late next week.

For anybody who wants to follow or contribute to the admin repo, it's here.

davemacdo / composer-diversity

New data structure, preparing database move #31

Background

New format