MAPC / arts-and-culture-asset-list

0 stars 0 forks source link

Change Relationship between Application and Spaces #1

Open theryankelly opened 3 years ago

theryankelly commented 3 years ago

We need to change Applications from a one-to-one association with Spacesto a many to one.

In the creation of this current list, we created a new space every time there was an application from Source 2 (CFF). There should be a relationship of many applications associated with one space. We currently have 684 unique Organizations and should have around the same number of Spaces generated from source 2, not the current 1,693. These extra records cause issues to the Sites table as well.

lperkinshigh commented 3 years ago

I think the true number of duplicated spaces from Source 2 (CFF) is actually around 400. I got here by filtering Spaces to Source 2 (a new field created using the Space ID) and then using the DeDupe Airtable App to isolate Spaces that had identical values for every field BUT:

It's harder to use the method I used to deduplicate the Sites tab here, as I'm identifying these as duplicates based on 20 identical fields, as opposed to 1. I could use a formula to string each of the identical fields together and then link on that field...but it seems risky, as I don't know how Airtable would behave. Thoughts?

theryankelly commented 3 years ago

If the Airtable Dedupe or Open Refine can identify duplicates based on selected columns do, we need to create a single string on which to match? Can we select just one record for each group the 403 records identified by the Airtable dedupe?

I also wonder if we can just forget the whole Space to Application relationship. Since the applications to CFF are for building construction I think it makes more sense to only be associated with the Site.

I would think we could either:

Either path we should edit the name while we are in there and decide on how we want to treat side “the” participles and “^”.

theryankelly commented 3 years ago

Will note that outside of source 2 spaces may be duplicated for other reasons most notably coming from separate sources. Separate sources have brought in data to different columns and records still need to be flattened/merged. We certainly don’t want to flatten or merge just on Space Name or Source Line Number as it’s not unique enough. This is best showcased by Harwich Cultural Center which has 31 spaces with the same Space Name. We can see that 30 are unique spaces and 1 is probably best categorized as a site (151). This site-looking record lists Number of Units listing “50*” which may be data we want to move over to the site. I am sure there are a good number of cases we could find that slipped through the duplication already done and need to either be merged and names reconciled or move to Site where appropriate.

Harwich Cultural Center

Might need to move this to a separate issue to handle further dedup after source 2 application issues, but wanted to loop in @atomtay and @mzagaja to get your thoughts on how we might also script an approach to this duplication problem. How might we leverage Regex or Python or the Dedup.io app to identify differences and merge?