HackBrexit / MinistersUnderTheInfluence

MIT License

db: express whether an entity has been 'cleaned' or not #150

Open aspiers opened 7 years ago

aspiers commented 7 years ago

Description

We need to add a way to express whether or not an entity has already been deduplicated. The need for this was realised at the 2017/02/07 meeting, when we discussed how to implement a workflow which would enable volunteers to help manually deduplicate incoming entities from data sources after import.

Comments, Questions and Considerations

We want an automatic deduplication heuristic which runs during import and tries to match each newly imported entity with an existing one. For example, the most naive implementation would simply look for an existing entity whose name is identical to the one being imported. If it finds one, it would assume that both refer to the same entity and reuse the existing one rather than create a new unclean entity.
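As a rough illustration of that naive heuristic, here is a minimal Python sketch. The `Entity` class, its `clean` attribute, and the `find_match` / `import_entity` helpers are all hypothetical names made up for this example; they are not taken from the project's code.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entity:
    """Hypothetical in-memory stand-in for a row in the entity table."""
    id: int
    name: str
    clean: bool  # assumed marker: False until a volunteer has reviewed it


def find_match(name: str, entities_by_name: Dict[str, Entity]) -> Optional[Entity]:
    """Naive heuristic: only an identical name counts as a match."""
    return entities_by_name.get(name)


def import_entity(name: str, entities_by_name: Dict[str, Entity]) -> Entity:
    """Reuse a matching entity, or create a new unclean one."""
    match = find_match(name, entities_by_name)
    if match is not None:
        return match
    # Sketch-only id allocation; a real database would assign the id.
    entity = Entity(id=len(entities_by_name) + 1, name=name, clean=False)
    entities_by_name[name] = entity
    return entity
```

Note that an exact comparison treats "BBC" and "B.B.C." as different entities, so the second would be created as unclean and left for manual deduplication.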

In terms of implementation, we had initially thought a boolean flag would suffice. However, after further discussion at the 2017/03/14 meeting we concluded that a better way would be to add an extra table between the Organisation Entity table and the Meetings table. This table would be for 'Entity Names' and would store names in the raw form read in from the CSV file. It would contain a field for the name as well as an optional foreign key link to the entity table; whether or not that key is present indicates whether the entry is 'clean'. Furthermore, instead of needing a name field itself, the entity table can link back to an entry in the new table, which will then be considered the canonical name for that entity. If the canonical name has not yet been seen, a new entry can be created purely for that purpose.
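One way to picture this layout is the following sketch using Python's built-in `sqlite3` module. The table and column names (`entity_names`, `entities`, `meetings`, `canonical_name_id` and so on), the `UNIQUE` constraint on names, and the helper functions are assumptions made for illustration only; the project's actual schema and database may differ.

```python
import sqlite3

# Assumed schema: raw names sit between meetings and entities.
SCHEMA = """
CREATE TABLE entity_names (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE,             -- raw form from the CSV
    entity_id INTEGER REFERENCES entities(id)   -- NULL means not yet 'clean'
);
CREATE TABLE entities (
    id                INTEGER PRIMARY KEY,
    canonical_name_id INTEGER NOT NULL REFERENCES entity_names(id)
);
CREATE TABLE meetings (
    id             INTEGER PRIMARY KEY,
    entity_name_id INTEGER NOT NULL REFERENCES entity_names(id)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


def unclean_names(conn: sqlite3.Connection) -> list:
    """Raw names that have no link to an entity yet."""
    rows = conn.execute(
        "SELECT name FROM entity_names WHERE entity_id IS NULL ORDER BY name"
    )
    return [row[0] for row in rows]


def create_entity(conn: sqlite3.Connection, canonical_name: str) -> int:
    """Create an entity, adding the canonical name if it has not been seen."""
    conn.execute(
        "INSERT OR IGNORE INTO entity_names (name) VALUES (?)", (canonical_name,)
    )
    name_id = conn.execute(
        "SELECT id FROM entity_names WHERE name = ?", (canonical_name,)
    ).fetchone()[0]
    entity_id = conn.execute(
        "INSERT INTO entities (canonical_name_id) VALUES (?)", (name_id,)
    ).lastrowid
    conn.execute(
        "UPDATE entity_names SET entity_id = ? WHERE id = ?", (entity_id, name_id)
    )
    return entity_id


def link_name(conn: sqlite3.Connection, raw_name: str, entity_id: int) -> None:
    """Mark a raw name as clean by linking it to an entity."""
    conn.execute(
        "UPDATE entity_names SET entity_id = ? WHERE name = ?", (entity_id, raw_name)
    )
```

In this sketch no separate flag is stored: a name counts as 'clean' exactly when `entity_id` is set, which is what makes the boolean column unnecessary.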

Blocks

Acceptance Criteria

This story can be considered done when the following acceptance tests are satisfied:

Given a new data file to import

When the data file is imported

Then the importer tries to match each value in the data file against all entity names already in the database. For each value where no match can be found, it creates a new entity name that's not yet linked to an entity.
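The acceptance test could be expressed along the following lines. This is only a sketch: `import_names` is a hypothetical importer that works on an in-memory dictionary mapping each raw entity name to its linked entity id (or `None` when unlinked), standing in for the real database.

```python
from typing import Dict, Iterable, List, Optional


def import_names(
    values: Iterable[str], entity_names: Dict[str, Optional[int]]
) -> List[str]:
    """Match each value against known names; add unmatched ones as unlinked."""
    created = []
    for value in values:
        if value in entity_names:
            continue  # matched an existing entity name, nothing to create
        entity_names[value] = None  # new name, not yet linked to an entity
        created.append(value)
    return created


def test_import_creates_unlinked_names() -> None:
    # Given a new data file to import (represented here by a list of values)
    entity_names = {"BBC": 1}
    data_file = ["BBC", "Acme Ltd", "Acme Ltd"]
    # When the data file is imported
    created = import_names(data_file, entity_names)
    # Then unmatched values become entity names with no entity link
    assert created == ["Acme Ltd"]
    assert entity_names == {"BBC": 1, "Acme Ltd": None}
```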

aspiers commented 7 years ago

@JohnSmall @Greatlemer Does this look right to you?

Greatlemer commented 7 years ago

@aspiers, I've repurposed this case to fit with what we discussed last week, hope that's ok.

aspiers commented 7 years ago

@Greatlemer More than OK, it's what I would have suggested ;-)