Princeton-CDH / mep-django

Shakespeare and Company Project - Python/Django web application
https://shakespeareandco.princeton.edu
Apache License 2.0
5 stars 1 forks source link

New export management commands #804

Closed quadrismegistus closed 7 months ago

quadrismegistus commented 8 months ago

We need to customize/add to the JSON+CSV export commands we now have (export_members, export_books, export_events) to reflect the following:

  1. A new genre field called category on books, doable by modifying export_books. No issue yet

  2. New nationality and gender info on authors/creators, and indeed all info on creators, doable by writing a export_creators command that ideally subclasses/inherits most of the logic from the export_members command, since both draw on the same Person table. See #684

  3. New time-sensitive location information about who lived where when, doable by a new separate command, export_locations. See #790

This PR has a minimal working version of (2) using the approach described there.

codecov[bot] commented 8 months ago

Codecov Report

Merging #804 (f70e772) into develop (b3d034b) will increase coverage by 0.00%. The diff coverage is 98.60%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## develop #804 +/- ## ========================================= Coverage 97.39% 97.40% ========================================= Files 231 233 +2 Lines 13241 13349 +108 Branches 81 81 ========================================= + Hits 12896 13002 +106 - Misses 342 344 +2 Partials 3 3 ```
quadrismegistus commented 7 months ago

All the code changes look good, I think we mainly have questions about finalizing field specifics. I'm inclined to get this in front of Josh sooner rather than later while we decide the picky details. Should be straightforward to make remaining field tweaks after that, I think.

The one change I'd like to see before we merge for a first round of testing is to simplify the location export code to use an Address queryset and only use related accounts.

If you find any instances of addresses associated with person instead of account in production data, let's treat it as a data cleanup problem.

In this case, since we want a row per person per address, and if we return simply one Address per one call of get_export_data() -> dict, and since addresses can contain more than one person, I'm not sure how we can guarantee to get one person per exported dict. There are 53 addresses with more than one person in the associated account.

quadrismegistus commented 7 months ago

There are also 3 addresses with 0 persons in the account:


In [8]: for addr in Address.objects.all():
   ...:     if len(addr.account.persons.all())==0:
   ...:         print(f'{len(addr.account.persons.all())} persons in account for address {addr}')
   ...: 
0 persons in account for address 41 avenue du Maine, Paris — Account #320: 41 avenue du Maine
0 persons in account for address 220 boulevard Raspail, Paris — Account #7544: 220 boulevard Raspail
0 persons in account for address 41 rue Censier, Paris — Account #7652: 41 rue Censier
quadrismegistus commented 7 months ago

In this case, since we want a row per person per address, and if we return simply one Address per one call of get_export_data() -> dict, and since addresses can contain more than one person, I'm not sure how we can guarantee to get one person per exported dict. There are 53 addresses with more than one person in the associated account.

Here's what happens if we use get_object_data per address object:

def get_object_data(self, obj):
        """
        Generate dictionary of data to export for a single
        :class:`~mep.people.models.Person`
        """
        addr = obj
        loc = addr.location
        persons = addr.account.persons.all()

        # required properties
        return dict(
            # Member
            member_id=[person.slug for person in persons],
            member_uri=[absolutize_url(person.get_absolute_url()) for person in persons],
            # Address data
            start_date=addr.partial_start_date,
            end_date=addr.partial_end_date,
            care_of_person_id=addr.care_of_person.slug if addr.care_of_person else None,
            # Location data
            street_address=loc.street_address,
            city=loc.city,
            postal_code=loc.postal_code,
            latitude=float(loc.latitude) if loc.latitude is not None else None,
            longitude=float(loc.longitude) if loc.longitude is not None else None,
            country=loc.country.name if loc.country else None,
            arrondissement=loc.arrondissement(),
        )

-> that uses semicolons to join the lists:

member_id   member_uri  care_of_person_id   street_address  postal_code city    arrondissement  country start_date  end_date    longitude   latitude
coyle-kathleen;coyle-kestrel-2  https://shakespeareandco.princeton.edu/members/coyle-kathleen/;https://shakespeareandco.princeton.edu/members/coyle-kestrel-2/      79 rue Denfert-Rochereau    75014   Paris   14  France  1/26/33 2/23/33 2.3343  48.8359

Which is fine but makes it not possible to easily join the csv from members to locations.

rlskoeser commented 7 months ago

There are also 3 addresses with 0 persons in the account:

Thanks for finding. Please run by Josh and check if he knows whether they should be deleted or need to be associated with an account; hopefully he'll remember or be able to figure out. Maybe they were removed from an account but not deleted, or maybe they were duplicates. If need be, we can check if there's anything useful in the admin log entries.

rlskoeser commented 7 months ago

In this case, since we want a row per person per address, and if we return simply one Address per one call of get_export_data() -> dict, and since addresses can contain more than one person, I'm not sure how we can guarantee to get one person per exported dict. There are 53 addresses with more than one person in the associated account.

It occurred to me after our last back and forth about this, addresses are like events - they are linked to accounts, not to people, and as you've noted we have a subset of accounts shared by people. My solution in the events dataset was a member_uris field that included delimited member uris for multi-person accounts - my intent was that labeling this field should make it clear you have to choose how to handle shared accounts. Whatever we do, the events and addresses datasets should work the same way. I think Josh and I documented this pretty clearly in the dataset essay (both events dataset and the fact of shared accounts).

quadrismegistus commented 7 months ago

In this case, since we want a row per person per address, and if we return simply one Address per one call of get_export_data() -> dict, and since addresses can contain more than one person, I'm not sure how we can guarantee to get one person per exported dict. There are 53 addresses with more than one person in the associated account.

It occurred to me after our last back and forth about this, addresses are like events - they are linked to accounts, not to people, and as you've noted we have a subset of accounts shared by people. My solution in the events dataset was a member_uris field that included delimited member uris for multi-person accounts - my intent was that labeling this field should make it clear you have to choose how to handle shared accounts. Whatever we do, the events and addresses datasets should work the same way. I think Josh and I documented this pretty clearly in the dataset essay (both events dataset and the fact of shared accounts).

Good point—we already are committed to joining multiple people in events table so for consistency let's just keep doing that. My latest commit is an example of that happening.