boston-library / curator

Rails API Engine For Replacing Fedora With A more SQL driven approach
MIT License
6 stars 0 forks source link

Investigate possible memory leak with the `for_reindex_all` scope on items #352

Open bbarberBPL opened 5 months ago

bbarberBPL commented 5 months ago

Noticed that when querying a bunch of digital objects(~600) using the for_reindex_all scope memory usage on one of the servers jumped from 2GB to 10GB. And the PostgreSQL connection was using a lot of CPU. I need to determine if there is a way to make this query more efficient. It is possible this is also occurring on the postgres end so we may want to look into doing additional tuning there as well

bbarberBPL commented 5 days ago

Running Curator.digital_object.for_reindex_all.explain yielded the following

 Hash Left Join  (cost=35.44..877.72 rows=86 width=3965)
   Hash Cond: (mappings_exemplary_images.exemplary_file_set_id = exemplary_file_sets_curator_mappings_exemplary_images.id)
   ->  Hash Left Join  (cost=31.85..873.89 rows=86 width=3856)
         Hash Cond: (digital_objects.id = filestreams_file_sets.file_set_of_id)
         ->  Hash Left Join  (cost=28.25..868.92 rows=34 width=3747)
               Hash Cond: (metastreams_descriptives.id = mappings_desc_host_collections.descriptive_id)
               ->  Nested Loop Left Join  (cost=20.80..861.12 rows=22 width=3695)
                     Join Filter: (mappings_exemplary_images.exemplary_object_id = digital_objects.id)
                     ->  Nested Loop Left Join  (cost=20.80..852.45 rows=22 width=3649)
                           ->  Nested Loop Left Join  (cost=20.51..696.01 rows=22 width=3543)
                                 ->  Nested Loop Left Join  (cost=20.22..539.57 rows=22 width=3429)
                                       ->  Nested Loop Left Join  (cost=19.93..383.13 rows=22 width=3315)
                                             Join Filter: (resource_types_curator_metastreams_descriptives.id = mappings_desc_terms.mapped_term_id)
                                             ->  Nested Loop Left Join  (cost=19.63..339.79 rows=22 width=3201)
                                                   ->  Nested Loop Left Join  (cost=18.78..93.66 rows=2 width=2965)
                                                         ->  Nested Loop Left Join  (cost=18.48..76.99 rows=2 width=2851)
                                                               ->  Nested Loop Left Join  (cost=18.17..60.32 rows=2 width=2737)
                                                                     ->  Nested Loop Left Join  (cost=17.87..43.87 rows=2 width=2623)
                                                                           ->  Nested Loop Left Join  (cost=17.58..27.48 rows=2 width=2509)
                                                                                 Join Filter: (mappings_desc_name_roles.descriptive_id = metastreams_descriptives.id)
                                                                                 ->  Nested Loop Left Join  (cost=17.58..25.69 rows=1 width=2477)
                                                                                       ->  Nested Loop Left Join  (cost=17.44..25.53 rows=1 width=2381)
                                                                                             ->  Merge Join  (cost=17.15..17.22 rows=1 width=2267)
                                                                                                   Merge Cond: (metastreams_administratives.administratable_id = metastreams_descriptives.digital_object_id)
                                                                                                   ->  Sort  (cost=6.13..6.14 rows=7 width=223)
                                                                                                         Sort Key: digital_objects.id
                                                                                                         ->  Hash Join  (cost=1.63..6.03 rows=7 width=223)
                                                                                                               Hash Cond: (metastreams_administratives.administratable_id = digital_objects.id)
                                                                                                               ->  Seq Scan on metastreams_administratives  (cost=0.00..4.33 rows=28 width=153)
                                                                                                                     Filter: ((administratable_type)::text = 'Curator::DigitalObject'::text)
                                                                                                               ->  Hash  (cost=1.28..1.28 rows=28 width=70)
                                                                                                                     ->  Seq Scan on digital_objects  (cost=0.00..1.28 rows=28 width=70)
                                                                                                   ->  Sort  (cost=11.02..11.03 rows=6 width=2044)
                                                                                                         Sort Key: metastreams_descriptives.digital_object_id
                                                                                                         ->  Hash Join  (cost=4.67..10.94 rows=6 width=2044)
                                                                                                               Hash Cond: (metastreams_descriptives.digital_object_id = metastreams_workflows.workflowable_id)
                                                                                                               ->  Seq Scan on metastreams_descriptives  (cost=0.00..6.21 rows=21 width=1922)
                                                                                                               ->  Hash  (cost=4.33..4.33 rows=28 width=122)
                                                                                                                     ->  Seq Scan on metastreams_workflows  (cost=0.00..4.33 rows=28 width=122)
                                                                                                                           Filter: ((workflowable_type)::text = 'Curator::DigitalObject'::text)
                                                                                             ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures  (cost=0.29..8.31 rows=1 width=114)
                                                                                                   Index Cond: (id = metastreams_descriptives.physical_location_id)
                                                                                                   Filter: ((type)::text = 'Curator::ControlledTerms::Name'::text)
                                                                                       ->  Index Scan using controlled_terms_authorities_pkey on controlled_terms_authorities  (cost=0.14..0.16 rows=1 width=96)
                                                                                             Index Cond: (id = controlled_terms_nomenclatures.authority_id)
                                                                                 ->  Seq Scan on mappings_desc_name_roles  (cost=0.00..1.35 rows=35 width=32)
                                                                           ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures names_curator_mappings_desc_name_roles  (cost=0.29..8.20 rows=1 width=114)
                                                                                 Index Cond: (id = mappings_desc_name_roles.name_id)
                                                                                 Filter: ((type)::text = 'Curator::ControlledTerms::Name'::text)
                                                                     ->  Memoize  (cost=0.30..8.21 rows=1 width=114)
                                                                           Cache Key: mappings_desc_name_roles.role_id
                                                                           Cache Mode: logical
                                                                           ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures roles_curator_mappings_desc_name_roles  (cost=0.29..8.20 rows=1 width=114)
                                                                                 Index Cond: (id = mappings_desc_name_roles.role_id)
                                                                                 Filter: ((type)::text = 'Curator::ControlledTerms::Role'::text)
                                                               ->  Memoize  (cost=0.30..8.32 rows=1 width=114)
                                                                     Cache Key: metastreams_descriptives.license_id
                                                                     Cache Mode: logical
                                                                     ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures licenses_curator_metastreams_descriptives  (cost=0.29..8.31 rows=1 width=114)
                                                                           Index Cond: (id = metastreams_descriptives.license_id)
                                                                           Filter: ((type)::text = 'Curator::ControlledTerms::License'::text)
                                                         ->  Memoize  (cost=0.30..8.32 rows=1 width=114)
                                                               Cache Key: metastreams_descriptives.rights_statement_id
                                                               Cache Mode: logical
                                                               ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures rights_statements_curator_metastreams_descriptives  (cost=0.29..8.31 rows=1 width=114)
                                                                     Index Cond: (id = metastreams_descriptives.rights_statement_id)
                                                                     Filter: ((type)::text = 'Curator::ControlledTerms::RightsStatement'::text)
                                                   ->  Nested Loop Left Join  (cost=0.85..122.99 rows=8 width=244)
                                                         ->  Nested Loop Left Join  (cost=0.56..66.07 rows=8 width=130)
                                                               ->  Index Scan using "index_curator.mappings_desc_terms_on_descriptive_id" on mappings_desc_terms  (cost=0.27..9.17 rows=8 width=16)
                                                                     Index Cond: (descriptive_id = metastreams_descriptives.id)
                                                               ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures genres_curator_metastreams_descriptives  (cost=0.29..7.11 rows=1 width=114)
                                                                     Index Cond: (id = mappings_desc_terms.mapped_term_id)
                                                                     Filter: ((type)::text = 'Curator::ControlledTerms::Genre'::text)
                                                         ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures languages_curator_metastreams_descriptives  (cost=0.29..7.11 rows=1 width=114)
                                                               Index Cond: (id = mappings_desc_terms.mapped_term_id)
                                                               Filter: ((type)::text = 'Curator::ControlledTerms::Language'::text)
                                             ->  Materialize  (cost=0.29..39.74 rows=11 width=114)
                                                   ->  Index Scan using "index_curator.controlled_terms_nomenclatures_on_type" on controlled_terms_nomenclatures resource_types_curator_metastreams_descriptives  (cost=0.29..39.69 rows=11 width=114)
                                                         Index Cond: ((type)::text = 'Curator::ControlledTerms::ResourceType'::text)
                                       ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures subject_topics_curator_metastreams_descriptives  (cost=0.29..7.11 rows=1 width=114)
                                             Index Cond: (id = mappings_desc_terms.mapped_term_id)
                                             Filter: ((type)::text = 'Curator::ControlledTerms::Subject'::text)
                                 ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures subject_names_curator_metastreams_descriptives  (cost=0.29..7.11 rows=1 width=114)
                                       Index Cond: (id = mappings_desc_terms.mapped_term_id)
                                       Filter: ((type)::text = 'Curator::ControlledTerms::Name'::text)
                           ->  Index Scan using controlled_terms_nomenclatures_pkey on controlled_terms_nomenclatures subject_geos_curator_metastreams_descriptives  (cost=0.29..7.11 rows=1 width=114)
                                 Index Cond: (id = mappings_desc_terms.mapped_term_id)
                                 Filter: ((type)::text = 'Curator::ControlledTerms::Geographic'::text)
                     ->  Materialize  (cost=0.00..1.45 rows=23 width=46)
                           ->  Seq Scan on mappings_exemplary_images  (cost=0.00..1.34 rows=23 width=46)
                                 Filter: ((exemplary_object_type)::text = 'Curator::DigitalObject'::text)
               ->  Hash  (cost=7.05..7.05 rows=32 width=60)
                     ->  Hash Left Join  (cost=5.63..7.05 rows=32 width=60)
                           Hash Cond: (mappings_desc_host_collections.host_collection_id = mappings_host_collections.id)
                           ->  Seq Scan on mappings_desc_host_collections  (cost=0.00..1.32 rows=32 width=16)
                           ->  Hash  (cost=5.28..5.28 rows=28 width=52)
                                 ->  Seq Scan on mappings_host_collections  (cost=0.00..5.28 rows=28 width=52)
         ->  Hash  (cost=2.71..2.71 rows=71 width=109)
               ->  Seq Scan on filestreams_file_sets  (cost=0.00..2.71 rows=71 width=109)
   ->  Hash  (cost=2.71..2.71 rows=71 width=109)
         ->  Seq Scan on filestreams_file_sets exemplary_file_sets_curator_mappings_exemplary_images  (cost=0.00..2.71 rows=71 width=109)
(102 rows)

This is way to expensive of a query. Not to mention highly inneficient. I think we should simplify/refactor this as well as the for_serialization_query