ESGF / esgf-download

ESGF data transfer and replication tool
https://esgf.github.io/esgf-download/
BSD 3-Clause "New" or "Revised" License
15 stars 2 forks source link

Greatly improve database performance for `esgpull update` #47

Closed svenrdz closed 3 months ago

svenrdz commented 3 months ago

New

Changed

Example

I created and filled this database in a few minutes with large queries that previously took hours to run, now a good chunk of the time is spent on fetching metadata from index nodes:

$ esgpull show
<853631>
├── distrib:   True 
│   latest:    True 
│   replica:   None 
│   retracted: False
│   table_id:  fx   
└── <1124a9>
    └── distrib:       True 
        latest:        True 
        replica:       None 
        retracted:     False
        experiment_id: dcpp*
<c95ebd>
└── distrib:       True  
    latest:        True  
    replica:       None  
    retracted:     False 
    experiment_id: ssp245
    frequency:     day   
    variant_label: r1i*  
<cc2c09>
├── distrib:       True       
│   latest:        True       
│   replica:       None       
│   retracted:     False      
│   frequency:     day        
│   variable_id:   tas, tasmax
│   variant_label: r1i*       
└── <ef4f6f>
    └── distrib:       True                        
        latest:        True                        
        replica:       None                        
        retracted:     False                       
        experiment_id: ssp245                      
        files:         0 bytes / 338.8 GiB [0/2206]

$ time esgpull update -y
<1124a9> -> 60759 files.
<853631> -> 191146 files.
<c95ebd> -> 126748 files.
<cc2c09> -> 227623 files.
<ef4f6f> -> 6724 files.
613000 files found.
<1124a9> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:32
<853631> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:01:18
<c95ebd> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:48
<cc2c09> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:01:43
<ef4f6f> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
esgpull update -y  419.46s user 11.52s system 93% cpu 7:41.62 total

Running a similar test over a non-empty database (~1.4GB) produces no significant difference:

$ esgpull show cc2c -c
<cc2c09>
├── distrib:       True       
│   latest:        True       
│   replica:       None       
│   retracted:     False      
│   frequency:     day        
│   variable_id:   tas, tasmax
│   variant_label: r1i*       
└── <ef4f6f>
    └── distrib:       True  
        latest:        True  
        replica:       None  
        retracted:     False 
        experiment_id: ssp245

$ time esgpull update cc2c -c -y
<cc2c09> -> 227623 files.
<ef4f6f> -> 6724 files.
234347 files found.
<cc2c09> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:01:49
<ef4f6f> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:01
esgpull update cc2c -c -y  170.65s user 4.87s system 87% cpu 3:21.09 total

Before the current PR, a non-empty database would take longer to update. Multiple reasons made it very inefficient SQL to add a new relation to a query for a file that already had existing relations to other queries. This is now a single insert in all cases, which makes it irrelevant for the database to be empty or not.