datagovuk / ckanext-dgu

CKAN extension for data.gov.uk
http://data.gov.uk/

Limit disk space used by archiver #420

Closed davidread closed 8 years ago

davidread commented 8 years ago

Cleared 15GB with the first-gen tool. I expected more, but the size-report figures appear to include archivals of deleted datasets.

(ckan)co@prod3 /vagrant/src/ckanext-archiver (master) $ paster --plugin=ckanext-archiver archiver size-report
2016-05-11 16:59:03,289 DEBUG [ckanext.spatial.model.package_extent] Spatial tables defined in memory
2016-05-11 16:59:03,295 DEBUG [ckanext.spatial.model.package_extent] Spatial tables already exist
2016-05-11 16:59:03,312 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2016-05-11 16:59:03,315 DEBUG [ckanext.harvest.model] Harvest tables already exist
      file size no. files  files size (bytes)
          <1 KB    12,089           5,419,574
        1-10 KB    48,030         216,584,887
      10-100 KB   103,312       3,493,454,301
  100 KB - 1 MB    31,010      10,183,301,959
        1-10 MB    13,541      47,547,908,691
      10-100 MB     3,976     133,207,131,860
  100 MB - 1 GB       608     198,560,495,029
        1-10 GB       122     233,024,877,438
      10-100 GB         0                   0
        >100 GB         0                   0
Totals: 212,688 626,239,173,739

(ckan)co@prod3 /vagrant/src/ckanext-archiver (master) $ df -h /media/hulk/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1       459G  459G   29M 100% /media/hulk

$ sudo -u www-data /home/co/ckan/bin/paster --plugin=ckanext-archiver archiver delete-files-larger-than-max -c /var/ckan/ckan.ini
2016-05-11 17:02:59,189 DEBUG [ckanext.spatial.model.package_extent] Spatial tables defined in memory
2016-05-11 17:02:59,195 DEBUG [ckanext.spatial.model.package_extent] Spatial tables already exist
2016-05-11 17:02:59,211 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2016-05-11 17:02:59,215 DEBUG [ckanext.harvest.model] Harvest tables already exist
38 archivals above the 2,000,000,000 threshold with total size 114,059,599,346
Deleting <Archival Downloaded OK /dataset/national-statistics-postcode-lookup-uk/resource/1cd39f21-53fb-4ac3-84ee-05ba5ca2c933 >
..deleted /media/hulk/ckan_resource_cache/1c/1cd39f21-53fb-4ac3-84ee-05ba5ca2c933/rows.xml
Deleting <Archival Downloaded OK /dataset/land-registry-monthly-price-paid-data/resource/f663ca39-6df7-41c4-8b31-6f3e49f87510 >
..deleted /media/hulk/ckan_resource_cache/f6/f663ca39-6df7-41c4-8b31-6f3e49f87510/pp-complete.txt
Deleting <Archival Downloaded OK /dataset/osni-open-data-river-basin-lidar-2009-dtms-and-dsms/resource/90955125-d215-45d8-80c2-c4a7eb8df40e >
..deleted /media/hulk/ckan_resource_cache/90/90955125-d215-45d8-80c2-c4a7eb8df40e/Ballymena_23_04_2009.zip
Deleting <Archival Downloaded OK /dataset/osni-open-data-river-basin-lidar-2009-dtms-and-dsms/resource/9746151e-9b2a-4cfb-b2d5-7ea9c8fcabd9 >
..deleted /media/hulk/ckan_resource_cache/97/9746151e-9b2a-4cfb-b2d5-7ea9c8fcabd9/Londonderry_30_04_2009.zip
Deleting <Archival Not sure if broken /dataset/land-registry-monthly-price-paid-data/resource/17106445-eeaa-464e-aca8-67fb222a0798 >
ERROR deleting /mnt/shared/ckan_resource_cache/17/17106445-eeaa-464e-aca8-67fb222a0798/PPMS_Mar_2012_ew_with-columns.csv
Deleting <Archival Downloaded OK /dataset/land-registry-monthly-price-paid-data/resource/a8ab99a5-3d51-4f28-bd64-f999e752e57b >
..deleted /media/hulk/ckan_resource_cache/a8/a8ab99a5-3d51-4f28-bd64-f999e752e57b/pp-complete.csv

(ckan)co@prod3 /vagrant/src/ckanext-archiver (master) $ paster --plugin=ckanext-archiver archiver size-report
2016-05-11 17:05:31,001 DEBUG [ckanext.spatial.model.package_extent] Spatial tables defined in memory
2016-05-11 17:05:31,007 DEBUG [ckanext.spatial.model.package_extent] Spatial tables already exist
2016-05-11 17:05:31,024 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2016-05-11 17:05:31,027 DEBUG [ckanext.harvest.model] Harvest tables already exist
      file size no. files  files size (bytes)
          <1 KB    12,089           5,419,574
        1-10 KB    48,030         216,584,887
      10-100 KB   103,315       3,493,593,852
  100 KB - 1 MB    31,011      10,183,406,407
        1-10 MB    13,541      47,547,908,691
      10-100 MB     3,976     133,207,131,860
  100 MB - 1 GB       608     198,560,495,029
        1-10 GB       122     233,024,877,438
      10-100 GB         0                   0
        >100 GB         0                   0
Totals: 212,692 626,239,417,738
(ckan)co@prod3 /vagrant/src/ckanext-archiver (master) $ df -h /media/hulk/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1       459G  444G   15G  97% /media/hulk
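
For anyone wanting to check the cache directory independently of the archiver, here is a minimal dry-run sketch that lists files above a size threshold and prints a summary line in the same style as the tool's. It is not the archiver's implementation: judging by the log above, the real command works from Archival records, whereas this sketch simply walks the filesystem. The function names `find_files_larger_than` and `summarise` are hypothetical.

```python
import os


def find_files_larger_than(root, max_bytes):
    """Return (path, size) pairs for files under root larger than max_bytes.

    Assumption: walks the filesystem directly rather than reading
    Archival records from the database.
    """
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size > max_bytes:
                oversized.append((path, size))
    return oversized


def summarise(oversized, max_bytes):
    """Build a one-line summary of the oversized files (nothing is deleted)."""
    total = sum(size for _path, size in oversized)
    return "{:,} files above the {:,} threshold with total size {:,}".format(
        len(oversized), max_bytes, total
    )
```

Calling `summarise(find_files_larger_than("/media/hulk/ckan_resource_cache", 2000000000), 2000000000)` would give a figure to compare against the tool's own count before deleting anything.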

davidread commented 8 years ago

Improved version of the stats after running the tool (it now excludes deleted resources/packages, so the figures are much more accurate):

      file size no. files  files size (bytes)
          <1 KB     6,348           3,179,212
        1-10 KB    21,193          91,875,918
      10-100 KB    52,340       1,808,154,330
  100 KB - 1 MB    14,078       4,502,765,524
        1-10 MB     5,597      19,602,547,888
      10-100 MB     1,721      71,469,455,696
  100 MB - 1 GB       426     115,872,674,171
        1-10 GB        88     133,191,097,679
      10-100 GB         0                   0
        >100 GB         0                   0
Totals: 101,791 346,541,750,418

(ckan)co@prod3 ~ () $ df -h /media/hulk/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1       459G  444G   15G  97% /media/hulk
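
A rough sketch of how a bucketed report like the one above could be computed while skipping deleted items. This is not the archiver's code. The input shape (`(size_bytes, is_deleted)` pairs), the function names, and the decimal bucket boundaries (1 KB = 1,000 bytes) are all assumptions made for illustration.

```python
import bisect

LABELS = [
    "<1 KB", "1-10 KB", "10-100 KB", "100 KB - 1 MB", "1-10 MB",
    "10-100 MB", "100 MB - 1 GB", "1-10 GB", "10-100 GB", ">100 GB",
]
# Assumption: decimal units, so the boundaries are 10^3 .. 10^11 bytes.
UPPER_BOUNDS = [10 ** n for n in range(3, 12)]


def size_report(records):
    """Count files and total bytes per size bucket.

    records: iterable of (size_bytes, is_deleted) pairs (hypothetical shape).
    Deleted items are skipped so they do not inflate the figures.
    """
    counts = [0] * len(LABELS)
    totals = [0] * len(LABELS)
    for size, is_deleted in records:
        if is_deleted:
            continue
        # A size exactly on a boundary falls into the higher bucket.
        i = bisect.bisect_right(UPPER_BOUNDS, size)
        counts[i] += 1
        totals[i] += size
    return list(zip(LABELS, counts, totals))


def format_report(rows):
    """Render the rows as a right-aligned text table with a totals line."""
    lines = ["{:>15} {:>9} {:>19}".format(
        "file size", "no. files", "files size (bytes)")]
    for label, count, total in rows:
        lines.append("{:>15} {:>9,} {:>19,}".format(label, count, total))
    lines.append("Totals: {:,} {:,}".format(
        sum(r[1] for r in rows), sum(r[2] for r in rows)))
    return "\n".join(lines)
```

Filtering on the deleted flag before bucketing is the point of the sketch: it is what separates the earlier inflated totals from the improved figures.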

davidread commented 8 years ago

Changed the max size from 2GB to 1GB, which increased free disk space from 3% to 27%, giving a greater margin.

(ckan)co@prod3 /vagrant/src/ckanext-archiver (master) $ paster --plugin=ckanext-archiver archiver size-report
2016-05-11 17:37:04,153 DEBUG [ckanext.spatial.model.package_extent] Spatial tables defined in memory
2016-05-11 17:37:04,159 DEBUG [ckanext.spatial.model.package_extent] Spatial tables already exist
2016-05-11 17:37:04,176 DEBUG [ckanext.harvest.model] Harvest tables defined in memory
2016-05-11 17:37:04,179 DEBUG [ckanext.harvest.model] Harvest tables already exist
      file size no. files  files size (bytes)
          <1 KB     6,399           3,199,988
        1-10 KB    21,708          96,046,298
      10-100 KB    52,670       1,822,751,526
  100 KB - 1 MB    14,179       4,557,149,644
        1-10 MB     5,631      19,744,994,609
      10-100 MB     1,753      72,283,320,289
  100 MB - 1 GB       423     113,423,795,153
        1-10 GB         0                   0
      10-100 GB         0                   0
        >100 GB         0                   0
Totals: 102,763 211,931,257,507
(ckan)co@prod3 /vagrant/src/ckanext-archiver (master) $ df -h /media/hulk/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1       459G  333G  127G  73% /media/hulk
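
The 3% and 27% figures are simply 100 minus the `Use%` column from `df`. A small sketch of that arithmetic, plus a way to read the free percentage directly from Python 3 (assumed here; `shutil.disk_usage` needs Python 3.3+). The function names are hypothetical, and the round-up behaviour is my understanding of how GNU `df` reports `Use%`.

```python
import math
import shutil


def use_percent(used, avail):
    """df-style Use%: used / (used + avail), rounded up (assumed GNU df behaviour)."""
    return math.ceil(100.0 * used / (used + avail))


def free_percent_of(path):
    """Free space on the filesystem containing path, as a percentage of its total size."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


# Figures from the df output in this thread, in GB.
before = use_percent(444, 15)   # 97, so about 3% free
after = use_percent(333, 127)   # 73, so about 27% free
```

Something like `free_percent_of("/media/hulk")` could be run from cron to warn before the disk fills up again.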

pigspamster commented 8 years ago

To discuss in sprint planning.