City-of-Bloomington / drupal-customizations

Drupal version management using composer
https://bloomington.in.gov
GNU General Public License v2.0

Export all uploaded files for ingesting into Document Management #35

Open inghamn opened 4 years ago

inghamn commented 4 years ago

We are installing the document management system, OnBase. Instead of staff just uploading files to Drupal, we want to host the files in OnBase, and only link to the files from Drupal.

We need a way to migrate all the current media out of Drupal and ingest it into OnBase.

inghamn commented 4 years ago

We can probably use file_managed.filesize to identify duplicates. For example, this query lists every cover image whose file is exactly 595572 bytes:

select i.entity_id, i.field_image_width, i.field_image_height,
       f.uuid, f.filename, f.uri, f.filemime, f.filesize,
       f.created, f.changed
from media__field_image i
join file_managed       f on f.fid=i.field_image_target_id
where i.bundle='cover_image'
  and f.filesize=595572;

inghamn commented 4 years ago

We should probably do some deduplication during the export. If the files have the same name and the same filesize, they are most likely the same file. We could grow a lookup hash and skip files that have already been exported.
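
A rough sketch of that lookup-hash idea, in Python purely for illustration (the actual export tooling is not decided in this thread). It assumes rows have already been read from file_managed. The names `FileRecord` and `unique_files`, and the sample rows, are made up for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class FileRecord(NamedTuple):
    """One row from file_managed (only the columns used here)."""
    filename: str
    uri: str
    filesize: int


def unique_files(records: Iterable[FileRecord]) -> Iterator[FileRecord]:
    """Yield each record whose (filename, filesize) pair has not been seen yet."""
    seen = set()  # the lookup hash; grows as the export proceeds
    for record in records:
        key = (record.filename, record.filesize)
        if key in seen:
            continue  # same name and size: assume it was already exported
        seen.add(key)
        yield record


# Hypothetical sample rows, not real data from the site.
rows = [
    FileRecord("map.pdf", "public://2019/map.pdf", 595572),
    FileRecord("map.pdf", "public://2020/map.pdf", 595572),
    FileRecord("map.pdf", "public://2021/map.pdf", 120000),
]

for record in unique_files(rows):
    print(record.uri)
```

Only the first copy of each name and size pair gets through, so the second map.pdf above is skipped while the third (different size) is kept. Keying on filesize alone would just mean dropping `record.filename` from `key`.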

inghamn commented 4 years ago

I reviewed the roughly 1,000 files that would be considered duplicates based on filesize. It looks safe to use filesize alone to identify unique files. I don't think we need to hash the file contents themselves.

select x.filesize, f.filename, f.uri
from (select filesize, count(*) as c
      from file_managed
      left join media__field_image on fid=field_image_target_id
      where field_image_target_id is null
      group by filesize having c>1) x
join file_managed f on x.filesize=f.filesize
order by x.filesize, f.filename;