Adding this (HEAD request for metadata) to the regular poll could be useful as well because some web servers give an imprecise size for each file.
I think it should probably run in after_accept, but that might interfere with duplicate suppression, which should be investigated.
Improvements that should be considered:
Rate limiting: this would make one HTTP request per file, so we should make sure rate limiting works.
Compression: the existing plugin needs a small modification, because if the server compresses the file, it will report the compressed size. We need to add a header: headers={'Accept-Encoding': 'identity'} to get the real uncompressed size.
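To make the two points above concrete, here is a rough sketch of a HEAD request that sends Accept-Encoding: identity and pauses between requests. It uses only the Python standard library and is not taken from the existing plugin; the function names, the min_interval parameter, and the sleep-based throttle are all assumptions for illustration, not sarracenia's own rate limiting.

```python
import time
import urllib.request
from email.utils import parsedate_to_datetime


def build_head_request(url):
    """Build a HEAD request that asks the server not to compress the response."""
    return urllib.request.Request(
        url,
        method="HEAD",
        # 'identity' requests no content coding, so Content-Length
        # should reflect the real (uncompressed) file size.
        headers={"Accept-Encoding": "identity"},
    )


def parse_metadata(headers):
    """Extract size in bytes and modification time from response headers."""
    size = headers.get("Content-Length")
    mtime = headers.get("Last-Modified")
    return {
        "size": int(size) if size is not None else None,
        "mtime": parsedate_to_datetime(mtime) if mtime is not None else None,
    }


def fetch_metadata(urls, min_interval=0.5, timeout=10):
    """HEAD each URL in turn, pausing min_interval seconds between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Crude throttle (hypothetical; a real poll would use its own limiter).
            time.sleep(min_interval)
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as resp:
            results[url] = parse_metadata(resp.headers)
    return results
```

Either field comes back as None when the server omits the corresponding header, so the caller can fall back to whatever the directory listing provided.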
This should definitely be something that can be switched on or off easily, because it's not necessary if we get the precise size and date from the directory listing provided by the file server. Maybe it should also be "smart", so that the HEAD request only happens if 1) the size is reported in kB, MB, GB, etc. or 2) there's no size or 3) there's no date.
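A possible shape for the "smart" check, as a standalone sketch. The helper name needs_head_request and its string arguments are hypothetical; how the size and date text would actually be obtained from the poll's directory listing is not shown here. It assumes an exact size is a plain integer byte count and that anything else (a unit suffix, a decimal, a placeholder) is imprecise.

```python
import re

# Assumption: an exact size is a plain integer byte count, e.g. "4096".
_EXACT_SIZE = re.compile(r"\d+")


def needs_head_request(size_text, date_text):
    """Return True if the listing lacks a precise size or a date."""
    if not date_text or not date_text.strip():
        return True  # 3) no date in the listing
    if not size_text or not size_text.strip():
        return True  # 2) no size in the listing
    # 1) size with a unit or decimal, such as "1.2M" or "512K", is imprecise
    return _EXACT_SIZE.fullmatch(size_text.strip()) is None
```

For example, needs_head_request("1.2M", "2024-01-01 12:00") returns True, while needs_head_request("4096", "2024-01-01 12:00") returns False.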
Maybe @mshak2 could help with this one. I'm not sure how difficult it will be.
For reference, the existing plugin does an HTTP HEAD request for each URL in the scheduled flow config to get the exact file size and modification time: https://github.com/MetPX/sarracenia/blob/development/sarracenia/flowcb/scheduled/http_with_metadata.py