MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

Should we do HTTP head request for regular polls? #1131

Open reidsunderland opened 4 months ago

reidsunderland commented 4 months ago

The http_with_metadata plugin does an HTTP HEAD request for each URL in the scheduled flow config to get the exact file size and modification time: https://github.com/MetPX/sarracenia/blob/development/sarracenia/flowcb/scheduled/http_with_metadata.py
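
For illustration only, here is a minimal sketch of fetching size and modification time with a HEAD request, using just the Python standard library. It is not the plugin's actual code: the function names `parse_metadata` and `head_metadata` and the 30-second timeout are assumptions made for this example. It also assumes the server sends `Content-Length` and `Last-Modified` headers, which is not guaranteed.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_metadata(headers):
    """Return (size_in_bytes, mtime) from HTTP headers; either may be None.

    Hypothetical helper: `headers` is any mapping-like object with .get().
    """
    size = headers.get("Content-Length")
    mtime = headers.get("Last-Modified")
    return (
        int(size) if size is not None else None,
        # Last-Modified uses the HTTP date format, e.g.
        # "Wed, 21 Oct 2015 07:28:00 GMT"
        parsedate_to_datetime(mtime) if mtime is not None else None,
    )


def head_metadata(url, timeout=30):
    """Send a HEAD request (no body is downloaded) and parse the headers."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_metadata(resp.headers)
```

Keeping the header parsing separate from the network call means the parsing can be checked without a live server, and either value can come back as `None` when the server omits the header.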

Adding this (HEAD request for metadata) to the regular poll could be useful as well because some web servers give an imprecise size for each file.

I think it should probably run in after_accept, but that might cause issues with duplicate suppression, which should be investigated.

Improvements that should be considered:

This should definitely be something that can be switched on or off easily, because it's not necessary when the directory listing provided by the file server already gives the precise size and date. Maybe it should also be "smart", where the HEAD request only happens if 1) the size is reported in kB, MB, GB, etc., 2) there's no size, or 3) there's no date.
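
A rough sketch of what the "smart" check could look like, assuming the poll has the raw size and date strings from the directory listing available. The function name `needs_head_request`, its parameters, and the treatment of `-` as "no size" are all hypothetical, not existing sarracenia API.

```python
import re

# A size is treated as exact only if it is a plain integer byte count.
_EXACT_BYTES = re.compile(r"\d+")


def needs_head_request(size_text, date_text):
    """Decide whether a HEAD request is worth making for one listing entry.

    Hypothetical helper: both arguments are the raw strings scraped from
    the directory listing, or None/empty when the field is absent.
    """
    size = (size_text or "").strip()
    date = (date_text or "").strip()

    if not size or size == "-":
        return True   # case 2: no size in the listing
    if not date:
        return True   # case 3: no date in the listing
    if not _EXACT_BYTES.fullmatch(size):
        return True   # case 1: rounded size such as "1.2M" or "45 kB"
    return False      # listing already gives an exact size and a date
```

For example, `needs_head_request("1.2M", "2024-01-01 10:00")` returns `True`, while `needs_head_request("12345", "2024-01-01 10:00")` returns `False`. The on/off switch would sit in front of this check, so the check is never evaluated when the feature is disabled.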

Maybe @mshak2 could help with this one. I'm not sure how difficult it will be.

andreleblanc11 commented 3 months ago

Related to your first improvement: https://github.com/MetPX/sarracenia/issues/935

petersilva commented 3 months ago

Also related, for solving the problem downstream if need be: #1157