MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

do_download port to v3 missing. #437

Closed petersilva closed 2 years ago

petersilva commented 2 years ago

follow pattern of do_poll.

petersilva commented 2 years ago

see #74

petersilva commented 2 years ago

v2 download API:

plugins often had to fake the protocol, because the downloader was doing something strange (for a particular use case), and v2 would check for schemes to decide which do_download methods to register... hmm...

The idea in v3 is that adding protocols moves to the sarracenia.transfer classes, so having the protocol involved here is not helpful. A couple of options... do we do (as a flowcb entry point):

  1. download( self, worklist ): -- accepts a whole list of files selected for downloading, and deals with them as a group. return modified worklist.

  2. download( self, msg, new_path ) -- download a single file, to a given destination? using the whole message?

  3. download( self, url, new_path ) -- no message at all, just download. Cannot change the message to reflect what is actually downloaded (common case: the name is changed.)

  4. download( msg, remote_file, new_inflight_path, remote_offset, local_offset, block_length ) returning len_written -- reflecting the routine API for sarracenia.transfer.get() calls implemented per protocol.
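For side-by-side comparison, the four candidate shapes above can be written down as Python protocol classes. This is only a sketch of the signatures as listed in this comment: the class names are made up for illustration, `self` is added to option 4 on the assumption that it would also be a flowcb method, and the return types of options 2 and 3 are left as `Any` because the comment does not state them.

```python
import inspect
from typing import Any, Protocol


class Option1(Protocol):
    """Whole worklist in, modified worklist out."""

    def download(self, worklist: Any) -> Any: ...


class Option2(Protocol):
    """One message plus a destination; the plugin can update the message."""

    def download(self, msg: dict, new_path: str) -> Any: ...


class Option3(Protocol):
    """No message at all, so nothing about the download can be reported back."""

    def download(self, url: str, new_path: str) -> Any: ...


class Option4(Protocol):
    """Block-level transfer, returning the number of bytes written."""

    def download(self, msg: dict, remote_file: str, new_inflight_path: str,
                 remote_offset: int, local_offset: int,
                 block_length: int) -> int: ...


# Print each option's parameter list to compare how much context a plugin gets.
for proto in (Option1, Option2, Option3, Option4):
    print(proto.__name__, list(inspect.signature(proto.download).parameters))
```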

sample use case is #355, where client wants the data gzipped on the fly and only write the resultant file. hmm...

petersilva commented 2 years ago

considerations... what options does the downloader have to interpret (i.e. which built-in functionality has to be re-implemented)?

with option 1, the entire download function is replaced, and all options need to be re-implemented. downsides:

a) inflight: picking the temporary file name for files that haven't finished yet. The name is assigned automatically with the built-in downloader.

b) overwrite: whether to overwrite a file if it already exists.

c) attempts, and assignment to the "failed" worklist... for retry processing.

d) processing of 'link', 'remove', 'rename' messages, which do not involve file transfers. (only used in mirroring use cases... fairly rarely intersecting with custom download processing.)

e) processing of messages that include the content of the file 'content' tag.

f) checksum calculation as the transfer happens, to ensure the data is the same as advertised.

g) binary accelerators (getAccelerated): use of the wget C-based downloader when advantageous.

h) downloading the data.
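To give a feel for how much a full replacement has to redo, here is a rough standard-library sketch of items a), b), f) and h) only. It is not sarracenia's built-in downloader: `fetch_to_file`, the `.tmp` suffix standing in for the inflight setting, and the choice of SHA-512 are all assumptions made for illustration.

```python
import hashlib
import io
import os
import tempfile


def fetch_to_file(source, final_path, expected_sha512=None,
                  overwrite=True, chunk_size=65536):
    """Copy a readable binary stream to final_path; return bytes written.

    Returns 0 and leaves the existing file alone when overwrite is False.
    Raises ValueError when the checksum does not match the expected value.
    """
    # b) overwrite: leave an existing file untouched if asked to.
    if os.path.exists(final_path) and not overwrite:
        return 0

    # a) inflight: write under a temporary name until the transfer is done.
    inflight_path = final_path + ".tmp"  # hypothetical naming rule
    digest = hashlib.sha512()
    written = 0

    # h) + f): move the data and checksum it in the same pass.
    with open(inflight_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            out.write(chunk)
            written += len(chunk)

    if expected_sha512 is not None and digest.hexdigest() != expected_sha512:
        os.remove(inflight_path)
        raise ValueError("checksum mismatch for " + final_path)

    # Rename into place only after the data has been verified.
    os.replace(inflight_path, final_path)
    return written


# Offline demo: an in-memory stream stands in for the remote file.
demo_data = b"example payload" * 100
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
fetch_to_file(io.BytesIO(demo_data), demo_path,
              expected_sha512=hashlib.sha512(demo_data).hexdigest())
```

Items c), d), e) and g) are not covered here, which is part of the point: option 1 leaves all of that to the plugin author.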

endlisnis commented 2 years ago

> sample use case is #355, where client wants the data gzipped on the fly and only write the resultant file.

Actually, my real desire is to take the raw data from the http(s) transfer, which is probably already gzipped, and just save that as a .gz file. No point in wasting CPU expanding it only to compress it again.
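A minimal sketch of that idea using only the Python standard library, not sarracenia's transfer classes: `urllib.request` does not decompress response bodies, so if the server answers with `Content-Encoding: gzip` the bytes read are still compressed and can be written out as they arrive. The function names and the rule of appending `.gz` are assumptions, and whether a given server compresses at all is up to the server.

```python
import gzip
import io
import os
import shutil
import tempfile
import urllib.request


def save_raw_body(response, dest_path):
    """Write the response body unchanged; return the path actually used."""
    encoding = response.headers.get("Content-Encoding", "")
    # If the body arrived gzipped, keep it that way and name it accordingly.
    if encoding.lower() == "gzip" and not dest_path.endswith(".gz"):
        dest_path += ".gz"
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


def fetch_raw(url, dest_path):
    """Ask for gzip and save whatever bytes the server sends."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        return save_raw_body(response, dest_path)


# Offline demo: a fake response object stands in for a real HTTP reply.
class FakeResponse(io.BytesIO):
    headers = {"Content-Encoding": "gzip"}


demo_out = save_raw_body(FakeResponse(gzip.compress(b"bulletin text")),
                         os.path.join(tempfile.mkdtemp(), "bulletin.txt"))
```

Note that the file name changes in the gzipped case, which is the situation where a downloader needs to update the message.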

petersilva commented 2 years ago

gave an alternative that provides that in the other issue. An interesting thing is that when you fetch something compressed, you can't match the size, since the compressed length is different; maybe something in HTTP tells us it is good. hmm... Anyway, the purpose of this issue is to be able to write custom downloaders, and there are a bunch of different use cases. One is complements for all the polls provided as examples: in flowcb/poll, all three examples (mail.py, nexrad.py and usgs.py) possibly require custom downloads. Another would be "stream", that is, to download into memory and let custom code write to disk, which is how I initially understood the requirement from #355.

petersilva commented 2 years ago

aspects plugin writer would have to take care of:

  1. a) through f),... pretty much has to re-write the whole algorithm and take care of honouring all settings.
  2. b) f) h)
  3. b) f) h)
  4. b) f) h)

now in the use case where you are downloading a stream of data and it results in more than one file being written... only option 1 allows you to modify the worklist, and have a different number of files to post than what was input... which was a major motivation for the API change from v2 to v3... hmm...

endlisnis commented 2 years ago

Your example in #355 is significantly slower because it spends a lot of CPU launching a separate subprocess "wget" to handle the download. I was under the impression that downloads were normally gzipped when they came off of the dd servers, but if that's not the case, then it would make sense to just provide (as you said) a way of handling the download data before it hit the disk, or maybe just a plugin for actually saving the file to disk.

petersilva commented 2 years ago

Probably going with 1... first shot, will try to re-arrange code so that the messages have the fields for a) added before invocation, to save plugin authors a little work.

petersilva commented 2 years ago

otoh... the send() plugin entry point matches 2. and 3. It is a little odd for download and send to be different.

petersilva commented 2 years ago

the implementation addresses options 2, 3, and 4 by setting the new_ (including inflight) values in the msg fields and passing the msg. It occurs to me that for option 1, we just use after_accept to write a downloader, with download = False.

so all cases are covered.
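A rough sketch of the option 1 route, assuming the built-in download is turned off in the configuration and an after_accept callback does the work on the whole worklist. To keep it runnable without sarracenia installed, a `SimpleNamespace` stands in for the worklist and plain dicts stand in for messages. The list names (`incoming`, `ok`, `failed`), the message keys (`baseUrl`, `relPath`, `new_dir`, `new_file`) and the injected `fetch` callable are assumptions to check against the flowcb documentation, as is which list the real flow expects successful messages to end up in.

```python
import os
import tempfile
from types import SimpleNamespace


class WorklistDownloader:
    """Option 1 style: handle every selected message in after_accept."""

    def __init__(self, fetch):
        # fetch is a hypothetical callable(url) -> bytes, injected for testing.
        self.fetch = fetch

    def after_accept(self, worklist):
        for msg in worklist.incoming:
            url = msg["baseUrl"] + msg["relPath"]
            try:
                data = self.fetch(url)
                os.makedirs(msg["new_dir"], exist_ok=True)
                path = os.path.join(msg["new_dir"], msg["new_file"])
                with open(path, "wb") as out:
                    out.write(data)
                worklist.ok.append(msg)
            except Exception:
                # Leave failures on the failed list so they can be retried.
                worklist.failed.append(msg)
        # Everything has been dealt with, one way or the other.
        worklist.incoming = []


# Offline demo with a fake fetcher in place of a real transfer.
demo_dir = tempfile.mkdtemp()
demo_worklist = SimpleNamespace(
    incoming=[{"baseUrl": "http://host/", "relPath": "a.txt",
               "new_dir": demo_dir, "new_file": "a.txt"}],
    ok=[], failed=[])
WorklistDownloader(lambda url: url.encode()).after_accept(demo_worklist)
```

Because the callback owns the whole list, it could also append extra messages when one download produces several files, which options 2 to 4 cannot do.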