We think the right approach is to collect the file size data on the leader and then delegate two types of jobs: a batch job that imports all files below a certain size threshold, and a dedicated job for every file whose size is larger than the threshold.
The import job DAG will be constructed on the leader after it has collected all of the file sizes.
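As a rough illustration, here is a minimal sketch of how the leader might partition files once all sizes are known; `ImportedFile`, `size_threshold`, and `plan_import_jobs` are hypothetical names for this sketch, not existing Toil code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImportedFile:
    url: str
    size: int  # bytes, collected on the leader


def plan_import_jobs(
    files: List[ImportedFile], size_threshold: int
) -> Tuple[List[ImportedFile], List[ImportedFile]]:
    """Split files into one batch (everything at or below the threshold) and a
    list of files that each get their own import job."""
    batch = [f for f in files if f.size <= size_threshold]
    individual = [f for f in files if f.size > size_threshold]
    return batch, individual
```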
This should enable maximum parallelization where possible. The threshold will be user-configurable and will supersede the --importWorkersDisk argument.
There will also likely be changes to the internal import_file interface (or something related) so that importing files via streaming can be enforced, e.g. a direct download versus copying the file into a temporary directory first.
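For illustration, a minimal sketch of what an import entry point with explicit streaming control could look like; the `stream` keyword and the `import_url` helper are hypothetical, not the current interface, and `job_store.import_file` is only assumed here to accept a bare URI and to read it without needing a local copy.

```python
import os
import shutil
import tempfile
import urllib.request


def import_url(job_store, src_uri: str, stream: bool = True):
    """Import src_uri into the job store, optionally forcing a streaming import.

    stream=True:  hand the URI straight to the job store so it can read the
                  source directly (assumed to need little or no local disk).
    stream=False: download into a temporary file first, then import the copy
                  (needs local disk at least as large as the file).
    """
    if stream:
        return job_store.import_file(src_uri)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        with urllib.request.urlopen(src_uri) as response:
            shutil.copyfileobj(response, tmp)
    try:
        return job_store.import_file("file://" + tmp.name)
    finally:
        os.unlink(tmp.name)
```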
We may not support files whose size we can't determine before downloading; this may include FTP and old HTTP servers. Alternatively, we can implement a fallback for files whose sizes we can't sniff out. Since CWL/WDL don't currently seem to support FTP, this won't be a priority. (Though MiniWDL does support FTP, so maybe we should raise the priority for behavioral parity?)
There also seems to be some inefficiency in the get_filesize function; apparently, instead of issuing a HEAD request, we issue a plain GET request for the whole file, take the size information we need, and then throw away/deny the rest of the download. Ideally we should just send a HEAD request.
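For example, sizing an HTTP(S) file with a HEAD request could look roughly like the sketch below; `get_http_filesize` is an illustrative name rather than Toil's actual get_filesize, and it returns None when the server doesn't report a Content-Length.

```python
import urllib.request
from typing import Optional


def get_http_filesize(url: str, timeout: float = 30.0) -> Optional[int]:
    """Ask the server for the file's size without downloading the body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
        return int(length) if length is not None else None
```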
This issue is tangentially related to #5113 and #5025.
An import job will first attempt to import via streaming. When that fails, a child job will be made to handle the import, with the child job's disk requirement set to the associated file size. (How exactly will this work for the batch import? I think the child job will consist of either the remaining batch or just the one failed file. If it's just one failed file, won't we run into the same issue we tried to avoid with the batch import, where all small files get their own inefficient job? Especially when the file is local, where we will just try a symlink.)
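A minimal sketch of that try-streaming-then-child pattern for a single file, assuming the hypothetical `import_url` helper above; `addChildJobFn` and the `disk` requirement keyword are existing Toil Job APIs, but the surrounding names, and reaching the job store via `job.fileStore.jobStore`, are assumptions for this sketch.

```python
from toil.job import Job


def stream_import_job(job, src_uri: str, filesize: int):
    """Try a streaming import; on failure, retry in a child job sized to the file."""
    try:
        # Access to the job store through job.fileStore.jobStore is assumed here.
        return import_url(job.fileStore.jobStore, src_uri, stream=True)
    except Exception:
        # Streaming failed; give the retry enough disk for a full local copy.
        return job.addChildJobFn(copy_import_job, src_uri, disk=filesize).rv()


def copy_import_job(job, src_uri: str):
    return import_url(job.fileStore.jobStore, src_uri, stream=False)


# Usage sketch: the streaming attempt itself only needs a small disk allowance.
# root = Job.wrapJobFn(stream_import_job, url, size, disk="1G")
```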
If the import ultimately fails, we might just bail until we figure out a fallback.
When we can't get the size of a file, maybe we should just import it on the leader and warn the user that their file's size could not be determined.
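A minimal sketch of that leader-side fallback; the helper names and the logging call are assumptions, not existing code.

```python
import logging

logger = logging.getLogger(__name__)


def import_with_size_fallback(src_uri, size, import_on_worker, import_on_leader):
    """Send a file to a worker import when its size is known, else import on the leader."""
    if size is None:
        logger.warning(
            "Could not determine the size of %s; importing it on the leader instead.",
            src_uri,
        )
        return import_on_leader(src_uri)
    return import_on_worker(src_uri, size)
```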
When --importWorkersDisk is not specified for WDL and CWL, there should be some sort of file-sniffing functionality to detect the minimum amount of disk space the importing job needs to run. This probably can't take into account whether a file is streamable or not. We can probably try to stream the files in first with a small disk requirement, and then, when that doesn't work, start a child job to run the imports without streaming but with more disk space. This will likely require retooling the import interface so that file streaming can be controlled there.
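A minimal sketch of how the disk requirement could be chosen when --importWorkersDisk is not given; the function name and the 1 GiB floor are assumptions, not settled behavior: start small for the streaming attempt and fall back to the largest file's size for the non-streaming retry.

```python
def pick_import_disk(filesizes, user_disk=None, streaming=True, floor=1 << 30):
    """Return a disk requirement in bytes for an import job."""
    if user_disk is not None:
        # An explicit --importWorkersDisk value always wins.
        return user_disk
    if streaming:
        # A streaming import needs essentially no scratch space beyond a small floor.
        return floor
    # A non-streaming import needs room for a full local copy of the largest file.
    return max(max(filesizes, default=0), floor)
```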
Issue is synchronized with Jira story TOIL-1655.