Open ElusiveTau opened 5 years ago
Hi,

So I understand that you want to extract data over a large temporal coverage, longer than one year. If you just need the original dataset files, FTP is a better protocol to use, but if you want to subset by geographical or variable criteria, motu-client-python is the best way. Keep in mind, though, that the Motu server limits both the size of an extraction result and the number of requests per user. Multi-threading will of course speed up the download.

Which tool do you use to combine NetCDF files? Does it handle NetCDF3 and NetCDF4 on both Windows and Linux?

You can propose a GitHub merge request to integrate your updates, but we will have to validate it to be sure of its robustness. Some datasets have files of more than 1 GB per day, so merging one year of data will create huge files.
What is the extraction result size limit? Also, what is the limit on the number of requests per user? Does the server return a specific error if either limit has been reached?
I'm using some scripts written by a former CLS intern, Juliet Noroit, which use numpy and the netCDF4 module to copy data from a set of .nc files into a new .nc file. I have not tested them against NetCDF3 files on either OS. Which Linux distribution and Windows version do you test against?
That's a good user-input test case to consider. Perhaps I can have the code estimate the resulting file size and issue a warning. The use case could also be extended to download a single multi-GB file in chunks: that way the download is faster, and the user still gets some data even if certain parts fail.
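As a rough sketch of that warning, the uncompressed size of an extraction can be estimated from the grid dimensions before the request is submitted. The 2 GiB threshold and float32 assumption below are my own placeholders, not values taken from the Motu server configuration:

```python
# Illustrative size estimate for an extraction request, so the script can
# warn the user before submitting something the server would reject.
# Assumes float32 values and ignores compression (so it's an upper bound).

WARN_THRESHOLD = 2 * 1024**3  # hypothetical 2 GiB limit, not a documented value


def estimate_size_bytes(n_times, n_lats, n_lons, n_vars, bytes_per_value=4):
    """Upper-bound estimate of the extracted data volume in bytes."""
    return n_times * n_lats * n_lons * n_vars * bytes_per_value


def check_request(n_times, n_lats, n_lons, n_vars):
    """Warn if the estimated extraction size exceeds the threshold."""
    size = estimate_size_bytes(n_times, n_lats, n_lons, n_vars)
    if size > WARN_THRESHOLD:
        print(f"Warning: estimated extraction is {size / 1024**3:.1f} GiB; "
              "consider splitting the request into smaller time chunks.")
    return size
```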
In my use case, I typically need to download a few years' worth of data from a single service. I've written some code around the motu-client to split a user-provided time range into smaller time intervals, to invoke motu-client on each interval, and to combine the resulting NetCDF files back into a single file.
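The splitting step might look roughly like this. The `-t`/`-T` date options follow the motuclient CLI; the 30-day chunk length and the `chunk_NNN.nc` output naming are my own conventions, not part of motu-client:

```python
# Sketch: split a [start, end] time range into fixed-length chunks and
# build one motuclient command line per chunk.
from datetime import datetime, timedelta


def split_range(start, end, chunk=timedelta(days=30)):
    """Yield (chunk_start, chunk_end) pairs covering [start, end]."""
    cur = start
    while cur < end:
        nxt = min(cur + chunk, end)
        yield cur, nxt
        cur = nxt


def build_commands(start, end, base_args):
    """One motuclient argument list per time chunk (not executed here)."""
    cmds = []
    for i, (s, e) in enumerate(split_range(start, end)):
        cmds.append(base_args + [
            "-t", s.strftime("%Y-%m-%d %H:%M:%S"),   # --date-min
            "-T", e.strftime("%Y-%m-%d %H:%M:%S"),   # --date-max
            "-f", f"chunk_{i:03d}.nc",               # hypothetical naming scheme
        ])
    return cmds
```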
I haven't read too deeply into motu-client code but should I attempt to integrate this as a feature?
Currently, if you use motu-client to download data spanning a large time interval, the client may take hours to respond, and it's not guaranteed to successfully download the file. If the download fails, the download attempt (and wait time) is wasted and the user gets nothing!
This feature would allow portions of the dataset to be downloaded. If multithreading were used, a later feature could be added to issue simultaneous download requests to speed things up. I've run simultaneous instances of my script to download data in parallel, so it seems possible to implement.
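A minimal sketch of the parallel variant, assuming each chunk is fetched by shelling out to motuclient with an argument list built in the splitting step. The pool size of 2 is an arbitrary placeholder; since the Motu server limits requests per user, it would need to stay small:

```python
# Sketch: run per-chunk motuclient invocations concurrently and collect
# the chunks that failed, so only those need retrying.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_chunk(cmd):
    """Run one motuclient invocation; return (cmd, returncode)."""
    result = subprocess.run(cmd)
    return cmd, result.returncode


def download_all(commands, runner=download_chunk, max_workers=2):
    """Run each command via `runner`, a few at a time; return failed commands."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(runner, cmd): cmd for cmd in commands}
        for fut in as_completed(futures):
            cmd, rc = fut.result()
            if rc != 0:
                failed.append(cmd)  # only these chunks need a retry
    return failed
```

Because failures are reported per chunk, a failed hour-long download no longer throws away everything: the successful chunks are kept and only the failed intervals are requested again.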