OceanParcels / virtualship

Authentic tool and teaching material about sea-based research
https://virtualship.oceanparcels.org/
MIT License

Add `fetch` command to download expedition data based on `area_of_interest` from `schedule.yaml` #83

Open · iuryt opened 2 weeks ago

iuryt commented 2 weeks ago

Based on the discussion from #68

Description

This PR updates the `fetch` command to download data specified by the expedition's `area_of_interest`. Key changes include:

  1. Integration with Schedule File: The `fetch` command now uses `_get_schedule` to load the schedule YAML file from the specified `expedition_dir`. The schedule now includes an `area_of_interest` with spatial and temporal boundaries, which define the download region and timeframe for data subsets.

  2. Unified Data Download Configuration: All datasets, including bathymetry, are now defined in a single `download_dict`, making the code cleaner and allowing datasets to be added or modified without changes to the function logic (see the sketch after this list).

  3. Consistent Use of `area_of_interest`: All dataset downloads use `area_of_interest` for spatial (min/max latitude and longitude) and temporal (start/end time) boundaries, ensuring consistency across all downloaded data.
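For illustration, a minimal sketch of what such a unified configuration could look like, assuming the copernicusmarine Python API's `subset` call; the dataset IDs, the `fetch` signature, and the attribute names on `area_of_interest` are hypothetical, not the PR's actual code:

```python
import copernicusmarine

# Hypothetical unified configuration: one entry per dataset, including bathymetry.
download_dict = {
    "bathymetry": {"dataset_id": "<bathymetry-dataset-id>", "variables": ["deptho"]},
    "physics": {"dataset_id": "<physics-dataset-id>", "variables": ["uo", "vo", "thetao", "so"]},
}

def fetch(area_of_interest, output_dir):
    """Download each dataset, subset to the expedition's area of interest."""
    for name, spec in download_dict.items():
        copernicusmarine.subset(
            dataset_id=spec["dataset_id"],
            variables=spec["variables"],
            minimum_latitude=area_of_interest.minimum_latitude,
            maximum_latitude=area_of_interest.maximum_latitude,
            minimum_longitude=area_of_interest.minimum_longitude,
            maximum_longitude=area_of_interest.maximum_longitude,
            start_datetime=area_of_interest.start_time,
            end_datetime=area_of_interest.end_time,
            output_filename=f"{name}.nc",
            output_directory=output_dir,
        )
```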

Should we delete the whole `scripts/` folder?

Status

This is a draft PR. I have not tested running the `fetch` command with an example `expedition_dir` to ensure data is downloaded correctly based on `area_of_interest`.

iuryt commented 2 weeks ago

@ammedd and @VeckoTheGecko, what do you think of the way I am implementing this? I was also wondering whether `_get_schedule` and `_get_ship_config` should be in `do_expedition` or somewhere else.

VeckoTheGecko commented 2 weeks ago

> I was also wondering whether `_get_schedule` and `_get_ship_config`

I think `schedule.py` and `ship_config.py` would be more logical places to put them.

VeckoTheGecko commented 2 weeks ago

> Should we delete the whole `scripts/` folder?

I think so.


@iuryt, where do you think the data would be downloaded to? I think it would be nice to have it in a subfolder of the mission folder, and also to somehow tie the spatial/temporal ranges to that dataset. Perhaps something like:

```
mission_folder
└── data
    └── vSgsJWk
        ├── a.nc
        ├── b.nc
        ├── c.nc
        ├── d.nc
        └── schedule.yaml
```

where `vSgsJWk` is a hash of the spatial/temporal information. If users change the region/time range, a new folder is generated. `schedule.yaml` is just a copy of the file so that users can know what spatial/temporal settings they used.

Not perfect (i.e., changing any of the domains results in a full data download), but at least it avoids re-downloading already downloaded data and isolates the data files for different areas.
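A minimal sketch of how such a folder name could be derived; the field names on `area_of_interest` and the 7-character digest length are assumptions for illustration:

```python
import hashlib

def domain_hash(area_of_interest) -> str:
    """Short, stable folder name derived from the spatial/temporal domain."""
    key = "|".join(
        str(v)
        for v in (
            area_of_interest.minimum_latitude,
            area_of_interest.maximum_latitude,
            area_of_interest.minimum_longitude,
            area_of_interest.maximum_longitude,
            area_of_interest.start_time,
            area_of_interest.end_time,
        )
    )
    # Same domain -> same digest, so an existing folder means a cache hit.
    return hashlib.sha1(key.encode()).hexdigest()[:7]
```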

iuryt commented 1 week ago

> > Should we delete the whole `scripts/` folder?
>
> I think so.
>
> @iuryt, where do you think the data would be downloaded to? I think it would be nice to have it in a subfolder of the mission folder, and also to somehow tie the spatial/temporal ranges to that dataset. Perhaps something like:
>
> ```
> mission_folder
> └── data
>     └── vSgsJWk
>         ├── a.nc
>         ├── b.nc
>         ├── c.nc
>         ├── d.nc
>         └── schedule.yaml
> ```
>
> where `vSgsJWk` is a hash of the spatial/temporal information. If users change the region/time range, a new folder is generated. `schedule.yaml` is just a copy of the file so that users can know what spatial/temporal settings they used.
>
> Not perfect (i.e., changing any of the domains results in a full data download), but at least it avoids re-downloading already downloaded data and isolates the data files for different areas.

Thank you so much for your suggestions! I’ve been thinking about how to put them into action.

One idea I had is to keep all the data in a single folder and just check the metadata for any changes in the spatial-temporal domain. If the user changes the domain, we can easily overwrite the data. I’m curious, though: why might someone need access to the previous data? To make things easier and skip loading the metadata each time, we could simply check whether the copied `data/schedule.yaml` matches the current `schedule.yaml` domain.
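A minimal sketch of that check, under the assumption of a single `data/` folder; the helper name and the byte-for-byte comparison are illustrative, not a worked-out implementation:

```python
from pathlib import Path

def needs_download(expedition_dir: Path) -> bool:
    """Re-download only if the copied schedule differs from the current one."""
    cached = expedition_dir / "data" / "schedule.yaml"
    current = expedition_dir / "schedule.yaml"
    if not cached.exists():
        return True
    # A real implementation would compare only the area_of_interest section;
    # a byte-for-byte comparison is the simplest approximation.
    return cached.read_bytes() != current.read_bytes()
```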

I really like the concept of using a hash, but I’m a bit concerned it might not be the friendliest option for those who aren’t familiar with it, especially students. It might require some extra explanation during lectures. I’d love to hear what @ammedd and @erikvansebille think about this!

erikvansebille commented 1 week ago

> I really like the concept of using a hash, but I’m a bit concerned it might not be the friendliest option for those who aren’t familiar with it, especially students. It might require some extra explanation during lectures. I’d love to hear what @ammedd and @erikvansebille think about this!

Hi @iuryt, I agree that a hash-based folder name might not be very friendly for students to work with. Perhaps a timestamp-based folder would be friendlier? E.g. `mission_folder/data/download_YYYYMMDD_HHMMSS/*`.

That would also make it easy to locate the latest version of the downloaded data, because the downloads would automatically be sorted.
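In Python that naming scheme is a couple of lines; a sketch only, with the `mission_folder` path as a placeholder:

```python
from datetime import datetime
from pathlib import Path

# e.g. mission_folder/data/download_20241121_153045
download_dir = Path("mission_folder") / "data" / f"download_{datetime.now():%Y%m%d_%H%M%S}"
download_dir.mkdir(parents=True, exist_ok=True)
```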

iuryt commented 1 week ago

> > I really like the concept of using a hash, but I’m a bit concerned it might not be the friendliest option for those who aren’t familiar with it, especially students. It might require some extra explanation during lectures. I’d love to hear what @ammedd and @erikvansebille think about this!
>
> Hi @iuryt, I agree that a hash-based folder name might not be very friendly for students to work with. Perhaps a timestamp-based folder would be friendlier? E.g. `mission_folder/data/download_YYYYMMDD_HHMMSS/*`.
>
> That would also make it easy to locate the latest version of the downloaded data, because the downloads would automatically be sorted.

That makes sense. But do you think it is helpful to keep past downloads?

erikvansebille commented 1 week ago

> But do you think it is helpful to keep past downloads?

Well, we don't want people to have to redownload every time they run the virtual ship, I assume. We could only keep the last download, but then why even have a folder per download (the idea of a hash in the first place)?

iuryt commented 1 week ago

> > But do you think it is helpful to keep past downloads?
>
> Well, we don't want people to have to redownload every time they run the virtual ship, I assume. We could only keep the last download, but then why even have a folder per download (the idea of a hash in the first place)?

I was wondering if we could check for updates to the `schedule.yaml` file compared to the copy in the download folder, and only overwrite the data if the spatial-temporal domain changed. This way, we wouldn’t need separate folders for each download. Let me know your thoughts.

VeckoTheGecko commented 1 week ago

I'm still a fan of hashes, as I think it simplifies the implementation details and adds a bit of futureproofing (e.g., what if we want to "force" new data downloads with a new version of virtualship because we add a new dataset, or because the download command has changed? Doing so with a hash is a change to a single line of code in the hashing function. Also, what if a new version of VS changes the format of `schedule.yaml`?).

The concept of hashing isn't important to students; all they need to know is: "Defining the spatial/temporal domain in `schedule.yaml` will make it so that data is downloaded to a unique folder. If you re-use the same domain, previous downloads will be found." If users want to look at the details of a download, they can look at the `schedule.yaml` inside the folder.

A purely timestamp-based system would require loading each schedule, which can be sidestepped via hashing. We could have the best of both worlds by doing `data/YYYYMMDD_HHMMSS_vSgsJWk`, so it's human readable and functional with the hash; see the sketch below.
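A sketch of that combined naming scheme; the `domain_key` string (a serialisation of the spatial/temporal settings, as in the earlier hashing sketch) and the digest length are illustrative assumptions:

```python
import hashlib
from datetime import datetime

def download_folder_name(domain_key: str) -> str:
    """Timestamp for human readability, hash for cache lookups."""
    digest = hashlib.sha1(domain_key.encode()).hexdigest()[:7]
    return f"{datetime.now():%Y%m%d_%H%M%S}_{digest}"

# e.g. download_folder_name("lat 40..50 | lon -70..-60 | 2023-01 .. 2023-02")
# -> something like "20241121_153045_4f1a9c2"
```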

This is all underpinned by whether re-using the data is important. My conversations with @ammedd, who mentioned that these data downloads are sizeable and take a long time (if someone can advise how long/how large, since I don't know this), lead me to be inclined to save the downloads and re-use them as much as practical, leaving storage-space management to the user. A "one folder per domain" approach takes more space, but allows the user to change the domain as they see fit without worrying about data being cleared out by a separate run they did.

> One idea I had is to keep all the data in a single folder

As in, across different missions the data would also be saved to the same folder (e.g., a folder that users are prompted to specify the first time they run VS)? I initially suggested in https://github.com/OceanParcels/virtualship/pull/83#issuecomment-2478056334 that the data be downloaded into the mission folder so that it is very visible to the user and centralised. I think the mission-folder approach is easier to implement: no need to worry about (a) permissions elsewhere on the file system, (b) users not knowing where the data is stored, or (c) setting/saving application-level configuration. Users would, however, need to manually copy data files between missions if they want to avoid re-downloading. I think that trade-off would be worth it.

Sorry about the wall of text :)

ammedd commented 1 week ago

Wow! So many ideas and so much work going on!

I like saving data to the mission folder, and the `data/YYYYMMDD_vSgsJWk` option (human readable plus hash) sounds like the best of both worlds. Stating the date of the downloaded dataset, not the date of the download ;-)

If I remember correctly, downloads took up to 2 hours each last year, and my 256 GB disk couldn't handle all the data needed for the 7 groups we had. I do think the download method (and speed) has improved since then.

erikvansebille commented 1 week ago

Maybe this is because the PR is still in draft, but when I tried to use it (to download data for a drifter simulation), I got the following error:

```
(parcels) erik ~/Codes/VirtualShip % virtualship fetch TrAtldrifters
Traceback (most recent call last):
  File "/Users/erik/anaconda3/envs/parcels/bin/virtualship", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: fetch() got an unexpected keyword argument 'path'
```

Any idea what's going on?
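For context, one common cause of this click error (an assumption here, since the PR code isn't shown in the thread) is a mismatch between the parameter name declared in the decorator and the callback's signature: click invokes the callback with keyword arguments named after the declared parameters.

```python
import click

# Broken: the decorator declares "path", but the callback does not accept it.
@click.command()
@click.argument("path")        # click will call the callback as fetch(path=...)
def fetch(expedition_dir):     # ...but there is no "path" parameter
    ...
# Running this command raises:
# TypeError: fetch() got an unexpected keyword argument 'path'
# Renaming either side so the names match fixes it.
```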

VeckoTheGecko commented 1 week ago

> > I was also wondering whether `_get_schedule` and `_get_ship_config`
>
> I think `schedule.py` and `ship_config.py` would be more logical places to put them.

On second thought, let's defer this decision. I think splitting these similar functions into different files has its own cost.

VeckoTheGecko commented 1 week ago

@erikvansebille, done with the credential loading, path patch, and other minor review items. I haven't tested beyond that, though.

Should I lay the groundwork for the cache folders with hashing?

iuryt commented 1 week ago

Hi all, sorry for not finishing this. I have a deadline for a paper revision this Friday and will work on this PR right after that.

VeckoTheGecko commented 1 week ago

All good. Let me know if there's anything I can do to help :)

erikvansebille commented 6 days ago

Note that I just got an email from Copernicus Marine Service that they are about to update their copernicusmarine to v2.0.0, which has some breaking changes: https://help.marine.copernicus.eu/en/articles/9978784-what-s-new-in-the-pre-release-of-version-2-0-0-of-the-copernicus-marine-toolbox.

So it might be good to already implement the new API in `virtualship fetch`.

VeckoTheGecko commented 6 days ago

Good to know about v2! I didn't see a timeline for the v2 release, only "soon". I think it would be good to pin `copernicusmarine<2` for the time being so that we aren't coupled to their development, and add v2 support later? That takes the uncertainty out, especially around the holiday season.