Open iuryt opened 2 weeks ago
@ammedd and @VeckoTheGecko What do you think of this way I am implementing this? I was also wondering whether _get_schedule and _get_ship_config should be in do_expedition or somewhere else.
I was also wondering whether _get_schedule and _get_ship_config
I think schedule.py and ship_config.py would be more logical places to put them.
Should we delete the whole scripts/ folder?
I think so.
@iuryt where do you think the data would be downloaded to? I think it would be nice to have it in a subfolder in the mission folder, and also somehow tie the spatial/temporal ranges to that dataset. Perhaps something like:
mission_folder
└── data
    └── vSgsJWk
        ├── a.nc
        ├── b.nc
        ├── c.nc
        ├── d.nc
        └── schedule.yaml
where vSgsJWk is a hash of the spatial/temporal information. If users change the region/time range it would generate a new folder. schedule.yaml is just a copy of the file so that users can know what spatial/temporal settings they had.
Not perfect (i.e., changing any of the domains results in a full data download) but at least it avoids re-downloading already downloaded data, and isolates the data files for different areas.
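A minimal sketch of what I mean, assuming the hash is computed from whatever text describes the domain (all names here are illustrative, not actual virtualship code):

```python
import hashlib
import shutil
from pathlib import Path

def data_dir_for_domain(mission_dir: Path, domain_text: str) -> Path:
    """Return data/<hash-of-domain>/, creating it and copying schedule.yaml on first use."""
    digest = hashlib.sha1(domain_text.encode()).hexdigest()[:7]  # short hash of the domain
    folder = mission_dir / "data" / digest
    if not folder.exists():
        folder.mkdir(parents=True)
        # keep a copy so users can see which spatial/temporal settings produced this data
        shutil.copy(mission_dir / "schedule.yaml", folder / "schedule.yaml")
    return folder
```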
Thank you so much for your suggestions! I’ve been thinking about how to put them into action.
One idea I had is to keep all the data in a single folder and just check the metadata for any changes in the spatial-temporal domain. If the user changes the domain, we can easily overwrite the data. I'm curious, though: why might someone need access to the previous data? To make things easier and skip loading the metadata each time, we could simply check if the copied data/schedule.yaml matches the current schedule.yaml domain.
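As a rough sketch of that check, assuming the domain sits under an area_of_interest key in schedule.yaml (the key name and layout are just a guess):

```python
from pathlib import Path
import yaml

def domain_changed(mission_dir: Path) -> bool:
    """True if the domain in schedule.yaml differs from the copy stored next to the data."""
    cached = mission_dir / "data" / "schedule.yaml"
    if not cached.exists():
        return True  # nothing has been downloaded yet
    current_domain = yaml.safe_load((mission_dir / "schedule.yaml").read_text()).get("area_of_interest")
    cached_domain = yaml.safe_load(cached.read_text()).get("area_of_interest")
    return current_domain != cached_domain
```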
I really like the concept of using a hash, but I’m a bit concerned it might not be the friendliest option for those who aren’t familiar with it, especially students. It might require some extra explanation during lectures. I’d love to hear what @ammedd and @erikvansebille think about this!
Hi @iuryt, I agree that a hash-based folder name might not be very friendly to work with for students. Perhaps a timestamp-based folder would be friendlier? E.g. mission_folder/data/download_YYYYMMDD_HHMMSS/*. That would also help easily locate the latest version of the downloaded data because the downloads would automatically be sorted.
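Just as a sketch of the naming, nothing more (the helper name is made up):

```python
from datetime import datetime
from pathlib import Path

def new_download_dir(mission_dir: Path) -> Path:
    """Create mission_folder/data/download_YYYYMMDD_HHMMSS/ for this download."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    folder = mission_dir / "data" / f"download_{stamp}"
    folder.mkdir(parents=True, exist_ok=True)
    return folder

# The latest download is then simply the last entry when sorting the folder names.
```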
That makes sense. But do you think it is helpful to keep past downloads?
Well, we don't want people to have to redownload every time they run the virtual ship, I assume. We could only keep the last download, but then why even have a folder per download (the idea of a hash in the first place)?
I was wondering if we could check for updates to the schedule.yaml file compared to the copy in the download folder. We could only overwrite it if there were changes in the spatial-temporal domain. This way, we wouldn't need separate folders for each download. Let me know your thoughts.
I'm still a fan of hashes as I think it simplifies the implementation details and adds a bit of futureproofing (i.e., what if we want to "force" new data downloads with a new version of virtualship because we add a new dataset, or the download command has changed? Doing so with a hash is a change to a single line of code in the hashing function. Also, what if a new version of VS changes the format of schedule.yaml?). The concept of hashing isn't important to students; all they need to know is: "Defining the spatial/temporal domain in schedule.yaml will make it so that data is downloaded to a unique folder. If you re-use the same domain, previous downloads will be found." If users want to look at the details of the download, they can look into the folder at schedule.yaml. Doing a purely timestamp-based system would require loading each schedule, which can be sidestepped via hashing. We could have the best of both worlds by doing data/YYYYMMDD_HHMMSS_vSgsJWk so it's human readable and functional with the hash.
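A rough sketch of deriving such a name, assuming the domain is available as a dict parsed from schedule.yaml (the canonicalisation below is illustrative):

```python
import hashlib
import json
from datetime import datetime

def download_folder_name(area_of_interest: dict) -> str:
    """Return a 'YYYYMMDD_HHMMSS_<shorthash>' name for the download folder."""
    # Canonicalise the domain so the same region/time range always hashes the same way
    # (default=str covers datetime values).
    canonical = json.dumps(area_of_interest, sort_keys=True, default=str)
    digest = hashlib.sha1(canonical.encode()).hexdigest()[:7]
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{digest}"
```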
This is all underpinned by whether re-using the data is important. My conversations with @ammedd, mentioning that these data downloads are sizeable and take a long time (if someone can advise how long/how large, since I don't know this), lead me to be inclined to save the downloads and re-use them as much as practical, and leave managing the storage space to the user. A "one folder per domain" approach takes more space, but allows the user to change the domain as they see fit without worrying about it being cleared out from a separate run they did.
One idea I had is to keep all the data in a single folder
As in, across different missions the data would also be saved to the same folder? (e.g., a folder that they're prompted to specify the first time they run VS?) I initially suggested in https://github.com/OceanParcels/virtualship/pull/83#issuecomment-2478056334 downloading the data into the mission folder so that it is very visible to the user and centralised. I think the mission folder approach is easier to implement (no need to worry about (a) permissions elsewhere on the file system, (b) users not knowing where the data is stored, (c) setting/saving application-level configuration), although users would need to manually copy data files between missions if they want to avoid doing a download. I think that trade-off would be worth it.
Sorry about the wall of text :)
Wow! So many ideas/much work going on!
I like saving data to the mission folder, and the data/YYYYMMDD_vSgsJWk human-readable-plus-hash option sounds like the best of both worlds. Stating the date of the downloaded dataset, not the date of the download ;-)
If I remember correctly, downloads took up to 2 hours each last year, and my 256 GB disk couldn't handle all the data needed for the 7 groups we had. I do think the download method (and speed) has improved since then.
Maybe this is because the PR is still in draft, but when I tried to use it (to download data for a drifter simulation), I got the following error
(parcels) erik ~/Codes/VirtualShip % virtualship fetch TrAtldrifters
Traceback (most recent call last):
File "/Users/erik/anaconda3/envs/parcels/bin/virtualship", line 8, in <module>
sys.exit(cli())
^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/erik/anaconda3/envs/parcels/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: fetch() got an unexpected keyword argument 'path'
Any idea what's going on?
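For reference, one generic way a TypeError like this can arise with click (not necessarily the cause here) is a declared argument whose name doesn't match the callback's parameter name:

```python
import click

@click.command()
@click.argument("path")          # click passes this to the callback as path=...
def fetch(expedition_dir):       # ...but the callback only accepts expedition_dir
    """Hypothetical command showing the name mismatch."""

if __name__ == "__main__":
    # Running this with a path argument raises:
    # TypeError: fetch() got an unexpected keyword argument 'path'
    fetch()
```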
On second thought, I think let's defer the decision about moving _get_schedule and _get_ship_config into schedule.py and ship_config.py. Splitting these similar functions into different files has its own cost.
@erikvansebille done with the credential loading, path patch, and other minor review items. I haven't tested beyond that though.
Should I lay the groundwork for the cache folders with hashing?
Hi all, sorry for not finishing this. I have a deadline for a paper revision this Friday and will work on this PR right after that.
All good. Let me know if there's anything I can do to help :)
Note that I just got an email from Copernicus Marine Service that they are about to update their copernicusmarine package to v2.0.0, which has some breaking changes: https://help.marine.copernicus.eu/en/articles/9978784-what-s-new-in-the-pre-release-of-version-2-0-0-of-the-copernicus-marine-toolbox. So it might be good to already implement the new API in virtualship fetch.
Good to know about v2! I didn't see a timeline for the release of v2, only "soon". I think it would be good to pin <2 for the time being so that we aren't coupled to their development, and add v2 support later? That takes out the uncertainty, especially around the holiday season.
Based on the discussion from #68
Description
This PR updates the fetch command to download data specified by the expedition's area_of_interest. Key changes include:
- Integration with Schedule File: The fetch command now uses _get_schedule to load the Schedule YAML file from the specified expedition_dir. The Schedule now includes area_of_interest with spatial and temporal boundaries, which define the download region and timeframe for data subsets.
- Unified Data Download Configuration: All datasets, including bathymetry, are now defined in a single download_dict, making the code cleaner and allowing for easy addition or modification of datasets without changes to the function logic (a sketch of this structure is shown after this description).
- Consistent Use of area_of_interest: All dataset downloads utilize area_of_interest for spatial (min/max latitude and longitude) and temporal (start/end time) boundaries, ensuring consistency across all downloaded data.
Should we delete the whole scripts/ folder?
Status
This is a draft PR. I have not tested running the fetch command with an example expedition_dir to ensure data is downloaded correctly based on area_of_interest.
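For illustration, a minimal sketch of the download_dict idea (dataset IDs, variable names, the area_of_interest keys, and the download helper are all placeholders, not the actual implementation):

```python
# Every dataset is described in one place; adding a dataset means adding one entry.
download_dict = {
    "bathymetry": {"dataset_id": "example-bathymetry-id", "variables": ["deptho"]},
    "physics": {"dataset_id": "example-physics-id", "variables": ["uo", "vo", "so", "thetao"]},
}

def download_subset(**kwargs):
    """Stand-in for the real subsetting/download call (e.g. the Copernicus Marine toolbox)."""
    print("would download:", kwargs)

def fetch_all(download_dict, area_of_interest, output_dir):
    """Download every dataset in download_dict, subset to the same area_of_interest."""
    for name, spec in download_dict.items():
        download_subset(
            dataset_id=spec["dataset_id"],
            variables=spec["variables"],
            output_filename=f"{name}.nc",
            output_directory=str(output_dir),
            # all spatial/temporal bounds come from the expedition's area_of_interest
            **area_of_interest,
        )

# Example call with a dict of illustrative bounds:
# fetch_all(download_dict,
#           {"minimum_latitude": 40, "maximum_latitude": 50,
#            "minimum_longitude": -30, "maximum_longitude": -20,
#            "start_datetime": "2023-01-01", "end_datetime": "2023-01-10"},
#           "data")
```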