Open Zeitsperre opened 2 years ago
the idea of creating POSIX paths with escaped spaces runs counter to all known ethics and reason
True story. The great Aristotle once wrote "I do not like spaces in file names."
If there is a good reason to keep "AerChemMIP", I'd agree with hyphens. If we are only downloading "ScenarioMIP" (and their historical counterparts), I would say we drop the other names into /dev/null, never to be seen again.
Is AerChemMIP something that we plan on ever using? If not, is it really an issue to remove it?
If we want to keep both, I'd say that for both the folder structure and the catalog, using a hyphen would quickly turn into a nightmare due to all the possible combinations.
As we specifically downloaded that data for ScenarioMIP, I would only keep that. In the catalog, people might search for all ScenarioMIP data and would not find whatever is in ScenarioMIP-AerChemMIP.
Also, I think there are other experiments that are part of more than 2 MIPs. It would get complicated quickly... I suggest that we only ever keep one. We can have an ordered list of our preferences.
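To make the "ordered preference" idea concrete, here is a minimal sketch. It assumes the activity arrives as a space-separated string; the function name `pick_activity` and the order in `PREFERRED_ACTIVITIES` are hypothetical, not something the thread has settled on.

```python
# Hypothetical preference order; the thread does not fix one.
PREFERRED_ACTIVITIES = ("ScenarioMIP", "CMIP", "AerChemMIP")


def pick_activity(activity_id, preferred=PREFERRED_ACTIVITIES):
    """Return a single MIP name from a space-separated activity_id string."""
    candidates = activity_id.split()
    if not candidates:
        raise ValueError("activity_id is empty")
    # Walk the preference list and keep the first MIP that the file declares.
    for name in preferred:
        if name in candidates:
            return name
    # None of the preferred MIPs is present: fall back to the first one listed.
    return candidates[0]
```

With this order, "ScenarioMIP AerChemMIP" and "AerChemMIP ScenarioMIP" both resolve to ScenarioMIP, so the result does not depend on how the file happens to list them.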
I would be OK with the option of dropping (setting a preferred order would work well) if that behaviour could be configured for multiple use cases. Having some kind of option in the restructure_datasets function would be best, but how should it be specified?
When decoding, the validation step demands a string that is a member of the CMIP6 controlled vocabulary. I can change this to allow for a list of allowed values, then check that they are all members of the controlled vocabulary. This would better handle cases of files being shared between 3 or more MIPs (do those exist?).
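A rough sketch of what that relaxed validation could look like. This is not the decoder's actual code: `validate_activity_id` is a hypothetical name, and `CMIP6_ACTIVITIES` is only an illustrative subset of the controlled vocabulary.

```python
# Illustrative subset only; the full list lives in the CMIP6 controlled vocabulary.
CMIP6_ACTIVITIES = frozenset({"AerChemMIP", "CMIP", "LS3MIP", "LUMIP", "ScenarioMIP"})


def validate_activity_id(activity_id, vocabulary=CMIP6_ACTIVITIES):
    """Split a space-separated activity_id and check each name against the vocabulary."""
    names = activity_id.split()
    if not names:
        raise ValueError("activity_id is empty")
    unknown = [name for name in names if name not in vocabulary]
    if unknown:
        raise ValueError(f"not in the controlled vocabulary: {unknown}")
    # Return the names as a list so that 1, 2 or more MIPs are handled the same way.
    return names
```

Because the result is always a list, a file shared between three or more MIPs would not need any special case.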
Another option would be to create two entries for the file, one according to each MIP, and hard-link those files so that they can be found in either filetree (ScenarioMip/this/that/file.nc and AerChemMip/this/that/file.nc). This solves the catalogue issue by creating two entries while not increasing the disk space used. This approach is a bit overkill, but would be surprisingly easy to implement.
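For illustration, the hard-link approach could be sketched as follows. The helper name `link_into_activity_trees` and its arguments are hypothetical; the sketch also assumes that the source file and the target tree are on the same filesystem, since hard links cannot cross filesystems.

```python
import os
from pathlib import Path


def link_into_activity_trees(source, root, activities, relative_path):
    """Hard-link `source` to root/<activity>/<relative_path> for each activity."""
    source = Path(source)
    targets = []
    for activity in activities:
        target = Path(root) / activity / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # Skip names that already exist so the call can be repeated safely.
        if not target.exists():
            os.link(source, target)
        targets.append(target)
    return targets
```

Each target is just another name for the same data on disk, so the catalogue can list one entry per MIP while the data is stored only once.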
I feel like we all have opinions on this.
I like the magical symlink solution! If it is easy to implement!
I thought there might be experiments with more than 2 MIPs, because I had seen a well-populated column called "synergies with other MIPs" in the description of LS3MIP experiments (Van Den Hurk et al., 2016). But, looking at the list here (https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html), I only see duos.
If it is easy to have it in both ScenarioMIP and AerChemMIP without taking too much space, that is great!!
My understanding is that the 'real' file would only be at one location, but both filetrees would see it. So it takes the same space as only having it once in ScenarioMIP.
The only major issue with hard links is that certain operations need care. If you copy hard-linked files to another host without telling the tool to preserve hard links, you will break them (i.e. you will end up with two separate files). Conversely, if you modify one file in place, the other is modified as well. It's something that needs to be taken into consideration.
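Both caveats can be seen in a small, self-contained experiment. The file names are placeholders and `demo_hard_link_caveats` is a hypothetical helper; a plain `shutil.copy2` stands in here for any copy that does not preserve hard links.

```python
import os
import shutil
import tempfile
from pathlib import Path


def demo_hard_link_caveats():
    """Show that a plain copy breaks a hard link and an in-place edit is shared."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "original.nc"
        linked = Path(tmp) / "linked.nc"
        copied = Path(tmp) / "copied.nc"

        original.write_text("v1")
        os.link(original, linked)     # second name for the same data
        shutil.copy2(linked, copied)  # plain copy: a new, independent file

        # Writing in place through one name changes what the other name sees.
        with open(linked, "w") as handle:
            handle.write("v2")

        return {
            "link_is_same_file": os.path.samefile(original, linked),
            "copy_is_same_file": os.path.samefile(original, copied),
            "original_content": original.read_text(),
            "copied_content": copied.read_text(),
        }
```

When transferring between hosts, rsync has a `-H` (`--hard-links`) option for preserving hard links; whether it suits our transfer workflow would need to be checked.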
I can open a PR to address this in the coming weeks.
Just a reminder that we still have ScenarioMIP-AerChemMIP in the path. I think the conclusion here is to have ScenarioMIP and AerChemMIP with everything in ScenarioMIP-AerChemMIP in both directories with a hard link.
Not crucial as my catalog sees everything as ScenarioMIP. But this is a reminder that for the final form of /datasets, this needs to be addressed.
The decoder currently treats the entire string of attrs.activity_id for CMIP6-endorsed MIPs as the activity. However, I ran into a value listing two MIPs separated by a space in our database today. Since this field is used in creating the filetree, while it is technically valid to have spaces in a path, the idea of creating POSIX paths with escaped spaces runs counter to all known ethics and reason.
Proposal - hyphening:
ScenarioMIP AerChemMIP → ScenarioMIP-AerChemMIP
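As a minimal sketch of the hyphening proposal (the function name `hyphenate_activity_id` is hypothetical, and this is not the decoder's current behaviour):

```python
def hyphenate_activity_id(activity_id):
    """Join the whitespace-separated MIP names of an activity_id with hyphens."""
    # split() with no argument also collapses repeated or stray whitespace.
    return "-".join(activity_id.split())
```

A value with a single MIP passes through unchanged, so the same call could be applied to every file.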
Thoughts?