JuliaClimate / CDSAPI.jl

Julia API to the Climate Data Store (a.k.a. CDS)
https://cds.climate.copernicus.eu
MIT License
23 stars 6 forks source link

Reworking CDSAPI.jl to be more Julia-centric #37

Closed natgeo-wong closed 3 years ago

natgeo-wong commented 4 years ago
juliohm commented 4 years ago

Thank you @natgeo-wong for the nice refactoring and logging. Could you please update the tests accordingly so that we have a clearer picture of the proposed API?

Appreciate if you can review the PR @LakshyaKhatri. It is also an opportunity to figure out if that issue about decryption is still present.

LakshyaKhatri commented 4 years ago

Hello @juliohm! Sorry for the delayed response. I think we should test these changes locally before merging them. Also, since this is an external pull request, Travis is turning the environment variables off to prevent modification of the encrypted files and that is why the tests are failing.

Also, @natgeo-wong, thank you so much for these beautiful modifications. The code looks neat.

I will first test the current changes locally and then we will be ready to merge the changes. After that, I will check what can be done on the Travis part to resolve the issue.

natgeo-wong commented 4 years ago

I revamped the set of tests, because the output is quite different. You could add your own tests to this as well?

LakshyaKhatri commented 4 years ago

The current version of cdskeys() is showing timestamps like this:

time_stamp


IMO, having a timestamp won't be that much of a help. Also, for showing a timestamp we are adding an additional dependency of Dates.jl (which could be overkill). @juliohm, it would be helpful to have your views on this.

juliohm commented 4 years ago

Thanks for pinging @LakshyaKhatri , I think the dependency on Dates is fine as it is a stdlib shipped with Julia. My main concern is actually the verbosity of these messages. It is nice to have logging features, but we should consider adding a verbose option to do things quietly if possible. This option could default to true. What do you think?

natgeo-wong commented 4 years ago

The verbosity of these messages are very similar to the CDSAPI by Python, and actually I like them because I can monitor the progress of the downloads.

When I was using the initial version of CDSAPI, my datasets are extremely large and take a long time to download, so the CDSAPI script kept throwing "Request is running" every second, which I believe is even more verbose for large datasets.

Reason why I added the log for the cdskeys() is so that people are reminded that they shouldn't be continuously calling the cdskeys() function in a loop when once will do.

Also, I should add that when I have more time I'm going to try and add a progress bar option to HTTP.jl for downloads, which would be cool.

LakshyaKhatri commented 4 years ago

My main concern is actually the verbosity of these messages. It is nice to have logging features, but we should consider adding a verbose option to do things quietly if possible. This option could default to true. What do you think?

@juliohm could you provide some more information. I am still linking this with the timestamps that are added by @natgeo-wong in the info logs!

juliohm commented 4 years ago

Also, I should add that when I have more time I'm going to try and add a progress bar option to HTTP.jl for downloads, which would be cool.

That would be very cool indeed @natgeo-wong :100:

Also, I still think we should consider adding a verbose option to the download command. Sometimes users are not interested in verbosity, for example when running multiple processes in a computer cluster. I had similar issue with the Pkg manager recently and they are considering adding a verbose option there as well. Imagine multiple processes downloading different datasets.

We could just follow a pattern verbose && @info "foo" throughout the code. Shouldn't be that hard to add this option.

Regarding the dependency on Crayons, do we really want to add an extra dependency just for a boldface?

natgeo-wong commented 4 years ago

We could just follow a pattern verbose && @info "foo" throughout the code. Shouldn't be that hard to add this option.

I have done this in the most recent push

Regarding the dependency on Crayons, do we really want to add an extra dependency just for a boldface?

I could remove it haha, I just like nice formatting.

juliohm commented 4 years ago

Would it be ok to remove the Crayons dependency? Given that the package is very small, and that we are not using formatting all over the place, I think it would be better to remove it. I didn't check the loading time, but the less dependencies the better in the long-term.

juliohm commented 4 years ago

@natgeo-wong did you have a chance to take a look into this? It would be nice to update the PR and merge the improvements.

natgeo-wong commented 4 years ago

Hi @juliohm and @LakshyaKhatri, sorry I've just been pretty busy these past few days, I'll go through and have a look!

natgeo-wong commented 4 years ago

Done!

juliohm commented 4 years ago

Awesome @natgeo-wong , do we also need an update to the README? I will try test the PR locally tomorrow :+1:

natgeo-wong commented 4 years ago

An update to the README would be good. I'll work on the README over this coming week as a way to soothe myself when everything starts getting extremely crazy on Tuesday.

juliohm commented 4 years ago

Hey @natgeo-wong , do you think you could manage a final stretch on this PR this week? I know it is been a crazy week for you. :)

Looking forward to review the changes.

natgeo-wong commented 4 years ago

This week is good! Expect updates today or tomorrow! Just to confirm, the tests are passing for both you and Lakshya?

natgeo-wong commented 4 years ago

Done!

juliohm commented 4 years ago

I've just cloned the repo and tried the changes locally, and unfortunately I don't think we can merge the PR as is. There are many issues including the fact that tests with the ERA5 in GRIB format were erased, NetCDF was replaced with NCDatasets without a corresponding update in the Project.toml, and other source code formatting issues. Overall, it is more difficult to understand what the code is doing now compared to the master branch.

Taking a step back, this PR is basically adding a "verbose" feature, i.e. more control over the messages printed to the terminal. If you would like to start a fresh PR to add this feature, we'd be happy to review it.

natgeo-wong commented 4 years ago

I am sorry about the tests. I think I mentioned early on that I am not too sure how Testing works in Julia and I'm still trying to figure it out, because I do not use it regularly.

This PR is more than just "verbosity". When I first looked at CDSAPI.jl, I was very uncertain as to how to use it because of py2ju. I've gotten rid of that, because the crux of CDSAPI.jl should be on Julia, and not on "conversion" measures from Python to Julia. I didn't even know what the nature of the inputs were until I had a very close look at the code and realized that what py2Ju was doing was simply returning a dictionary. So I didn't even know what inputs I was supposed to be using, and that is offputting to people who are still trying out the basics of Julia.

I would argue that when one comes from Julia, one finds the master branch of CDSAPI harder to absorb because of conflicting information and priorities aimed to "converting" a python code or dictionary or information into Julia, when one should simply be using Julia code directly. I edited the code to be able to do that, specifically by restricting the types of the input argument, which should already be giving people familiar with Julia a sense of what sort of input arguments are required.

I've also pulled out the retrieval of the cdskeys as a function on its own, so people don't need to keep rereading the cdskeys each time when we use loops to retrieve large datasets (over different months/years, for example).

I can probably restore some of the GRIB tests, etc. from the master branch. I don't use those formats, so I am predisposed to rely only on NCDatasets, but I will do my best.

My key message is this: We really need to remove the dependency of py2ju, because that is just confusing.

TLDR, what I did was:

  1. Restrict the nature of input arguments to make it clearer for Julia users
  2. Remove py2ju because we really should not be relying on python syntax in Julia code. Instead, show users how to input the arguments in Julia code in the README/documentation (I'll update the README further).
  3. Pulling out minor functions (see below), so that they can be individually modified without disturbing the overall code. a. Retrieval of cdskeys b. Parsing of HTTP request
  4. Allow for the retrieval to fail, otherwise the initial one looks as though it will be stuck in a loop (lines 36-40 of the initial code).
juliohm commented 4 years ago

TLDR, what I did was:

  • Restrict the nature of input arguments to make it clearer for Julia users

Restricting input arguments is a Julia anti-pattern, see section "Over-constraining argument types" https://www.oxinabox.net/2020/04/19/Julia-Antipatterns.html

  • Remove py2ju because we really should not be relying on python syntax in Julia code. Instead, show users how to input the arguments in Julia code.

Isn't it true that the current master branch supports Julia dicts just fine? We are only providing an additional helper function py2ju in case someone doesn't want to build the dict by hand. The CDS website creates Python queries for you, and converting these queries by hand can be a waste of time. What we should do is clarify in the README that Julia dicts are supported.

  • Pulling out minor functions (see below), so that they can be individually modified without disturbing the overall code. a. Retrieval of cdskeys b. Parsing of HTTP request

To be honest I don't see much use in a separate cdskey function, but we can review a separate PR with this improvement. Regarding the parsing of HTTP, this is the part of the code that I thought could be reformatted. What about we start two separate PRs with these changes?

natgeo-wong commented 3 years ago

Hi, sorry for the silence.

I will make the PRs when I have time, now is crunch time with class projects coming up.

I still believe that specifying the input types will be helpful, especially if people are coming over from Python to Julia, for new users to understand exactly how the Julia portions of the code work. I stand by my previous statement that specifying it is necessary.

I will make separate PRs for the parts you've specified.

juliohm commented 3 years ago

Awesome. I will close this PR for now, and then we can start fresh PRs with individual changes when you are over your other tasks in your todo list.