ArgoCanada / argoFloats

Tools for analyzing collections of oceanographic Argo floats
https://argocanada.github.io/argoFloats/index.html
17 stars 7 forks source link

getIndex should be able to accept a locally downloaded file instead of needing to download? #577

Closed richardsc closed 2 years ago

richardsc commented 2 years ago

See #576 . It seems reasonable that getIndex() should be able to read a locally downloaded index file (and create the correct .rda in the argoDefaultDestDir() accordingly).

An additional advantage to this is that it makes for a completely reproducible workflow, because the index won't update in the middle of an analysis.

dankelley commented 2 years ago

We don't want to break existing code, and we don't want to get into guessing games. Therefore, we cannot code this by checking whether file is a string naming a local file. Just as an example, the default value is file="core", and a user might have a file of that name that does anything, maybe something to do with the geophysics of the deep earth :-)

I think the best would be to add a new arg called "localFile" or something. If that is set, then we read that file, and ignore the args "file", "server" and "age arguments. This would be pretty easy to code, and document. The former means we aren't likely to break existing functionality, and the latter means we won't confuse users.

Any thoughts on this, @j-harbin? I can code it tomorrow morning, or you could take a stab (in a branch named for this issue).

dankelley commented 2 years ago

Having turned this offer in my mind, I think the solution is to have a new logical parameter, local. I always like to start by writing docs, and so I hope to get comments on the following.

1. Docs for new arg. I propose to add

#' @param local a logical value that indicates whether `file` is the name of a local file that
#' holds a local file, as opposed to an indication of a non-local file to be downloaded from a `server`.

2. Change of docs for filename arg. I propose to change

#' @param filename character value that indicates the file name on the server, as in...

to something like

#' @param filename character value that indicates either the name of an index file. If `local`
#' is TRUE, then this is the name of a local file, as might have been downloaded previousl
#' from an argo server.  Otherwise, `filename` indicates the file name on the server, as in...
dankelley commented 2 years ago

I am going to code this today

dankelley commented 2 years ago

Oh my. I've been working on this for a few hours and am near the end but now I see the problem with the idea. I discovered before that the tar.gz file does not contain sufficient information to reliably get profiles. I don't have full notes in front of me, but here are the comments I see in the code for getIndex():

       ## The Construction of the remote filename is tricky was changed on 2020-04-30. Previously,
        ## we used "ftpRoot", inferred from the "# FTP" line in the header in the index file.
        ## However, as discussed at https://github.com/ArgoCanada/argoFloats/issues/82,
        ## this decision was faulty, in the sense that it sometimes worked and sometimes
        ## failed, for the *same* index object -- that is, the file structure on the server
        ## must be changeable.  So, as a test (which will need to be reevaluated over time),
        ## we switched to using "server" instead of "ftpRoot".  That means that we seek
        ## the NetCDF files from the same source as the index, and *not* from the source listed
        ## in the "# FTP" line in the header within that index.

So, adding this local feature is not straightforward.

Anyway, it is trivial to use standard R methods to save the results of a call to getIndex(). Just do something like

library(argoFloats)
ai <- getIndex()
save(ai, file="argoIndex_20220625.rda")

and later use

load("argoIndex_20220625.rda")
# now use `ai`...

instead of

ai <- getIndex()
# now use `ai` ...
dankelley commented 2 years ago

I am pushing what I have to issue577, but note that it is not working, and my changes will be abandoned.

dankelley commented 2 years ago

@j-harbin you might want to read through this next week. I will close this issue at noon on Monday.

j-harbin commented 2 years ago

I think this seems sensible. When I go over to issue577 branch, I have a few suggestions.

  1. I'm receiving the following error, which is likely a result of me not understanding how to put it a local file, but maybe this suggests we should add some examples to the documentation. (Note I removed some of the Warning print out statements, as there were over 100 that said the same thing)
library(argoFloats)
a <- getIndex(filename="/Users/jaimiekeeping/data/argo/D4901828_034.nc", local=TRUE)
#> Warning in readLines(destfileTemp, 100): line 1 appears to contain an embedded
#> nul
#> Warning in readLines(destfileTemp, 100): line 2 appears to contain an embedded
#> nul
#> Warning in readLines(destfileTemp, 100): line 3 appears to contain an embedded
#> nul
#> Warning in readLines(destfileTemp, 100): line 4 appears to contain an embedded
#> nul
#> Warning in readLines(destfileTemp, 100): line 5 appears to contain an embedded
#> nul
#> Error in strsplit(first[1 + lastHash], ",")[[1]]: subscript out of bounds

Created on 2022-06-27 by the reprex package (v2.0.1)

  1. In the docs for getIndex(), where the local argument is described, it says "a logical value, FALSE by default, indicating whether file is the name of a local file that was previously downloaded from a server.", it should say filename and not file
  2. When I do the following (and forgot to change local=TRUE, I received the following error. I recommend mentioning something such as "or set local=TRUE if you are trying to read a local file".
    library(argoFloats)
    a <- getIndex(filename="/Users/jaimiekeeping/data/argo/D4901828_034.nc")
    #> Error in getIndex(filename = "/Users/jaimiekeeping/data/argo/D4901828_034.nc"): filename="/Users/jaimiekeeping/data/argo/D4901828_034.nc" doesn't exist. Try one of these: "core", "bgc", "bgcargo", "traj", "bio-traj", "synthetic","ar_index_global_prof.txt.gz", "argo_bio-profile_index.txt.gz", "ar_index_global_traj.txt.gz", "argo_bio-traj_index.txt.gz", or "argo_synthetic-profile_index.txt.gz".

Created on 2022-06-27 by the reprex package (v2.0.1)

Created on 2022-06-27 by the reprex package (v2.0.1)

  1. Is the idea that if a person provided a local file to "get" that the output of getIndex would be an argos type and not a index type, if so I think the docs need to be updated for this (I could be misunderstanding this).

I'm happy to Z about this if you think that's useful.

dankelley commented 2 years ago

Is 0800h ok for a Z?

j-harbin commented 2 years ago

You bet!

richardsc commented 2 years ago

Not going to be able to join your zoom (you folks are too early for me!), but at the very least Jaimie I see that you are trying to read a profile netcdf not the index CSV file that was proposed here.

(IIRC readProfiles() should be able to read a bunch of already-downloaded netcdf files (which is the point), but there is also read.argo() in oce.)

richardsc commented 2 years ago

When I try branch issue577, after downloading an index file, I get:

> library(argoFloats)
> i <- getIndex('~/Downloads/ar_index_global_prof.txt.gz', local=TRUE)
Error in save(argoFloatsIndex, file = destfileRda) : 
  object 'destfileRda' not found
dankelley commented 2 years ago

Here's the upshot from the meeting that @j-harbin and I discussed ...

I think this is all moot, because, as I explained in https://github.com/ArgoCanada/argoFloats/issues/577#issuecomment-1166258368, users will need to keep a record of the server from which they got the index. That information is required for getProfiles() to recover files.

Long story short, a tar.gz file from the server is not sufficient to later download profile files, simply because those .tar.gz files do not state the server. (Different servers hold different indices, and different profile files.

We looked into this in detail a year ago, and I think the comment to which I refer above has some pointers.

I propose to simply drop this planned change. It won't work for users. And the alternative is trivial: just do

i <- getIndex()
save(i, file="my_nice_index.rda")

and later on, just do

load("my_nice_index.rda")

and then proceed to use i as though it had come from a getIndex() call done today.

@j-harbin, please add comments if you wish to do so.

@richardsc, please consider closing this.

richardsc commented 2 years ago

I think I understand now. Sorry for the confusion! (and wasted time). Too bad the indices aren't more consistently formatted 😞