SuperDARN / rst

Radar Software Toolkit (RST)
https://superdarn.github.io/rst/
GNU General Public License v3.0

File format transition #7

Closed egthomas closed 6 years ago

egthomas commented 7 years ago

Throughout the RST there is the following reference to supporting the dmap file format transition (circa July 2006):

/* File format transistion
 * ------------------------
 *
 * When we switch to the new file format remove any reference
 * to "new". Change the command line option "new" to "old" and
 * remove "old=!new". */

This was never actually implemented so users still have to set the -new flag when dealing with rawacf/fitacf/grdmap/cnvmap format files; by default RST software will assume files are in the older dat/fit/grd/map format. Since it's been over a decade since the last .dat file was written by a radar (and we presumably won't return to that file format), should we go ahead and make the changes to the software described in the comment above?

I can foresee one argument against this being that it will break too many existing users' scripts, however I think it should be a straightforward modification to make and will be much more intuitive to new students or users of SuperDARN data.

ksterne commented 7 years ago

Certainly can see the issue with removing the '-new' flag that will break existing external codes. Also, what do we do with the "old" option? Would we remove the '-new' flag and create an '-old' flag? Does this break even more codes that might not use the '-new' flag?

egthomas commented 7 years ago

Right now if someone wants to automatically process SD data they need to determine whether it is a dat/fit/grd/map file or a rawacf/fitacf/grdmap/cnvmap file to know whether to throw the -new flag. If a script can handle that it should be easy to apply the -old flag for the opposite case.

pasha-ponomarenko commented 7 years ago

OK, I think this is an occasion when I have to wear my WG chair's hat. :=O

I am on the conservative side here. There is no urgent need for this change, nor will it provide any extra convenience to the users; if anything, quite the opposite. Lots (most?) of people use RST as a "black box" (myself included), and making any extra modification to their trusted old scripts, written years ago by some smart students, might be an uphill task. We do not want to let the others know that we are working on RST by disrupting their operations, do we? (almost a joke!)

Furthermore, at this stage we are still looking at refactoring the existing RST 3.5, so it would be premature to change the format of the main routines without major changes to the routines themselves. The appropriate place for this would be our future PST (Perfect Software Toolkit): a refactored, optimised, cleaned-up, user-friendly software package utilising comprehensive data analysis algorithms (I must be dreaming!).

Also, it might be better to implement automatic recognition of the file type by its extension rather than rely on poorly informed users to set the (in)appropriate keywords, and this sort of change would make sense right now.

egthomas commented 7 years ago

And here I was hoping we might be able to slip this one past you Pasha! My stance is probably obvious from having raised this issue/question in the first place, so I'll just point out that reading the file extension is similarly user-dependent since one can put whatever extension they like on a dmap file (I think?).

pasha-ponomarenko commented 7 years ago

Well, that's something we could also automate in order to prevent further confusion.

ksterne commented 7 years ago

@egthomas, I agree with @pasha-ponomarenko here. The '-new' flags aren't causing any problems currently; their only drawback is that they are a few more characters on a command line. Yes, it's true that we have a bunch of comments about "some day" changing this. To maintain functionality, if we get rid of the '-new' flag, then we would have to add code that handles an '-old' flag. So that sounds like the same amount of code, and it's not any more streamlined.

It is true that you can call the dmap files any name you'd like. I mean, I could make a bks.slartibartfast file, but you'd probably quickly forget what processing level or code was used to create that file without opening the contents of the file.

kkotyk commented 7 years ago

I'm not disagreeing that the '-new' flag isn't hurting anything; it's just not a meaningful or helpful name for that argument. If we have to use arguments to determine what the input filetype is, I would much rather see something like "--input-type rawacf" or "--input-type dat", etc.

egthomas commented 7 years ago

The new/old flags do have a certain level of simplicity, in that we've only ever used two sets of file formats, with a clear transition to the datamap format in 2006. I think it would take a lot more effort to implement automatic file detection, or flags unique to each possible input file type, than to just make the changes laid out by Rob in my original post in this topic. Whatever change we make (if any; and I agree with Simon that now is the time to do it), we need to keep a big-picture view of the entire RST in mind.

egthomas commented 7 years ago

Ugh this is what I get for trying to comment on my phone.

pasha-ponomarenko commented 7 years ago

@egthomas and @sshepherd, it looks like I am missing some important points here. Are you arguing against implementation of automatic type detection for make_fit specifically, or for ALL other routines as well? Would the proposed automation of make_fit have any effect on the functionality of the "downstream" code? I am not that familiar with RST functionality, so if this is a more general issue, please explain it in a bit more detail. A block diagram, if you have one, would be helpful.

egthomas commented 7 years ago

Hi @pasha-ponomarenko - I don't have any block diagrams handy, but I'll try explaining in text. Apologies for the length.

Right now, the steps for processing our data from the most basic product (dat/rawacf; I'm going to ignore iqdats here) to our highest level (map potential file) look something like this:

  1. make_fit -new single_rad_data.rawacf > single_rad_data.fitacf
  2. make_grid -new single_rad_data.fitacf > single_rad_data.grdmap
  3. combine_grid -new single_rad_data.fitacf ... other_rad_data.fitacf > sd_data.grdmap
  4. map_grd -new sd_data.grdmap > sd_data.empty.cnvmap
  5. map_addhmb -new sd_data.empty.cnvmap > sd_data.hmb.cnvmap
  6. map_addimf -new sd_data.hmb.cnvmap > sd_data.imf.cnvmap
  7. map_addmodel -new sd_data.imf.cnvmap > sd_data.model.cnvmap
  8. map_fit -new sd_data.model.cnvmap > sd_data.cnvmap

Notice how all of these routines require the presence (or absence) of a -new flag to indicate whether the input file is in the "new" or "old" file format. The old file formats (dat, fit, grd, map) are a mix of binary (dat, fit) and plain-text files (grd, map) while the new formats (rawacf, fitacf, grdmap, cnvmap) are all in the datamap (or dmap) format.

So let's think about the automatic detection scheme proposed for make_fit. We'll probably only ever have .dat or .rawacf files as input for make_fit to produce fit or fitacf files (going away from the self-describing datamap format would be a huge waste of time and resources in my opinion). It's probably safe to assume that users aren't renaming the extensions on their dat or rawacf files, so it would be easy enough to implement the automatic detection for make_fit.

Ok, what about the other processing routines? Well first you also need to worry about the other tools for working with rawacf files: cat_raw, raw_cp, rawstat, and trim_raw. Let's hope the implementation of the automatic detection scheme was simple enough that you don't need to start linking other libraries or you'll have extra compilation testing to do.

Now you can move on to grd/grdmap format files where I think I've already shown one example where one of the larger SD groups (VT) uses non-standard file extensions for grid (and map) files. So not only do you need to make the automatic detection scheme flexible enough to handle the grd and grdmap formats, but you're also going to ask the VT crew to rename thousands of files and modify the software running their website to accommodate that. Likewise for the map/cnvmap files. And this is assuming no one else uses non-standard extensions on their higher-level data files (which I personally think is an excellent feature for allowing flexibility when testing file processing).

Maybe it's short-sighted to assume we'll never move to a newer file format, but given the overall resource limitations facing the SD community I don't see anyone putting much effort into that anytime soon. So for the foreseeable future we'll have the "old" (dat, fit, grd, map) and "new" (rawacf, fitacf, grdmap, cnvmap) file formats to deal with. I'll again note that the SD community has not produced any old dat files in over a decade (since ~June 2006) and the quantity of "new" file formats we've produced since then have vastly outnumbered the old due to the proliferation of radars.

So to summarize on why I think we should replace the -new flag with an -old flag instead of implementing an automatic detection scheme:

  1. We've been using datamap format files for over a decade so they aren't "new" anymore
  2. It is a simple job to modify the ~2 lines of code per file to swap the -new flag for the -old
  3. An automatic detection scheme needs to be able to differentiate between old/new files for a variety of formats
  4. An automatic detection scheme would need to be implemented and tested in dozens of routines which deal with more than just the raw > fit conversion or processing data
  5. Removing the flexibility to use non-standard file extensions would hurt our userbase

sshepherd commented 7 years ago

The attached overview is by no means perfect, but it should give you a starting point for developing an appreciation for the full RST. In addition to what Evan has said here, there is the mechanism for the software to allow input from stdin so it can be piped from a previous command (see the map_potential part in the overview; all of these commands are typically piped together). While the initial function call can do file extension checking, the subsequent ones cannot, and some sort of command-line argument is necessary.

The main point we are trying to make is that we are more interested in getting features into the RST: improved gridding, AACGM-v2, moving beyond RG96, etc. Several of these have been sitting in the queue for over a month. The new/old command-line option flip is easy and has already been implemented in these pieces of the RST software. A band-aid? Maybe, but it works and is ready to go. If you can put someone (Keith?) on implementing your new idea in all of the RST (not just fitACF, but also gridding, mapping, and plotting), then great, but I am still advocating for this change until I see that your suggestion is implemented. We have been using this new version of the RST since before the workshop and will continue to do so. I would expect there are others who would like to do so also.

pasha-ponomarenko commented 7 years ago

Thanks, guys, I need some time to digest your information.

@sshepherd, I see no attachment to your previous message. Did you mean Evan's explanation?

sshepherd commented 7 years ago

The new vs. old file flag should be changed for the next release of the RST.

egthomas commented 7 years ago

When typing my original manifesto I did not fully appreciate the subsequent point @sshepherd made regarding file extension checking when piping results from one map_* command to the next. So I still strongly believe/agree that we should make the switch sooner (i.e. the v4.1 release) rather than later.

ksterne commented 7 years ago

I feel as though it's better to move the discussion of -new versus -old back here, as this is where it's been discussed to date and originally came up. To quote the last thing from @pasha-ponomarenko on #100:

Well, what about using both flags for the time being, but with the non-flagged default calling the "new" function? Plus a deprecation warning for the following release. That would not cause many problems and seems to be a good compromise.

I've thought a little about what I think you're after here. Basically:

  1. keep both the -new and -old flags for the time being
  2. make the non-flagged default call the "new" (dmap) reading routines
  3. add a deprecation warning for the -new flag

Is that what you're thinking here? I guess the tricky part is for code that is built around the current scheme and tries to access pre-20060701 data by not using a flag, but may get some errors as the code will try to use the post-20060701 reading routines. In that case we may need to add deprecation warnings to the error outputs to make it clearer what's going on?

pasha-ponomarenko commented 7 years ago

In that case we may need to add deprecation warnings on the error outputs to make it clearer of what's going on?

Yep, plus a general warning that in the next release -new will be abandoned.

ksterne commented 7 years ago

That's the "add a deprecation warning" part on the -new in my list. Guessing that it would be deprecated with version 4.2?

For the scenario I've outlined, though, the breakage will happen immediately: you're trying to read in a file with the post-20060701 routines, but it's in the pre-20060701 format. I'd expect this to cause an error something like "error reading file". Here this could be expanded to:

"Error reading file (insert filename here). Likely this is caused by deprecating -new in version 4.1. If using pre-20060701 data, please use -old"

Make sense?

pasha-ponomarenko commented 7 years ago

Sure! ;-)

egthomas commented 7 years ago

Since this topic has apparently been re-opened for discussion, might I humbly suggest looking to the existing software for solutions before trying to invent new ones:

http://superdarn.thayer.dartmouth.edu/documentation/base/src.lib/task/option/OptionProcess.html

#include <stdio.h>
#include <stdlib.h>
#include "option.h"   /* RST option library (base/src.lib/task/option) */

struct OptionData opt;

int opterr(char *txt) {
  fprintf(stderr,"Option not recognized: %s\n",txt);
  return(-1);
}

int main(int argc,char *argv[]) {
  int arg;

  /* parse the command line; opterr is called on any unrecognized option */
  arg=OptionProcess(1,argc,argv,&opt,opterr);

  /* OptionProcess returns -1 if an option was not recognized */
  if (arg==-1) {
    exit(0);
  }
  return 0;
}

ksterne commented 7 years ago

@egthomas, this looks like it could be useful here. From what I gather, the opterr function would be added to the make_grid, map_grd, and other binaries' source code that is looking for the flags. Then, in the opterr function, we could add the deprecation warning messages. If this isn't what you were thinking, can you elaborate on how we might use the current software?

sshepherd commented 7 years ago

This is a much more disruptive change to the software than what was suggested (changing new to old). Kevin has shown one example that is specific to the CT processing. Other groups use a different mechanism. The old/new flag issue might be resolved for raw/rawacf files by the complicated resolution that has been suggested, but the old/new flag also applies to fit/fitacf, grd/grdmap, etc. files, and those have no date restrictions. Also, remember that Wallops does not conform. How about piping, where there is no filename?

Please stop thinking of this as the fitACF WG. Instead of coming up with complicated solutions specific to how you use the RST, make the obvious and simple change and spend a fraction of the time already wasted modifying your scripts.

Simon

ksterne commented 7 years ago

@sshepherd, define "this". I'm guessing you mean my comments on additions to the error reporting? I think @egthomas might have something that works a little better with the opterr function, though I haven't seen it in action. It looks like the only thing using it currently is in ~/codebase/superdarn/src.bin/tk/tool/map_ascii.1.8/map_ascii.c. Of course, though, it's not utilized in the OptionProcess call on line 184.

You do make a good point on the filename and how this will break when using a pipe (at least I think it would; at some point it's got to have something to reference, even if it's a temp file that's created... right?). It's also true that the date restrictions only apply to the raw-level data. So maybe instead of pre/post-20060701 we change it to ascii/dmap? Or do you have a better way to categorize them? Again, I think this goes away or changes with the opterr function, as it would address things a different way.

pasha-ponomarenko commented 7 years ago

May I humbly suggest having a WebEx telecon on this topic sometime early next week? I can host it here in the same way as Angeline did for the phase offset task force. From previous experience, I believe it would be much more productive because most of the issues can be resolved on the spot.

ksterne commented 7 years ago

I'm down, feel like sending a Doodle poll for when people can make it?

pasha-ponomarenko commented 7 years ago

Why not? Would you be able to do this?

egthomas commented 7 years ago

If we take a bigger-picture approach, the above logic could be used to trap any unrecognized command-line option for any binary. The code doesn't crash, and users get their error message.

sshepherd commented 7 years ago

I don't think a telecon with such limited scope is really worth the effort. After a great deal of development and significant contributions to the RST over the past 18 months, Evan and I are focusing on the other work that we have been neglecting. I think we all know the issues and the consequences of the old/new flag. This issue was started back in February, and the idea of automatic identification has come up in several forms. There are many more, and more significant, issues with this type of change than with simply accepting that requiring a "new" flag for file formats that are over a decade old no longer makes sense, and switching to requiring an "old" flag instead. That, or just do nothing. Anything else has significant consequences that have not been fully thought through. We have been using the "old" flag for months at Dartmouth with no trouble. The change was relatively painless.

If the scope of the telecon were broadened to talk about other issues that have resulted in insults and generally a bad working environment, we might be interested, but honestly we've moved on from further RST development and are working on other things now. I do think that testing needs addressing by the WG. Kevin S. seems to be doing the majority of the testing; however, I don't think testing on a live system (VT's web services, etc.) is appropriate, and in many cases it has slowed down the testing. Is Keith testing? Or someone else at Saskatoon? Saskatoon is a data mirror, has access to all the data, and has significant membership in the WG.

There are other significant issues that generated a lot of discussion but no resolution. A telecon to discuss those issues would be a better use of everyone's time.

Simon

pasha-ponomarenko commented 7 years ago

Why not? I am all for a broader agenda too, though I would not dismiss the current topic out of hand. Should we form a preliminary agenda, or just go with the flow? With respect to testing, I recall that Keith has some points to make on large and complex pull requests.

pasha-ponomarenko commented 7 years ago

Because the working environment issue has been raised, we can start with participants formulating how they see their priorities/responsibilities within this WG and how they believe we need to collaborate in order to achieve/fulfil them. This would lay a good common ground for resolving the outstanding issues.

egthomas commented 6 years ago

This is addressed in #111 for the 4.1 release.