SuperDARN / rst

Radar Software Toolkit (RST)
https://superdarn.github.io/rst/
GNU General Public License v3.0

20160202.02*sto* error/warning message missing #90

Closed ksterne closed 6 years ago

ksterne commented 7 years ago

Here is another case raised in the DDWG where a corrupted file, 20160202.02sto.rawacf, appears to process into fit-level data without any kind of error message being output. So it looks as though everything is fine when it isn't.

The interesting thing is that the make_fit (and make_fitex2) process generates a seemingly good file. So I think the error handling is in place, but there isn't any output to let the user know there may be an issue.

Maybe this isn't so much something to add for fitacf 2.x, but rather something to make sure is handled well in #74.

egthomas commented 7 years ago

Are you sure this file isn't seg faulting on make_fit? From sd-work7 this is what I get:

evan@sd-work7:~$ make_fit -new -vb 20160202.0201.00.sto.rawacf > 20160202.0201.00.sto.fitacf
2016-2-2 2:1:0 beam=15
2016-2-2 2:1:3 beam=14
2016-2-2 2:1:6 beam=13
...
2016-2-2 3:35:43 beam=1
2016-2-2 3:35:46 beam=0
2016-2-2 3:36:0 beam=15
2016-2-2 3:36:3 beam=14
2016-2-2 3:36:6 beam=13
2016-2-2 3:36:9 beam=12
2016-2-2 3:36:12 beam=11
2016-2-2 3:36:15 beam=10
2016-2-2 3:36:18 beam=9
Segmentation fault (core dumped)
evan@sd-work7:~$ 

There will always be an output file generated by make_fit - that doesn't necessarily mean that it functioned correctly.

ksterne commented 7 years ago

Ah, I'm not seeing the seg fault because of how the script I'm using handles piping. Is a seg fault a good error/warning message, though?

kkotyk commented 7 years ago

@ksterne No, it's a terrible way to exit a program. I've noticed that fitacf, etc. doesn't really segfault; it's almost always in the dmap code.

sshepherd commented 7 years ago

To be clear, one does not choose to seg fault; i.e., the programmer is not choosing to exit a program by "seg faulting". Rather, it is a run-time error that occurs when a program attempts to access memory that it should not.

Simon

egthomas commented 7 years ago

Is it possible the seg fault was missed in #83 because of the piping being used in @ksterne's script?

kkotyk commented 7 years ago

I think that a segfault message is sent to stderr, so if his script isn't logging stderr, then it is possible.
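A minimal shell sketch of this point (the crashing command is a stand-in, not an actual make_fit invocation): the "Segmentation fault (core dumped)" notice is printed by the invoking shell, not by the crashing program, so redirecting the program's own stdout/stderr to a log will not record it. The reliable signal is the exit status, which the shell sets to 128 plus the signal number (SIGSEGV is 11, giving 139):

```shell
# Stand-in for a crashing make_fit run: the child shell kills itself with
# SIGSEGV. Redirecting its streams captures nothing useful, but the parent
# shell still sees exit status 128 + 11 = 139.
sh -c 'kill -s SEGV $$' > out.log 2> err.log
status=$?
if [ "$status" -eq 139 ]; then
    echo "crashed with SIGSEGV"
fi
```

This is also why a wrapper script that only checks for the existence of an output file can report success: make_fit writes records up to the moment it crashes, leaving a plausible-looking fitacf behind.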

egthomas commented 7 years ago

So can we close this issue? It sounds like we're in agreement that this particular file (20160202.0201.00.sto.rawacf) produces a seg fault when passed through make_fit, and that this is a case of file corruption rather than a software issue.

sshepherd commented 7 years ago

Yes

ksterne commented 7 years ago

I'd say no here. Do we know whether the issue is with the data being written in a bad way, or with part of the code making an assumption that doesn't always hold? And even if the data is corrupted/unreadable, can we modify the code to give a more helpful error message?

A good example of the code making an assumption: somewhere in the make_fit code there was an assumption that the acfd data array is always present. But some files (I can't give an example from home) have a header section (time, beam, etc.) while the acfd data array is missing. Do we know whether something like this is happening in this file?

sshepherd commented 7 years ago

A seg fault is a run-time error. There is no error message from the program because the program crashes. Any file that causes our standard processing software to crash should be immediately blacklisted.

If you want to dig into the code that's fine, but it is extremely disruptive to have files that cause seg faults in the data stream. These files can be put back into circulation if a 'fix' is made. For that to happen it would be useful if the DDWG website was updated regularly to indicate which files are blacklisted. If it is an issue of limited access to editing the website then perhaps the WG chairs should think about moving the website to a more central location.

ksterne commented 7 years ago

To be clear, I'm not in favor of keeping this file in general distribution. Rather, the issue here is that the code does not generate any kind of warning/error message about why there is a run-time error. What about this file is causing the error, and why does the code not give the user better information than "this file is bad"?

egthomas commented 7 years ago

@kkotyk and I looked into this file a few weeks ago before asking the DDWG to blacklist it:

Hello all,

I've identified a Stokkseyri rawacf file which produces segmentation faults when trying to process with make_fit. Keith Kotyk has also examined this file and found that some kind of corruption causes the dmap routines to read past the file boundaries (sorry if I butchered your explanation Keith). The file in question is

20160202.0201.00.sto.rawacf.bz2

I suggest that we blacklist this file immediately on both data mirrors to remove it from the general distribution.

Cheers, Evan

kkotyk commented 7 years ago

@ksterne Usually when these files give problems, I use the Python routines to read them in and see what's happening. I've actually included an exception system that does say what's wrong with a file if it fails to parse.

kkotyk commented 7 years ago

What about this file is causing an error and why does the code not provide better information to a user aside from 'this file is bad'?

Because the implementation and error handling of the C dmap routines are very lackluster.

sshepherd commented 7 years ago

ksterne, your response to blacklisting this file seemed clear from your last message:

"I'd say no here."

I think we all agree that this file should be blacklisted.

Diagnostics for problem files can be added to your processing scripts or to separate programs, but they should not (in my opinion) be added to the standard RST data processing programs. Any additional error checking will inevitably slow down processing of files and make mass-processing, which is necessary for large statistical studies, more difficult.

If I recall correctly, APL used to run a separate program on files that would check for various issues before putting them into the data stream. I was under the assumption that that practice was being performed by one of the data mirror entities (VT or USASK), but perhaps that is no longer the case. If that is true, perhaps similar data quality protocols should be reinstated.

Simon

egthomas commented 7 years ago

I'm still not convinced this or #83 are real issues. Can we agree to close these?

ksterne commented 7 years ago

Sorry for the slow response here. I got caught up making repairs to fhw and then preparing for the eclipse.

I think there were some double-negative issues in my earlier message. To be clearer, I am all for blacklisting files that are corrupted and unusable.

I do not agree with closing these issues, though. It seems agreed that the issue could very well be in dmap. Is dmap not a part of the RST codebase? If it is, then the code should be updated so that it does not seg fault and instead provides a better error message.

As for error checking slowing down processing and mass-processing, this is something that can be overcome as better processing hardware becomes cheaper. I played around late last week and was able to process a month's worth of fit-level data into grid- and map-level data for both hemispheres in about an hour on a server we purchased four years ago.

egthomas commented 7 years ago

The number of rawacf files which produce segmentation faults is only a tiny fraction of the entire data set. Rather than spending enormous effort debugging the dmap routines for these fringe cases, why not just run them through @kkotyk's more rigorous Python-based dmap code as we encounter them, to identify whether they are truly recoverable?

I could be mistaken but I don't think we've seen a proven case where the python code is capable of fully reading a rawacf file that the C code is not. My understanding is that both sets of code will fail, except the python code may return a more informative error message than segmentation fault.

ksterne commented 7 years ago

Weren't fringe cases debugged in #52? It seemed as though there were a few additional commits after the initial pull request in order to satisfy those fringe cases.

I agree this only affects a few files right now. I'd rather spend the time "now" (I'm not sure anyone has looked at these), so that time can be saved later when new problem files pop up. Otherwise we'd repeat the effort of confirming it is something in the dmap code rather than another part of the code. I'm happy to take on resolving this issue if no one else wants to.

egthomas commented 7 years ago

I'm sorry, but that's a disingenuous attempt to draw a parallel between my use of the word "fringe" in two unrelated scenarios. There is a fundamental difference between the code falling over completely on a corrupted file and the code writing erroneous values without any indication. The crash is immediately apparent to the user while the silent errors are not; the latter introduce false information into higher-level data products, while the former, by its very nature, cannot.

ksterne commented 7 years ago

Fair enough and I certainly agree that effort is good to keep code from writing erroneous values without indication. I'll close this and #83 if there's no further comment on them by tomorrow. It seems as though we are not interested in improving the dmap code at this time.

sshepherd commented 7 years ago

Kevin,

dmap is not the problem. Modifying it in an attempt to tell you what is wrong with one file is not an improvement. There are lots of reasons binary files can become corrupt and we should not be trying to determine why a very small fraction of files are corrupt using dmap. Doing so will inevitably slow down processing. You should spend some time modifying your processing scripts to trap seg faults and then use a separate utility to further investigate what might be wrong with the file.

Simon
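
A rough sketch of the approach described above, assuming a POSIX shell processing script (the function name is illustrative, and the crashing command stands in for a real make_fit run): trap signal exit statuses in the script, set the offending file aside, and investigate it later with a separate utility:

```shell
# Hypothetical processing loop: any file whose processing exits with a status
# above 128 (killed by a signal; SIGSEGV gives 128 + 11 = 139) is set aside
# for separate investigation instead of entering the data stream.
process_one() {
    # stand-in for: make_fit -new "$1" > "${1%.rawacf}.fitacf"
    sh -c 'kill -s SEGV $$' 2>/dev/null
}

suspect=""
for f in 20160202.0201.00.sto.rawacf; do
    process_one "$f"
    status=$?
    if [ "$status" -gt 128 ]; then
        suspect="$suspect $f"    # candidate for blacklisting
    fi
done
echo "suspect files:$suspect"
```

Files flagged this way can then be handed to a separate diagnostic tool (or reported to the DDWG for blacklisting) without adding any error-checking overhead to the standard processing path.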

kkotyk commented 7 years ago

Simon

I think that's a very dangerous opinion to have. Something like what you are suggesting would have to be documented and heavily underlined. You are a seasoned veteran at data processing, but new students or those just starting with our data might get trapped in pitfalls. Also, dmap is not used just by RST; it is a core utility that most people would have to use in order to write custom projects, such as davitpy. I think it's entirely reasonable that people could want to use parts of RST for their research, especially an essential function such as dmap, and expect it either to work or to throw proper errors. That being said, from first-hand experience, fixing dmap would be a massive undertaking in C.

sshepherd commented 7 years ago

Keith,

I appreciate your thoughts, but strongly disagree. What I described is minimal error checking for the core processing routines. There is a minimal expectation that files conform to certain standards; that isn't always the case. APL used to look at every file before sending it out in the data distribution stream. This process was separate from the standard processing, which should remain minimal in its error checking as a trade-off for speed that allows for rapid processing. It is perfectly acceptable for others to develop more robust (and slower) processing routines, but the core reading code (dmap) should emphasize speed over error checking.

Maybe the chair will weigh in at some point?

Simon

pasha-ponomarenko commented 7 years ago

@sshepherd, not until I understand the potential consequences of both approaches.

sshepherd commented 7 years ago

Ok. I guess we will just wait and pick up the conversation at a later time.

pasha-ponomarenko commented 7 years ago

Yep, it seems to be getting a bit too technical and requires a bit more education on my side. I would also like to see some numbers on how much slower/faster the processing routines would become.

egthomas commented 6 years ago

Closing this issue since we agreed on the telecon that it stemmed from a blacklisted file.