Column label problem in Import Data

bruceravel / demeter

Process and analyze X-ray Absorption Spectroscopy data using Feff and either Larch or Ifeffit.

http://bruceravel.github.io/demeter


Column label problem in Import Data #28

Open tschoonj opened 8 years ago

tschoonj commented 8 years ago

Hi Bruce,

Some B18 staff members here at Diamond have discovered that not all columns of their 36-element detector XAS files are properly represented in the Column selection dialog window. Although all radio buttons appear to be present, only about half of the corresponding labels are shown, which I assume is due to space constraints.

I am attaching a screenshot that will make this situation clear.

The whole process of opening a file like this also takes quite long: about 10-15 seconds until the Column selection dialog opens. Keeping an eye on top reveals that the Larch part is fast, below one second, and the rest of the time is spent in Athena itself. While the file is being read, the GUI freezes.

As the B18 typically produces a file like this every 5 seconds during an experiment, the loading and processing of a large number of files becomes really slow for them in post-processing.

Any advice on how we could speed things up here?

We are using the latest version of Demeter installed on CentOS 6 machines.

screenshot

bruceravel commented 8 years ago

Re the column labels, I thought, in that situation, Athena would replace the labels with column numbers. I'll look into it.

Re the data every 5 seconds thing, don't use Athena. I don't mean that in a snippish or obnoxious way. I mean that I never wrote Athena with the intent of receiving data at that pace. She's simply not efficient enough for that. To put it another way, we didn't have 5 second scans at my old NSLS beamline, so Athena doesn't know how to cope with that. She and I have co-dependent coping issues :smile:

Here's what I suggest instead. Write a little program that converts the data from whatever form it appears every 5 seconds into a json file of the sort explained in the new version of the user's manual: http://bruceravel.github.io/demeter/documents/Athena/output/project.html You can rely on Athena's defaults for most of the parameters -- the main chore will be to generate the x and y (and optionally the i0 and signal) lists and package them correctly. In the args dictionary, I'd recommend setting datatype and label explicitly, but you should be able to rely upon Athena's defaults for all the rest.

Because of Ifeffit's (*) memory limitations, I would recommend limiting each json file to about 30 scans. You could also use this converter to do the right thing with all of the columns in each data file, which will obviate your first problem :exclamation:

The advantage here is that you use a custom tool to manage the data volume associated with 5 second scans. Your visitors can then "enjoy" Athena at leisure with the json-style project files that you generate.
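A minimal sketch of such a converter, in Python, might look like the following. It only uses what is named above: the x and y lists, the optional i0 and signal lists, and an args dictionary with datatype and label set explicitly. Everything else is an assumption made for illustration: the function names, the "xmu" datatype value, and especially the top-level layout of the file (here, one entry per scan keyed by its label). Check the project-file page of the manual for the actual layout before relying on this.

```python
import json


def make_record(label, energy, mu, i0=None, signal=None, datatype="xmu"):
    """Build one scan record with the fields named in the thread.

    ASSUMPTION: "xmu" as the datatype value and the exact nesting are
    illustrative; consult the Athena manual for the real schema.
    """
    record = {
        "x": list(energy),
        "y": list(mu),
        "args": {"datatype": datatype, "label": label},
    }
    if i0 is not None:
        record["i0"] = list(i0)
    if signal is not None:
        record["signal"] = list(signal)
    return record


def chunk(records, size=30):
    """Split the records into batches of at most `size` scans."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def write_projects(records, stem, size=30):
    """Write one json file per batch and return the file names.

    ASSUMPTION (hypothetical layout): each file is a dictionary keyed by
    scan label, so labels must be unique within a batch.
    """
    paths = []
    for n, batch in enumerate(chunk(records, size), start=1):
        path = f"{stem}_{n:03d}.json"
        project = {rec["args"]["label"]: rec for rec in batch}
        with open(path, "w") as fh:
            json.dump(project, fh)
        paths.append(path)
    return paths


# Example: 65 fake scans become three batches of 30, 30, and 5 scans.
scans = [
    make_record(f"scan_{i:04d}", [7000.0, 7001.0], [0.10, 0.12])
    for i in range(65)
]
batches = chunk(scans)
```

Calling `write_projects(scans, "b18_run")` would then write `b18_run_001.json` through `b18_run_003.json`, each holding no more than about 30 scans as recommended above.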

(*) I see from your comment that you are using the Larch backend. That's so exciting! I am thrilled that it's working well -- or at all, to be frank! You should still be mindful of Ifeffit's problems because your users will likely be using it when they get home. It's a slow transition.... Your observation of top is interesting. I'll try to do some profiling to see if I can figure out what the hold-up is. Although, are you sure that Athena per se is the slow-poke and not NFS or how Athena is using NFS? (Weren't you the person who diagnosed an NFS problem a while back...?)

bruceravel commented 8 years ago

Could you email me one of the files that is taking 10 seconds for the column selection dialog to make an appearance? Or post it on gist. Thanks.

tschoonj commented 8 years ago

I just emailed you a typical file that takes quite long to load.

We are indeed using the Larch backend: the main reason behind this is that Ifeffit is unable to read these multi-element detector files properly. It appears to read only the first x rows and ignores the rest of the file, which probably has to do with its memory limitations.

Otherwise the Larch backend appears to work quite well, even on a machine that is running multiple Athena sessions by multiple users simultaneously. Once or twice we ran into the following problem:

[xxxxx@yyyyyy ~]$ dathena
Traceback (most recent call last):
  File "/dls_sw/apps/python/anaconda/1.7.0/64/bin/larch", line 79, in <module>
    local_echo=options.echo, quiet=options.quiet)
  File "/dls_sw/apps/python/anaconda/1.7.0/64/lib/python2.7/site-packages/larch/xmlrpc_server.py", line 35, in __init__
    logRequests=False, allow_none=True, **kws)
  File "/dls_sw/apps/python/anaconda/1.7.0/64/lib/python2.7/SimpleXMLRPCServer.py", line 593, in __init__
    SocketServer.TCPServer.__init__(self, addr, requestHandler, bind_and_activate)
  File "/dls_sw/apps/python/anaconda/1.7.0/64/lib/python2.7/SocketServer.py", line 420, in __init__
    self.server_bind()
  File "/dls_sw/apps/python/anaconda/1.7.0/64/lib/python2.7/SocketServer.py", line 434, in server_bind
    self.socket.bind(self.server_address)
  File "/dls_sw/apps/python/anaconda/1.7.0/64/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 98] Address already in use

which we haven't been able to replicate.
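For anyone trying to reproduce this, the last frame of the traceback is an ordinary `socket.bind()` failure: the Larch XML-RPC server could not bind its TCP address because something else already held it. The sketch below, using only the Python 3 standard library, triggers the same error class on purpose by binding one port twice. It is an illustration of the error, not a diagnosis; which process held the port in the case above is unknown, and `try_bind` is a hypothetical helper name.

```python
import errno
import socket


def try_bind(host, port):
    """Try to bind a TCP socket; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        return None, exc.errno
    return sock, None


# Port 0 asks the operating system for any free port.
first, first_err = try_bind("127.0.0.1", 0)
port = first.getsockname()[1]
first.listen(1)

# A second bind to the same address fails while the first socket is open.
second, second_err = try_bind("127.0.0.1", port)
in_use = second_err == errno.EADDRINUSE

first.close()
```

On Linux `errno.EADDRINUSE` is 98, which matches the `[Errno 98]` in the traceback.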

If I may also make a suggestion: the Demeter installer appears to have a hard dependency on Ifeffit. Perhaps if the installer can verify that a Larch installation is already present, there is no need to force the user to also install Ifeffit?

About the JSON file: I assume you recommend this file format because it is processed faster in Athena? Is Larch still used then?

The B18 staff already have scripts that deal with summing up the counts in the different elements of the detector. They were even an absolute necessity to get their data read by the Ifeffit backend. I assume they could be modified to output to JSON, but I will need to discuss this with them.

tschoonj commented 8 years ago

Oh and I am indeed the NFS guy (although it was actually an access control lists problem) :smile:

In this case, although the file is indeed stored on an NFS partition, I doubt that it has a lot of influence here as it is only 1.7 MB.

bruceravel commented 8 years ago

> If I may also make a suggestion: the Demeter installer appears to have a hard dependency on Ifeffit. Perhaps if the installer can verify that a Larch installation is already present, there is no need to force the user to also install Ifeffit?

Edit the properties of the Athena desktop icon. Change the target from "dathena" to "lathena". There are "l" versions of the Athena, Artemis, and Hephaestus batfiles in perl/site/bin/ of the installation location. As with the "d", the "l" is silent when pronounced :smile:

It's up to you to install Larch on the machine.

It's been a while since I tested these, but I think they work....

bruceravel commented 8 years ago

> About the JSON file: I assume you recommend this file format because it is processed faster in Athena? Is Larch still used then?

That's not it at all. The json is simply a different format for the Athena project file. It has the same information contents as the conventional project file (which is a serialization of Demeter data structures), but is easier for other applications to write.

The reason I suggest this route is that it bypasses the whole process of repetitively importing data files. No one uses Athena because they love the column selection dialog. Writing qxas data directly to a project file lets you start using Athena for the stuff that is more satisfying.

If you take a step back, you could (and probably should) question the wisdom of presenting your users with data files like the one you sent me. Not only does your user not really want to interact with the column selection dialog, your user doesn't really want to figure out which columns in the file are interesting in the first place.

How often does your user change her mind about which columns to select? Not very often, I bet! So why are you making her think about that? Why not write a file that just has energy, signal, and I0 (and time, I suppose)? Isn't that what she wants?

Or, to make the same argument in the opposite way, why do you do deadtime corrections before writing this file? (Do you do deadtime correction? I bet you do because I see the word "xpress" in the file header.) Shouldn't you be writing a file with ICR and OCR values so the user can check your deadtime correction? The answer is: "of course not". For a staff member, or in the rare case where it matters for a user, you Diamond folk have an HDF5 file with all that stuff in it. If you present your user with deadtime corrected data for each element, why not take the next step and give them what they really want? Basically, I am suggesting that you remove even more of the friction from these measurements.

bruceravel commented 8 years ago

> Oh and I am indeed the NFS guy (although it was actually an access control lists problem) :smile:
>
> In this case, although the file is indeed stored on an NFS partition, I doubt that it has a lot of influence here as it is only 1.7 MB.

I agree that NFS has nothing to do with it. My current suspicion (although I have thus far spent much more time eating lunch than looking into it) is that the slowness is in trying to remove the background from wonky data. I suspect that Athena is assuming that these data are transmission because it sees "I0" and "It" as meaningful column labels. But there is no step in "It", so autobk grinds away for a while before giving up the ghost.

Another option, besides any of the other suggestions I have made in this issue thread, would be to write a filetype plugin that recognizes the file as being from B18 and measured with your 36 element detector. The advantage of using a filetype plugin is that it provides a mechanism for suggesting which columns to choose. If autobk on non-data is, indeed, the problem, a filetype plugin would be a good solution.

It'd take me less than 30 minutes to make one. I'll see what I can do. If it fixes the slowness problem, I'll send it to you for a try-out.

newville commented 8 years ago

@tschoonj @bruceravel I would definitely suggest that the beamline have a tool that reduces the data to something more sensible than a file with more than 20 columns. We (at my beamline) have produced such large numbers of data channels for years, and have been advocating this position for a very long time. We recently saw a similar question on the ifeffit mailing list about data from SSRL. FWIW, we don't have an Athena plugin, we have a standalone conversion tool.

It's nice that Athena is able to deal so well with multiple columns, but there are limits to what is possible. For example, it cannot do per-channel deadtime corrections. The whole premise of using ASCII column files is that they are human readable. Indeed, to import these into Athena, the user has to explicitly select columns -- the files are not fully parsed and digested. With more than 10 or 20 columns, that concept breaks down, and Athena is really hard to use. The file, though ASCII encoded, is essentially binary. An advantage of a beamline-specific tool is that it could use other binary file types (netcdf, hdf5, etc).

We see issues at our beamline with people using Athena with raw fluorescence XAFS files all the time. It's not that Athena gets it wrong or the people are dim, it's that they didn't use the right tool for the job. When we show them the right tool (we call it "deadtime correction", but doing the summing and file simplification is just as important), their lives get much, much simpler: Columns 1 and 2 are energy and mu_fluorescence.

In short, a beamline-specific conversion tool is the right solution to the problem you're facing.
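As a rough illustration of the "summing and file simplification" step, here is a minimal Python sketch that reduces a multi-column ASCII scan to two columns, energy and mu_fluorescence, by summing the per-element fluorescence columns and dividing by I0. The column positions, the `#` comment convention, and the function names are assumptions for the example, not the layout of any real beamline file, and no deadtime correction is attempted here.

```python
def parse_columns(text):
    """Parse whitespace-separated numeric rows, skipping '#' header lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append([float(tok) for tok in line.split()])
    return rows


def reduce_scan(rows, energy_col, i0_col, element_cols):
    """Return (energy, mu_fluorescence) pairs.

    mu_fluorescence is the sum over detector elements divided by I0.
    Rows with I0 == 0 yield NaN rather than raising an error.
    """
    reduced = []
    for row in rows:
        total = sum(row[c] for c in element_cols)
        i0 = row[i0_col]
        mu = total / i0 if i0 != 0 else float("nan")
        reduced.append((row[energy_col], mu))
    return reduced


# Hypothetical three-element file: energy, I0, then one column per element.
sample = """# energy i0 ff1 ff2 ff3
7000.0 100.0 1.0 2.0 3.0
7001.0 200.0 2.0 4.0 6.0
"""
reduced = reduce_scan(parse_columns(sample), energy_col=0, i0_col=1,
                      element_cols=[2, 3, 4])
```

A real tool would apply the per-channel deadtime correction before summing and would take its column map from the detector configuration rather than from hard-coded indices.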

tschoonj commented 8 years ago

Hi Bruce and Matt,

Many thanks for the very useful suggestions. I will discuss with the beamline staff how we can improve their data processing strategies based on your recommendations.

@bruceravel Many, many, many thanks for writing the Demeter plugin! I will give it a try tomorrow morning.