Closed vbalderdash closed 4 years ago
OK, this is really straightforward. So then some next things to add would be:

- a `datastart` method; return a numpy datetime64 `tstart`, `tend`
- a timedelta64: probably `tsec = lmad['time'].astype('timedelta64[s]')` and then `tstart+tsec`
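To illustrate the suggested time handling, here is a minimal sketch. The `time` values and the start-of-day stamp are hypothetical; `pd.to_timedelta` is used as the explicit spelling of the cast suggested above, since it keeps the fractional seconds.

```python
import numpy as np
import pandas as pd

# Hypothetical UT-seconds-of-day values, as they appear in the data section.
lmad = pd.DataFrame({'time': [12599.289137587, 12599.999582905]})

# Assumed start-of-day timestamp (tstart) pulled from the file header.
tstart = np.datetime64('2018-01-01T00:00:00')

# Convert float seconds to timedelta64 and add to the day start.
tsec = pd.to_timedelta(lmad['time'], unit='s')
times = tstart + tsec
print(times)
```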
Per number 4. The number of stations contributing to each source should already be archived in the LMA file, right? Or are you thinking of a selection based on the list of which stations were contributing?
Not from all networks. OKLMA data I used back in the day had the station count, but you have to count the bits for the WTLMA data and others I've seen.
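Counting the bits can be sketched as below; the helper name and mask values are illustrative (this is not the lmatools implementation).

```python
def station_count(mask_str):
    """Count contributing stations by counting set bits in a hex mask
    string like '0x364' (one bit per station)."""
    return bin(int(mask_str, 16)).count('1')

# Masks from the example rows below: each has 5 contributing stations.
print(station_count('0x364'), station_count('0x724'))
```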
```
Data: time (UT sec of day), lat, lon, alt(m), reduced chi^2, P(dBW), mask
Data format: 15.9f 12.8f 13.8f 9.2f 6.2f 5.1f 5x
Number of events: 257422
*** data ***
12599.289137587 3.16553416 -113.07229594 257639475.42 47.17 70.4 0x364
12599.999582905 28.83720144 -112.52882903 746668.56 38.71 19.7 0x724
```
Given the differences in the two formats, it seems that we actually need to parse on the basis of the `Data:` line above.
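Parsing on the basis of the `Data:` line could look something like this sketch, using the example header above:

```python
# Hypothetical header line, matching the example format above.
header = "Data: time (UT sec of day), lat, lon, alt(m), reduced chi^2, P(dBW), mask"

# Strip the "Data:" prefix, then split on commas and trim whitespace.
columns = [c.strip() for c in header[len("Data:"):].split(",")]
print(columns)
```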
Alright, I've been staring at the one format too long! That should be an easy adjustment.
Here's a snippet to automatically pull the column headers from the file header. I'm sure there's a more fundamental method to parse the text line from the zipped file, but this does get the job done.
```python
def datastart(self):
    # Find the line number of the "*** data ***" marker that ends the header.
    with open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.rstrip() == "*** data ***":
                break
    return line_no

def dataformat(self):
    # Find the "Data:" line, then let pandas re-read just that line to
    # split out the column names.
    with open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.startswith('Data:'):
                break
    names = pd.read_csv(self.file,
                        header=None, skiprows=line_no, nrows=1)
    names = names.values.tolist()[0]
    names[0] = names[0][6:]  # strip the leading "Data: "
    return [x.strip(' ') for x in names]

def dataformat_gz(self):
    # Same as dataformat, but for gzipped files (lines are bytes).
    with gzip.open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.startswith(b'Data:'):
                break
    names = pd.read_csv(self.file, compression='gzip',
                        header=None, skiprows=line_no, nrows=1)
    names = names.values.tolist()[0]
    names[0] = names[0][6:]
    return [x.strip(' ') for x in names]

def datastart_gz(self):
    with gzip.open(self.file, 'rt') as f:
        for line_no, line in enumerate(f):
            if line.rstrip() == "*** data ***":
                break
    return line_no

def readfile(self):
    # Dispatch on the file suffix, then read everything after the
    # "*** data ***" marker as whitespace-delimited columns.
    if self.file[-4:] == '.dat':
        lmad = pd.read_csv(self.file, delim_whitespace=True, header=None,
                           skiprows=self.datastart() + 1)
        columns = self.dataformat()
    if self.file[-7:] == '.dat.gz':
        lmad = pd.read_csv(self.file, compression='gzip', delim_whitespace=True,
                           header=None, skiprows=self.datastart_gz() + 1)
        columns = self.dataformat_gz()
    lmad.columns = columns
    return lmad
```
Added some pieces for pulling more of the header information via Pandas and separating out all of the stations in the mask using the `mask_to_int` function from lmatools. Time issues 1-3 are addressed by the addition of a new column called 'Datetime'.
Potential issue: Header information (station locations, % contributions, etc.) is read in with a fixed-width format, since station names may contain white space. I'm unsure how consistent those widths are among existing file formats.
```python
import gzip
import datetime as dt

import numpy as np
import pandas as pd


def mask_to_int(mask):
    """Convert object array of mask strings to integers."""
    if len(mask.shape) == 0:
        mask_int = np.asarray([], dtype=int)
    else:
        try:
            # mask is a plain integer
            mask_int = np.fromiter((int(v) for v in mask), int)
        except ValueError:
            # mask is a string representing a base-16 (hex) number
            mask_int = np.fromiter((int(v, 16) for v in mask), int)
    return mask_int


class lmafile:
    def __init__(self, filename):
        self.file = filename
        with gzip.open(self.file) as f:
            for line_no, line in enumerate(f):
                if line.startswith(b'Data start time:'):
                    timestring = line.decode().split()[-2:]
                    self.startday = dt.datetime.strptime(timestring[0], '%m/%d/%y')
                    # Full start time and second, might be unneeded
                    # self.starttime = dt.datetime.strptime(timestring[0]+timestring[1],'%m/%d/%y%H:%M:%S')
                    # self.startsecond = (starttime-dt.datetime(starttime.year,starttime.month,starttime.day)).seconds
                # Find rows for station information
                if line.startswith(b'Station information:'):
                    self.station_info_start = line_no
                if line.startswith(b'Station data:'):
                    self.station_data_start = line_no
                if line.startswith(b'Metric file:'):
                    self.station_data_end = line_no
                # Find mask list
                if line.startswith(b'Station mask order:'):
                    self.maskorder = line.decode().split()[-1]
                # Pull data header
                if line.startswith(b'Data:'):
                    self.names = [x.strip(' ') for x in line.decode()[5:-1].split(",")]
                # Find start of the data
                if line.rstrip() == b"*** data ***":
                    break
        self.data_starts = line_no
        # Station overview information
        self.overview = pd.read_fwf(self.file, compression='gzip',
                                    colspecs=[[10, 11], [13, 30], [30, 35], [35, 43], [43, 48],
                                              [48, 56], [56, 61], [61, 68], [68, 73]],
                                    names=['ID', 'Name', 'win(us)', 'dec_win(us)', 'data_ver',
                                           'rms_error(ns)', 'sources', '<P/P_m>', 'active'],
                                    header=None, skiprows=self.station_data_start + 1,
                                    nrows=self.station_data_start - self.station_info_start - 1)
        # Station locations
        self.stations = pd.read_fwf(self.file, compression='gzip',
                                    colspecs=[[10, 11], [13, 32], [32, 43], [44, 56], [56, 66], [66, 70]],
                                    names=['ID', 'Name', 'Lat', 'Long', 'Alt', 'Delay Time'],
                                    header=None, skiprows=self.station_info_start + 1,
                                    nrows=self.station_data_start - self.station_info_start - 1)

    def readfile(self):
        # Read in data
        lmad = pd.read_csv(self.file, compression='gzip', delim_whitespace=True,
                           header=None, skiprows=self.data_starts + 1)
        lmad.columns = self.names
        # Convert seconds column to new datetime-formatted column
        lmad.insert(1, 'Datetime',
                    pd.to_timedelta(lmad['time (UT sec of day)'], unit='s') + self.startday)
        # Parse out which stations contributed into new columns for each station
        for index, items in enumerate(self.maskorder[::-1]):
            lmad.insert(8, items, (mask_to_int(lmad["mask"]) >> index) % 2)
        # Count the number of stations contributing and put in a new column
        lmad.insert(8, 'Station Count', lmad[list(self.maskorder)].sum(axis=1))
        return lmad
```
Nice, thanks! Could you add a complete version of your reader class in something like `xlma-python/pyxlma/lmalib/io/read.py`?
pyxlma will live at the same level as the main README, and will be the eventual package name. We'll probably need to rearrange/rename later. lmalib is to separate the plot-agnostic calculations and readers from any GUI or plotting stuff that makes use of data processed in lmalib.
Regarding the header format, yeah – there's no `Data format: 15.9f 12.8f 13.8f 9.2f 6.2f 5.1f 7x` line like there is for the main data section. I think we just live with it. Maybe there are some simple checks that could be done to throw a warning/error if, say, the lon/lat for stations aren't close to the coord center.
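One possible form of such a check, as a sketch; the helper name, tolerance, and coordinates are all hypothetical.

```python
import warnings

def check_station_coords(stations, center_lat, center_lon, tol_deg=2.0):
    """Warn if any parsed station location is implausibly far from the
    network's coordinate center, which may indicate a bad fixed-width parse.
    stations is an iterable of (name, lat, lon) tuples."""
    for name, lat, lon in stations:
        if abs(lat - center_lat) > tol_deg or abs(lon - center_lon) > tol_deg:
            warnings.warn(f"Station {name} at ({lat}, {lon}) is far from the "
                          f"network center ({center_lat}, {center_lon}); "
                          "the header parse may be off.")

# A plausible station passes silently; a garbled one would trigger a warning.
check_station_coords([('W-Test', 33.6, -101.8)], 33.6, -101.9)
```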
Given the lack of a format line to check against, the fixed-width read has been added as-is. Maybe a check on whether the values read in are floats or strings would work? Pandas can infer the widths instead of specifying them; I have not had luck with that method, but maybe it's something that would be useful to troubleshoot.
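For reference, a sketch of the inferred-width approach with `pd.read_fwf(colspecs='infer')` on a made-up, well-aligned station table; real LMA headers may not line up this cleanly, which could be why inference hasn't worked so far.

```python
import io
import pandas as pd

# Hypothetical, well-aligned fixed-width station table (not a real LMA header).
text = (
    "ID  Name      Lat        Lon           Alt      Delay\n"
    "W   West      33.60000   -101.90000    980.00   0\n"
    "E   EastSite  33.65000   -101.80000    975.00   0\n"
)

# colspecs='infer' asks pandas to detect column edges from runs of
# whitespace shared across the sampled rows.
stations = pd.read_fwf(io.StringIO(text), colspecs='infer', header=0)
print(stations)
```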
Cool, I just swapped this in to my glueviz example (now added, too) and it works great. I also added some minimal packaging stuff so `pyxlma` can be installed as a package. Thanks!
Technically we've fulfilled the basics of this issue. Are we ready to start splitting out separate issues for specific things like the header i/o? Or keep this as an open issue until it's a bit more mature?
It's probably fine to split it into smaller issues and work on the specific things along the way.
I just played around with the glueviz example, looks like it will be useful!
Closing this issue, having opened #6 to continue the discussion about the remaining issue about the header format.
A minimal, Pandas-based reader for LMA files would be useful. An example of such a reader is below, with a basic selection within the dataframe that could easily be extended to other criteria.
Basic Pandas object from .dat.gz file
Use example: