deeplycloudy / xlma-python

A future, Python-based version of xlma?
MIT License

Pandas-based LMA datafile reader #2

Closed vbalderdash closed 4 years ago

vbalderdash commented 5 years ago

A minimal, Pandas-based reader for LMA files would be useful. An example of such a method is below with a basic selection within the dataframe, which could be easily extended to other criteria.

Basic Pandas object from .dat.gz file

import pandas as pd
import gzip

class lmafile:
    def __init__(self,filename):
        self.file = filename

    def datastart(self):
        # Line number of the "*** data ***" marker that precedes the data
        with gzip.open(self.file,'rt') as f:
            for line_no, line in enumerate(f):
                if line.rstrip() == "*** data ***":
                    break
        return line_no

    def readfile(self):
        lmad = pd.read_csv(self.file,delim_whitespace=True,header=None,
                           compression='gzip',skiprows=self.datastart()+1)
        lmad.columns = ['time','lat','lon','alt','chi','num','p','charge','mask']
        return lmad

Use example:

max_chi  = 1.0
min_station  = 7.0

lma_file = [.dat.gz data file]
lmad = lmafile(lma_file).readfile()

tstart = [minute of day, but should probably be switched to datetime object]
tend   = [minute of day]

sources = lmad[(lmad['time'] >= tstart)  & (lmad['time'] < tend) & 
               (lmad['chi']  <= max_chi)   & (lmad['num'] >= min_station)]
deeplycloudy commented 5 years ago

OK, this is really straightforward. Some next things to add would be:

  1. Grab the start date from the datastart method; return a numpy datetime64 tstart, tend
  2. Convert the time column to a timedelta64: probably tsec = lmad['time'].astype('timedelta64[s]')
  3. Reassign the time column as tstart+tsec
  4. Calculate the station count from the mask using the function that's already in lmatools
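Steps 1-3 might look something like the following sketch; the header line and the sample seconds values are illustrative, not taken from a real file:

```python
import datetime as dt
import pandas as pd

# Hypothetical "Data start time:" header line (illustrative values)
header_line = "Data start time: 06/21/19 03:30:00"
datestr, timestr = header_line.split()[-2:]
tstart = dt.datetime.strptime(datestr + timestr, '%m/%d/%y%H:%M:%S')

# Step 2: seconds-of-day column -> timedelta64
lmad = pd.DataFrame({'time': [12599.289137587, 12599.999582905]})
tsec = pd.to_timedelta(lmad['time'], unit='s')

# Step 3: reassign the time column as absolute datetimes
lmad['time'] = tstart + tsec
```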
vbalderdash commented 5 years ago

Per number 4: the number of stations contributing to each source should already be archived in the LMA file, right? Or are you thinking of a selection based on the list of which stations were contributing?

deeplycloudy commented 5 years ago

Not from all networks. OKLMA data I used back in the day had the station count, but you have to count the bits for the WTLMA data and others I've seen.

Data: time (UT sec of day), lat, lon, alt(m), reduced chi^2, P(dBW), mask
Data format: 15.9f 12.8f 13.8f 9.2f 6.2f 5.1f 5x
Number of events: 257422
*** data ***
12599.289137587   3.16553416 -113.07229594 257639475.42  47.17  70.4 0x364
12599.999582905  28.83720144 -112.52882903 746668.56  38.71  19.7 0x724
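For files like these, where the station count isn't its own column, the count can be recovered by counting the set bits of the hex mask. A minimal sketch using the two sample masks above:

```python
def station_count(mask_str):
    # Each set bit in the hex station mask marks one contributing station
    return bin(int(mask_str, 16)).count('1')

station_count('0x364')  # 868 = 0b1101100100 -> 5 stations
station_count('0x724')  # 1828 = 0b11100100100 -> 5 stations
```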
deeplycloudy commented 5 years ago

Given the differences in the two formats, it seems that we actually need to parse on the basis of the Data: line above.

vbalderdash commented 5 years ago

Alright, I've been staring at the one format too long! That should be an easy adjustment.

vbalderdash commented 5 years ago

Here's a snippet to automatically pull the column headers from the file header. I'm sure there's a more fundamental method to parse the text line from the zipped file, but this does get the job done.

def datastart(self):
    # Line number of the "*** data ***" marker in an uncompressed file
    with open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.rstrip() == "*** data ***":
                break
    return line_no

def dataformat(self):
    # Pull the column names from the "Data:" header line
    with open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.startswith('Data:'):
                break
    names = pd.read_csv(self.file,
                        header=None,skiprows=line_no,nrows=1)
    names = names.values.tolist()[0]
    names[0] = names[0][6:]  # strip the leading "Data: "
    return [x.strip(' ') for x in names]

def dataformat_gz(self):
    with gzip.open(self.file) as f:
        for line_no, line in enumerate(f):
            if line.startswith(b'Data:'):
                break
    names = pd.read_csv(self.file,compression='gzip',
                        header=None,skiprows=line_no,nrows=1)
    names = names.values.tolist()[0]
    names[0] = names[0][6:]
    return [x.strip(' ') for x in names]

def datastart_gz(self):
    with gzip.open(self.file,'rt') as f:
        for line_no, line in enumerate(f):
            if line.rstrip() == "*** data ***":
                break
    return line_no

def readfile(self):
    if self.file.endswith('.dat.gz'):
        lmad = pd.read_csv(self.file,compression='gzip',delim_whitespace=True,
                           header=None,skiprows=self.datastart_gz()+1)
        columns = self.dataformat_gz()
    elif self.file.endswith('.dat'):
        lmad = pd.read_csv(self.file,delim_whitespace=True,header=None,
                           skiprows=self.datastart()+1)
        columns = self.dataformat()
    lmad.columns = columns
    return lmad
vbalderdash commented 4 years ago

Added some pieces for pulling more of the header information via Pandas and for separating out all of the stations in the mask using the mask_to_int function from lmatools. Time issues 1-3 are addressed by the addition of a new column called 'Datetime'.

Potential issue: header information (station locations, % contributions, etc.) is read in with a fixed-width format, since station names may contain white space. I'm unsure how consistent those widths are among existing file formats.

import datetime as dt
import gzip

import numpy as np
import pandas as pd

def mask_to_int(mask):
    """Convert an object array of mask strings to integers."""
    if len(mask.shape) == 0:
        mask_int = np.asarray([], dtype=int)
    else:
        try:
            # mask is a plain integer
            mask_int = np.fromiter((int(v) for v in mask), int)
        except ValueError:
            # mask is a string representing a base-16 (hex) number
            mask_int = np.fromiter((int(v,16) for v in mask), int)
    return mask_int

class lmafile:
    def __init__(self,filename):
        self.file = filename

        with gzip.open(self.file) as f: 
            for line_no, line in enumerate(f):
                if line.startswith(b'Data start time:'):
                    timestring = line.decode().split()[-2:]
                    self.startday = dt.datetime.strptime(timestring[0],'%m/%d/%y')
                    # Full start time and second, might be unneeded
                    # self.starttime = dt.datetime.strptime(timestring[0]+timestring[1],'%m/%d/%y%H:%M:%S')
                    # self.startsecond = (starttime-dt.datetime(starttime.year,starttime.month,starttime.day)).seconds
                # Find rows for station information
                if line.startswith(b'Station information:'):
                    self.station_info_start = line_no
                if line.startswith(b'Station data:'):
                    self.station_data_start = line_no
                if line.startswith(b'Metric file:'):
                    self.station_data_end = line_no
                # Find mask list
                if line.startswith(b'Station mask order:'): 
                    self.maskorder = line.decode().split()[-1]
                # Pull data header
                if line.startswith(b'Data:'): 
                    self.names = [x.strip(' ') for x in line.decode()[5:-1].split(",")]
                # Find start of the data
                if line.rstrip() == b"*** data ***":
                    break
        f.close()
        self.data_starts = line_no

        # Station overview information
        self.overview = pd.read_fwf(self.file,compression='gzip',
                colspecs=[[10,11],[13,30],[30,35],[35,43],[43,48],[48,56],[56,61],[61,68],[68,73]],
                names=['ID', 'Name','win(us)', 'dec_win(us)', 'data_ver', 'rms_error(ns)', 
                       'sources','<P/P_m>','active'],
                header=None,skiprows=self.station_data_start+1,
                nrows=self.station_data_end-self.station_data_start-1)
        # Station Locations
        self.stations = pd.read_fwf(self.file,compression='gzip',
                                    colspecs=[[10,11],[13,32],[32,43],[44,56],[56,66],[66,70]],
                                    names=['ID', 'Name','Lat','Long','Alt','Delay Time'],
                                    header=None,skiprows=self.station_info_start+1, 
                                    nrows=self.station_data_start-self.station_info_start-1)

    def readfile(self):
        # Read in data
        lmad = pd.read_csv(self.file,compression='gzip',delim_whitespace=True,
                            header=None,skiprows=self.data_starts+1)
        lmad.columns = self.names

        # Convert seconds column to new datetime-formatted column
        lmad.insert(1,'Datetime',pd.to_timedelta(lmad['time (UT sec of day)'], unit='s')+self.startday)

        # Parse out which stations contributed into new columns for each station
        for index,items in enumerate(self.maskorder[::-1]):
            lmad.insert(8,items,(mask_to_int(lmad["mask"])>>index)%2)
        # Count the number of stations contributing and put in a new column
        lmad.insert(8,'Station Count',lmad[list(self.maskorder)].sum(axis=1))
        return lmad
deeplycloudy commented 4 years ago

Nice, thanks! Could you add a complete version of your reader class in something like xlma-python/pyxlma/lmalib/io/read.py?

pyxlma will live at the same level as the main README, and will be the eventual package name. We'll probably need to rearrange/rename later. lmalib is to separate the plot-agnostic calculations and readers from any GUI or plotting stuff that makes use of data processed in lmalib.

deeplycloudy commented 4 years ago

Regarding the header format, yeah – there's no Data format: 15.9f 12.8f 13.8f 9.2f 6.2f 5.1f 7x line like there is for the main data section. I think we just live with it. Maybe there are some simple checks that could be done to throw a warning / error if, say, lon lat for stations aren't close to the coord center.
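A sketch of one such check, assuming the network center lat/lon have already been parsed from the header; the 2-degree threshold and the station values here are made up:

```python
import warnings

import pandas as pd

def check_station_locations(stations, ctr_lat, ctr_lon, max_deg=2.0):
    """Warn if any station sits implausibly far from the network center,
    which usually means the fixed-width column spans were wrong."""
    bad = stations[(abs(stations['Lat'] - ctr_lat) > max_deg) |
                   (abs(stations['Long'] - ctr_lon) > max_deg)]
    if len(bad) > 0:
        warnings.warn(f"{len(bad)} station(s) far from the network center; "
                      "check the header column widths")

# Illustrative station table: the second Lat mimics a mis-parsed column
stations = pd.DataFrame({'Lat': [33.61, 257.64], 'Long': [-101.87, -101.88]})
check_station_locations(stations, ctr_lat=33.6, ctr_lon=-101.9)  # warns once
```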

vbalderdash commented 4 years ago

Given the lack of a format line for the header, the fixed-width read has been added as-is. Maybe a check on whether the values read in are floats/strings would work? Pandas can also infer the widths instead of specifying them; I have not had luck with that method, but it might be worth troubleshooting.
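For reference, pd.read_fwf accepts colspecs='infer' (the default), which guesses the column breaks from rows where whitespace lines up, so whether it splits multi-word station names correctly depends entirely on that alignment. A toy sketch with made-up station rows:

```python
import io

import pandas as pd

# Made-up station rows shaped roughly like an LMA header section
header_text = (
    "Sta_info: A  Abernathy        33.67  -101.85   1020.0   0.0\n"
    "Sta_info: B  Big Spring       32.23  -101.46    770.0   0.0\n"
)
# colspecs='infer' marks a column break wherever every sampled row has a
# space, so a name like "Big Spring" survives only if another row has
# non-space text at the position of its internal space
stations = pd.read_fwf(io.StringIO(header_text), colspecs='infer', header=None)
```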

deeplycloudy commented 4 years ago

Cool, I just swapped this in to my glueviz example (now added, too) and it works great. I also added some minimal packaging stuff so pyxlma can be installed as a package. Thanks!

Technically we've fulfilled the basics of this issue. Are we ready to start splitting out a separate issues for specific things like the header i/o? Or keep this as an open issue until it's a bit more mature?

vbalderdash commented 4 years ago

It's probably fine to split it into smaller issues and work on the specific things along the way.

I just played around with the glueviz example, looks like it will be useful!

deeplycloudy commented 4 years ago

Closing this issue, having opened #6 to continue the discussion of the remaining header-format question.