d-chambers / Detex

A Python package for subspace detection and waveform similarity clustering

Detex not reading continuous waveform data #31

Open SladeBeta opened 8 years ago

SladeBeta commented 8 years ago

Trying to run SVD or detections on data has resulted in Detex being unable to read the continuous data files.

I have tried running the data in pickle and mseed format. Neither has worked properly.

Using data that was pulled by previous versions of Detex, I received this error (file format: pickle):

screenshot 2016-03-02 11 37 22


When trying to run SVD on the older data, it skips all waveforms and ends up with no data to run SVD on.

screenshot 2016-03-02 11 59 43


Thinking that there might be something wrong with the continuous waveform data, I downloaded a month's worth of continuous data to a new directory. This was completed in both pickle and mseed format using Detex 1.0.6. After the download completed, Detex began to auto-index the ContinuousWaveform directory with this result:

screenshot 2016-03-02 15 33 30


I tried running SVD after terminating the auto-indexing (Detex then tried to index again), with this result:

screenshot 2016-03-02 15 36 47


Just to try it, I created a subspace with the new data:

screenshot 2016-03-02 15 37 02


I have been able to successfully use detex.pickTimes() and have been able to see those waveforms. Detex has also had no problem reading either pickle or mseed format EventWaveforms. The clusters have been produced without error. I am using fillZeros=True, but all other parameters (minus directory location variables) have remained unchanged.

My TemplateKey: TemplateKey.txt

My Station Key: StationKey.txt

*Both have been switched to ".txt" for uploading.

*If you choose to download the data, the four stations will be about 9 GB of data for the single month. The event waveforms take up approximately 50 MB of disk space.

d-chambers commented 8 years ago

OK, it looks like the problem is caused by detex splitting each part of the path apart to store less data in the SQLite database; this is a crude representation of an enum (enumerated type).

Let me walk you through my reasoning for doing this, then we can think about fixes.

If we had a path /eq/POTTER/ContinuousWaveForms/120/event.mseed we could store the whole path in the SQLite table where the data file info is stored, but then the path field would need around 46 chars. If we assume ASCII chars (although they might be Unicode, I am not sure) that would require around 46 bytes. However, a lot of other events probably share much of the same path, so why not just store a single integer reference to each part of the path?

So in the "indkey" table you could have something like:

| 0  | 1      | 2                   | 3    | 4             |
|----|--------|---------------------|------|---------------|
| eq | POTTER | ContinuousWaveForms | 2010 | 120, 121, 122 |

Then the fields in the "ind" table (one row for each data file) look like this:

| path      | name         |
|-----------|--------------|
| 0,0,0,0,0 | event.mseed  |
| 0,0,0,0,0 | event1.mseed |

To reconstruct the path, we take the integer list and os.path.join the corresponding entries from the "indkey" table: os.path.join(indkey[0][0], indkey[1][0], indkey[2][0], indkey[3][0], indkey[4][0]) = eq/POTTER/ContinuousWaveForms/2010/120. This allows us to save only 5 ints to represent the entire 38-char path. The savings could be big if you have a lot of data files.
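The scheme above can be sketched in plain Python. This is an illustrative reimplementation, not the actual detex internals; `build_index`, `rebuild_path`, and their variable names are made up for the sketch:

```python
import os


def build_index(paths):
    """Split each path into parts and replace parts with per-depth integer ids."""
    indkey = []  # indkey[depth] maps a directory name -> integer id
    rows = []    # one (id_list, filename) row per data file
    for p in paths:
        *dirs, fname = p.strip(os.sep).split(os.sep)
        ids = []
        for depth, name in enumerate(dirs):
            if depth >= len(indkey):
                indkey.append({})
            # reuse the existing id for this name, or assign the next one
            ids.append(indkey[depth].setdefault(name, len(indkey[depth])))
        rows.append((ids, fname))
    return indkey, rows


def rebuild_path(indkey, ids, fname, absolute=True):
    """Invert the per-depth mappings and os.path.join the parts back together."""
    names = []
    for depth, i in enumerate(ids):
        inv = {v: k for k, v in indkey[depth].items()}  # id -> name
        names.append(inv[i])
    root = os.sep if absolute else ''
    return os.path.join(root, *names, fname)
```

Storing a short integer list per file instead of the full string is where the space savings come from; the trade-off is the reconstruction step, which is exactly where a dropped leading separator bites.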

One thing I have never checked on, however, is whether the sqlite database already does this type of compression (or a better type). If it does, I wasted my time and made the code unnecessarily complicated, so I will check on this and consider a revision.

So the problem in your case is in the reconstruction. In the example above we recover the path "eq/POTTER/ContinuousWaveForms/2010/120", which looks like it should exist but actually doesn't; it needs a "/" at the beginning, so the correct path is "/eq/POTTER/ContinuousWaveForms/2010/120". If you are running detex from the same relative path the index was created in, there is no issue, but in your case there is. A quick fix is to add another try/except block in the _checkQuality function of getdata to try reading the path with os.sep prepended. It looks like so:

```python
import os

import numpy as np
import obspy


def _checkQuality(stPath):
    """
    Load a path to an obspy stream and check quality.
    """
    try:
        st = obspy.read(stPath)
    except (TypeError, IOError):  # object is not obspy-readable
        try:  # try again with os.sep prepended to the path
            st = obspy.read(os.path.join(os.path.sep, stPath))
        except (TypeError, IOError):
            return None
    lengthStream = len(st)
    gaps = st.get_gaps()
    gapsum = np.sum([x[-2] for x in gaps])
    starttime = min([x.stats.starttime.timestamp for x in st])
    endtime = max([x.stats.endtime.timestamp for x in st])
    duration = endtime - starttime
    nc = len(list(set([x.stats.channel for x in st])))
    netsta = st[0].stats.network + '.' + st[0].stats.station
    outDict = {'Gaps': gapsum, 'Starttime': starttime, 'Endtime': endtime,
               'Duration': duration, 'Nc': nc, 'Nt': lengthStream,
               'Station': netsta}
    return outDict
```
There could, however, be a bigger issue if multiple people are using the same data directory and the index created is specific to the user. In that case, the indexDirectory function might need to use abspath rather than normpath:

```python
def indexDirectory(dirPath):
    """
    Create an index (.index.db) for a directory of stored waveform files
    which also contains quality info for each file.

    Parameters
    ----------
    dirPath : str
        The path to the directory containing waveform data (any structure)
    """
    columns = ['Path', 'FileName', 'Starttime', 'Endtime', 'Gaps', 'Nc', 'Nt',
               'Duration', 'Station']
    df = pd.DataFrame(columns=columns)  # DataFrame for indexing
    msg = 'indexing, or updating index for %s' % dirPath
    detex.log(__name__, msg, level='info', pri=True)

    # Create a list of possible path permutations to save space in database
    pathList = []  # A list of lists with different path permutations
    for dirpath, dirnames, filenames in os.walk(dirPath):
        dirList = os.path.abspath(dirpath).split(os.path.sep)
...
```
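The abspath/normpath distinction can be seen with a quick stdlib check: normpath only cleans up the string, while abspath also anchors it to the current working directory, so only abspath survives a change of directory (the example path is just for illustration):

```python
import os

rel = 'eq/POTTER/ContinuousWaveForms/2010/120'

# normpath never consults the filesystem; a relative path stays relative,
# so an index built with it breaks when read from a different directory.
print(os.path.normpath(rel))   # still relative

# abspath prepends the current working directory before normalizing,
# so the stored path works no matter where detex is later run from.
print(os.path.abspath(rel))    # begins with os.sep
```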
d-chambers commented 8 years ago

Looks like SQLite doesn't do compression without proprietary plugins: http://stackoverflow.com/questions/10824347/does-sqlite3-compress-data
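A quick stdlib way to confirm this: insert a highly compressible string into a fresh database and check that the file grows by roughly the payload size (a compressing store would shrink it to well under a kilobyte). This is just a sanity check, not part of detex:

```python
import os
import sqlite3
import tempfile

# ~210 kB of extremely repetitive text; gzip would crush this to < 1 kB.
payload = 'eq/POTTER/ContinuousWaveForms/' * 7000

with tempfile.TemporaryDirectory() as tmp:
    db = os.path.join(tmp, 'index.db')
    con = sqlite3.connect(db)
    con.execute('CREATE TABLE ind (path TEXT)')
    con.execute('INSERT INTO ind VALUES (?)', (payload,))
    con.commit()
    con.close()
    # SQLite stores the TEXT verbatim, so the file is at least payload-sized.
    size = os.path.getsize(db)
    print(size)
```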