ahotovec / REDPy

Repeating Earthquake Detector (Python)
GNU General Public License v3.0

path to local files in subdir #20

Open EvaEibl opened 1 year ago

EvaEibl commented 1 year ago

When I point searchdir at the top directory of my files, REDPy does not find the files (which are three levels of subfolders down) but instead fills 'flist' with folder names that cannot be read in. I had to manually copy and rearrange my files so that they sit just one level below the top directory.

The description just says: 'If using local files, define the path to the top directory where they exist, ending in / or \ as appropriate for your operating system. If there are files in any subdirectories within this directory, they will be found.'

ahotovec commented 1 year ago

Can you provide me with some additional details on what you're using for your search directory and file pattern? In trigger.py the piece of code that finds the files does a walk of the subfolders, but maybe there's an incompatibility with what you've told it to look for?
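Roughly, that file search amounts to something like the following sketch (illustrative only, not the exact trigger.py code):

```python
import fnmatch
import os

def build_flist(searchdir, filepattern):
    """Walk searchdir recursively and keep files matching filepattern."""
    flist = []
    for root, dirs, files in os.walk(searchdir):
        for name in fnmatch.filter(files, filepattern):
            flist.append(os.path.join(root, name))
    return flist

# e.g. build_flist('/path/to/files/MINISEED/2015/VI/', '*.D.*')
```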

EvaEibl commented 1 year ago

Thank you for your fast reply. I've tried to reproduce the error. I previously used

server=file
searchdir=/path/to/files/MINISEED/2015/VI/

and the mseed files were then in subfolders such as:

IEB/HHZ.D/VI.IEB..HHZ.D.2015.200
IEB/HHN.D/VI.IEB..HHN.D.2015.200
IEB/HHE.D/VI.IEB..HHE.D.2015.200
IEA/HHE.D/VI.IEA..HHE.D.2015.200
...

The flist now actually contains the mseed files (so I must have done something different this time). However, the code cannot read in this data and aborts with 'Could not download or trigger data... moving on'.

When I copy the same data into one folder, i.e. here the path would be

server=file
searchdir=/path/to/files/MINISEED/2015/VI/folder/

the mseed files sit directly in this folder:

VI.IEB..HHZ.D.2015.200
VI.IEB..HHN.D.2015.200
VI.IEB..HHE.D.2015.200
VI.IEA..HHE.D.2015.200

In that case the mseed data can be read in.

ahotovec commented 1 year ago

Can you try the first case (mseed files within the subfolders) with the -t flag on backfill.py? Adding this flag removes the try/except surrounding the data reading step and will give us a more detailed failure message than just that it couldn't complete. Let me know what that error message contains and it'll help me track down where the problem is.
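In other words, -t switches between roughly these two patterns (hypothetical names, just to illustrate the difference):

```python
def get_data():
    """Stand-in for the data reading step; raises on failure."""
    raise IOError('simulated read failure')

troubleshoot = False  # set True to mimic running backfill.py with -t

if troubleshoot:
    st = get_data()   # exception propagates, giving a full traceback
else:
    try:
        st = get_data()
    except Exception:
        print('Could not download or trigger data... moving on')
```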

EvaEibl commented 1 year ago

The error is:

Traceback (most recent call last):
  File "backfill.py", line 106, in <module>
    st, stC = redpy.trigger.getData(tstart+n*opt.nsec-opt.atrig, endtime, opt)
  File "/data/REDPy/redpy/trigger.py", line 55, in getData
    stmp = obspy.read(f, headonly=True)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/obspy/core/util/decorator.py", line 291, in _map_example_filename
    return func(*args, **kwargs)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/obspy/core/stream.py", line 208, in read
    st = _generic_reader(pathname_or_url, _read, **kwargs)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/obspy/core/util/base.py", line 657, in _generic_reader
    generic = callback_func(pathnames[0], **kwargs)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/site-packages/obspy/core/util/decorator.py", line 148, in uncompress_file
    if tarfile.is_tarfile(filename):
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/tarfile.py", line 2442, in is_tarfile
    t = open(name)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/tarfile.py", line 1575, in open
    return func(name, "r", fileobj, **kwargs)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/tarfile.py", line 1639, in gzopen
    fileobj = GzipFile(name, mode + "b", compresslevel, fileobj)
  File "/home/eibl/miniconda3/envs/redpy/lib/python3.7/gzip.py", line 168, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
IsADirectoryError: [Errno 21] Is a directory: '/path/to/files/MINISEED/2015/VI/IEA'

Closing remaining open files:redpytable.h5...done
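(As a hedged aside: the IsADirectoryError means obspy.read() was handed a directory path from flist. A guard like the sketch below, which is not REDPy's actual code, would skip such entries before the header read.)

```python
import os
import obspy

flist = ['/path/to/files/MINISEED/2015/VI/IEA',   # a directory: obspy.read() fails on this
         '/path/to/files/MINISEED/2015/VI/IEA/HHE.D/VI.IEA..HHE.D.2015.200']

for f in flist:
    if not os.path.isfile(f):
        continue                            # skip directories (and missing paths)
    stmp = obspy.read(f, headonly=True)     # header-only read, as in trigger.py line 55
```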

ahotovec commented 1 year ago

Ok, seems that it's complaining that the first item in the list is a directory and it can't read it. In the .cfg file, let's try adding filepattern='*.D.*' as I believe all of the mseed files should contain that and none of the folders will...
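Just to illustrate: none of the folder names in this layout contain '.D.' while every mseed filename does, so a glob-style pattern separates them (a sketch, not necessarily the exact matching REDPy does):

```python
import fnmatch

names = ['IEA', 'HHZ.D', 'VI.IEB..HHZ.D.2015.200']
print([n for n in names if fnmatch.fnmatch(n, '*.D.*')])
# -> ['VI.IEB..HHZ.D.2015.200']
```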

EvaEibl commented 1 year ago

Adding this, it just goes like:

2015-09-28T00:00:00.000000Z
Couldn't find JOA.HHZ.VI.
Couldn't find JOB.HHZ.VI.
Couldn't find JOD.HHZ.VI.
Couldn't find JOE.HHZ.VI.
Couldn't find JOF.HHZ.VI.
Couldn't find JOG.HHZ.VI.
Couldn't find JOK.HHZ.VI.
Couldn't find IEA.HHZ.VI.
Couldn't find IEB.HHZ.VI.
Couldn't find IED.HHZ.VI.
Couldn't find IEE.HHZ.VI.
Couldn't find IEF.HHZ.VI.
Couldn't find IEG.HHZ.VI.
Couldn't find IEY.HHZ.VI.
Length of Orphan table: 13
Time spent this iteration: 0.0069476922353108725 minutes

ahotovec commented 1 year ago

And I take it that putting these in the top directory does find the data correctly? I suppose we should also verify that flist does actually contain the filenames of all the data.

EvaEibl commented 1 year ago

It finds the files if I remove the inverted commas, i.e. filepattern=*.D.* without the quotes. However, with the copied data in the top directory I get some results after 1.5 minutes, whereas with the data in subfolders it seems to get stuck somewhere; nothing has happened for 10 minutes now.

ahotovec commented 1 year ago

Ah, yes, without the quotes. When you moved your files to the top directory, did you move all of them? I'm wondering if there are a lot of files it's trying to read through. I'll readily admit that the way REDPy parses through files on disk is not very efficient.

A path we might consider going down instead is setting up a portable FDSN. It's got a bunch of setup associated with it but once it's going it'll probably be the fastest way to query your data, and might be useful outside of REDPy as well. If you'd like to try this, send me an email (ahotovec-ellis@usgs.gov) and I'll forward you some notes on installing and setting it up from one of my colleagues.

EvaEibl commented 1 year ago

Ok, I see. Yes, there are a lot of files in the original folders. Since we have expertise using pyrocko in our group, I think it might be easier to use the pyrocko pile for the reading in (or just copy the event data I want to analyse for the moment) than to set up a portable FDSN for this dataset.
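A hedged sketch of what reading a time window through a pyrocko pile could look like for this directory tree (not part of REDPy; the path is the placeholder from above and the times are assumed for illustration):

```python
from pyrocko import pile, util

# Scan the directory tree once; files in subfolders are indexed automatically.
p = pile.make_pile('/path/to/files/MINISEED/2015/VI/')

tmin = util.str_to_time('2015-07-19 00:00:00')  # day 200 of 2015
tmax = util.str_to_time('2015-07-19 01:00:00')

# Iterate over the traces that overlap the requested window.
for traces in p.chopper(tmin=tmin, tmax=tmax):
    for tr in traces:
        print(tr.network, tr.station, tr.channel, tr.tmin, tr.tmax)
```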

ahotovec commented 1 year ago

Ok, let me know how that goes. I don't usually work with lots of data in files on disk, and tend to favor waveservers and webservices. I've had folks that have their files in directories sorted by date use shell scripts to change the filepath based on what time they are processing to reduce the number of files that REDPy needs to search through. I have some other ideas on better ways to handle it but haven't had a chance to test/implement them.

ahotovec commented 1 year ago

Just putting a quick update here that I've been picking at this issue while "cleaning up" the code. In the branch "cleanup" there is new code that builds a file index of all the files in the data search directory once, and uses it to know which of those files to read at each time step rather than redoing the query every step. I've also added options to load a few days of that data into memory for faster access. I've tested it with both large mseed volumes (~1 GB each per channel, containing several months of data each) and ~35k individual sac files from that same time span. It probably isn't as optimized as using a local waveserver, but it's orders of magnitude more efficient now.
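The general idea is something like the following sketch, i.e. record each file's time span once and then look files up per processing window (illustrative only, not the code in the 'cleanup' branch):

```python
import fnmatch
import os
import obspy

def build_index(searchdir, filepattern):
    """Scan once: record (starttime, endtime, path) for every matching file."""
    index = []
    for root, dirs, files in os.walk(searchdir):
        for name in fnmatch.filter(files, filepattern):
            path = os.path.join(root, name)
            st = obspy.read(path, headonly=True)   # headers only, no waveform data
            index.append((min(tr.stats.starttime for tr in st),
                          max(tr.stats.endtime for tr in st),
                          path))
    return index

def files_for_window(index, t1, t2):
    """Return only the files whose time span overlaps the requested window."""
    return [path for (s, e, path) in index if s <= t2 and e >= t1]
```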

I'll probably close this issue when I pull 'cleanup' into the 'master' branch. I'd love it if you could test the new code on your dataset and let me know how it works, and what I can improve to align with your use case.

Were you able to get pyrocko to work?