Open timronan opened 4 years ago
I updated the test so the attached zip file now only contains one master and one mini ph5 files. Please run ph5tostationxml and the stationxml-validator-1.7.0 and a 111 error will be output. This demonstrates that PH5 is generating station level metadata based on the geographic location of a channel rather than the kef:seed_station_name_s and the time span of the data.
Please run ph5tokef to verify this result.
Seems legit. Are you going to be able to tackle this? Now that Nick is gone who is going to assume the service/output side of PH5?
I will look into it. I am a little bit worried about this issue because it may embedded in the core logic of at least ph5tostationxml. I will report after getting a better understanding of the issue.
If the logic in ph5.clients.ph5tostationxml.get_station_key (line 233) was updated from
def get_station_key(self, station_code, start_date, end_date,
sta_longitude, sta_latitude, sta_elevation, site_name):
return ".".join([str(x) for x in
[station_code, start_date.isoformat(),
end_date.isoformat(), sta_longitude, sta_latitude,
sta_elevation, site_name]])
to
def get_station_key(self, station_code, start_date, end_date, site_name):
return ".".join([str(x) for x in
[station_code, start_date.isoformat(),
end_date.isoformat(), site_name]])
than channels correctly associate with their parent station (stations are not duplicated), but we cannot determine which of the channel coordinate sets should be assigned to the parent station.
Station and channel level coordinates should be independent of one another in seismic metadata, but the ph5/kef logic groups some station and channel information (including geographic information) into one level. Could it be possible to create additional columns in the sorts group so the Data Acquisition System (DAS) and sensor location information could be separated? The DAS location would be assigned to the station level coordinates while the sensor location could be assigned to the channel level coordinates. get_station_key
could use the station code, epoch and DAS locations, which would be added to every kef Table row, to associate stations and channels. Data conflicts would likely still occur but they could be resolved using the information in get_station_key
by changing either the station code or the DAS geographic information.
Are there any thoughts on how to address this issue?
@timronan , I haven' had a chance to dive into this. You're welcome to create a PR to address.
I don't think that changing the PH5 structure would be a good idea at this point, with the plans to develop a more general HDF5 format for seismic and other data, to address this and other issues.
Would it be possible to change the library such that it doesn't map based on location, but on the seed codes within the array tables, and without changing the structure? I'm a bit confused why this was done to begin with.
Only problem is that stationxmltoph5 produces this error: [2020-05-20 23:20:51,895] - ph5.clients.ph5tostationxml - ERROR: list index out of range Why does the above error occur?
This maybe caused by other "master.ph5"s in the dir tree it was run. This was fixed, and a more descriptive error message in recent PR's.
PH5 already uses the SEED codes and the epoch from the array table. The problem is that we can't determine which latitude and longitude need to be assigned to the station that would adopt multiple channels. We may end up with a station with incorrect coordinates.
If a PH5 archive contains stations that have individual sensors located at slightly different (still within the 1 km traditional SEED convention) coordinates, where sensors had the same time span, and each sensor had the same channel name and its own loc code (01, 02, etc.), it looks like current logic would output a stationXML in violation of rule 111?
@hrotman-pic, yes if a kef file contains multiple channels with the same station information but with different coordinates, stationxml creates a station for each of the channels rather than combining the channels into one parent station. Having multiple stations with the same station code and epoch leads to a 111 stationxml validation error. It is incorrect to duplicate station information in a stationxml file. ph5tostationxml associates based on geographic coordinates rather than time. The rest of the IRIS infrastructure uses time association logic.
Thanks, Tim. I thought that's what would happen, but wanted to confirm. There's an archive with stations like those I described in-progress.
@hrotman-pic you should consider stopping that build, running ph5tostationxml on the current built metadata, and then validate the output stationxml. If that validator throws 111 errors on that experiment it probably should be archived in a format other than PH5.
This data format decision seems unfixable but this issue should be left posted as a travelers advisory.
Describe the bug ph5tostationxml maps
kef:seed_station_name_s
tostationxml:station:code
based on geographic coordinates rather than time span. Typos in the coordinates can lead to output stationxml files that have multiple stations with the same code and epoch. This violates rule 111 of the stationxml validator.This is also important because it is feasible to have a channel that has different geographic coordinates than it's parent station but PH5's geographic spatial mapping logic limits these use case.
Environment (please complete the following information):
To Reproduce Generate a master.ph5 file from the attached kef. Run ph5tostationxml using the master.ph5 file. Only problem is that stationxmltoph5 produces this error: [2020-05-20 23:20:51,895] - ph5.clients.ph5tostationxml - ERROR: list index out of range Why does the above error occur?
These steps may have to be done using both master.ph5 and mini*.ph5. Generate kef file from master, change one coordinate in table row with the same seed_station_name_s as another table row. Run keftoph5, Run ph5tostationxml.
Expected behavior To harden against this bug, ph5tostationxml needs to map kef:seed_station_name_s to
stationxml:station:code based
on time-span. All table rows with the samekef:seed_station_name_s
that have overlapping or encompassed time-spans (using the extent from the earliest startDate to the latest endDate is the correct logic) need to be mapped to one station in stationxml, and all associated channels are adopted by this station.spatialMapping.zip