[BUG] pforma is sorting SmartSolo data by file name, not das serial number

ascire-pic commented 2 years ago

pforma sorts SmartSolo SEGD data into processing folders by file name only, without checking the headers to determine the station ID. This is a problem because PH5 uses the line + station ID to create the das_serial_number field (e.g. 1X3) used in the das tables and array table to identify the data. As a result, if more than one SmartSolo was deployed at the same station in a rolling deployment, the data for that das_serial_number can end up in multiple mini-ph5 files, which does not work.

This problem only occurs when loading data with pforma. Loading data from the command line with segdtoph5 does not have the same problem, as segdtoph5 appears to check the station ID to determine which mini-ph5 file the data belongs to.

pforma v2021.84

2 example data sets on Passcal's ph5 server:

/external/ph5/Working/test-alissa/Rolling_Deployment_Test/Rolling_1-3-1
Raw data in folders Rolling_Deployment_Run_1 & Rolling_Deployment_Run_2
Deployment details:
Deployment 1: 453013373 @ 1X1, 453012520 @ 1X2
Deployment 2: 453013692 @ 1X1, 453012571 @ 1X2
Deployment 3: 453013373 @ 1X1, 453012520 @ 1X2

pforma output:
Files were sorted into 4 processing directories, A, B, C, D. Final merged PH5 has 4 mini-ph5 files.
Data from 2520 was processed in A, ended up in miniPH5_00001.ph5 with das_serial_number 1X2
Data from 2571 was processed in B, ended up in miniPH5_00002.ph5 with das_serial_number 1X2
Data from 3373 was processed in C, ended up in miniPH5_00003.ph5 with das_serial_number 1X1
Data from 3692 was processed in D, ended up in miniPH5_00004.ph5 with das_serial_number 1X1

/external/ph5/Working/test-alissa/Rolling_Deployment_Test/Rolling_1-3-2
Raw data in folders First_Deployment & Second_Deployment
Deployment details:
Deployment 1: 453013373 @ 1X1, 453012520 @ 1X2
Deployment 2: 453013692 @ 1X1, 453012571 @ 1X2
Deployment 3: 453012520 @ 1X1, 453013373 @ 1X2

pforma output:
Files were sorted into 4 processing directories, A, B, C, D. Final merged PH5 has 4 mini-ph5 files.
Data from 2520 was processed in A, ended up in miniPH5_00001.ph5 with das_serial_number 1X2 & 1X1
Data from 2571 was processed in B, ended up in miniPH5_00002.ph5 with das_serial_number 1X2
Data from 3373 was processed in C, ended up in miniPH5_00003.ph5 with das_serial_number 1X1 & 1X2
Data from 3692 was processed in D, ended up in miniPH5_00004.ph5 with das_serial_number 1X1

Expected behavior: The SmartSolo file names only contain the device serial number (e.g. 453012571.1.2022.01.21.20.07.26.000.Z.segd). Since PH5 uses the line and station ID from the SEGD headers to create the das_serial_number used in the das table and array tables, pforma should check the headers and divide the data among the processing directories by station ID instead of by file name alone.
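The difference between the two groupings could be sketched roughly as below. This is only an illustration, not pforma's code: `serial_from_name`, `group_files`, and `station_lookup` are hypothetical names, the file names are made up to follow the SmartSolo pattern, and the lookup table stands in for reading the line and station ID from the SEGD headers. The serial-to-station assignments are those of Deployments 1 and 3 in the Rolling_1-3-2 data set.

```python
import os
from collections import defaultdict


def serial_from_name(path):
    """Return the device serial number: the first dot-separated field of the file name."""
    return os.path.basename(path).split(".")[0]


def group_files(paths, key):
    """Group paths by key(path); each group would map to one processing directory."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return dict(groups)


# Hypothetical file names. The values stand in for the line + station ID that
# would come from the SEGD headers (Rolling_1-3-2, deployments 1 and 3).
station_lookup = {
    "453012520.1.2022.01.21.20.07.26.000.Z.segd": "1X2",  # deployment 1
    "453013373.1.2022.01.21.20.07.26.000.Z.segd": "1X1",  # deployment 1
    "453012520.2.2022.01.23.20.07.26.000.Z.segd": "1X1",  # deployment 3
    "453013373.2.2022.01.23.20.07.26.000.Z.segd": "1X2",  # deployment 3
}
paths = list(station_lookup)

# Grouping by file name splits each das_serial_number across two groups.
by_serial = group_files(paths, serial_from_name)

# Grouping by station ID keeps each das_serial_number in a single group.
by_station = group_files(paths, lambda p: station_lookup[os.path.basename(p)])
```

With the serial-number key, each group mixes 1X1 and 1X2 data, which matches the pforma output reported above. With the station key, all 1X1 data lands in one group regardless of which node recorded it.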

hrotman-pic commented 1 year ago

If the solution I suggested, a version of unsimpleton for SmartSolo data, is used to address this, an example of desired resulting filename format is:

SSolo_<array>_<station>_<Node S/N>_<file number>_<channel>.segd

For example: 453005500.0001.2020.07.02.15.19.44.000.E.segd --> SSolo_1_2105_453005500_0001_E.segd

Desired usage example: unsimpleton -f <input list> -d <directory for new filenames> --hardlinks

If possible, I'd like to keep the pforma functionality for ingesting the current SmartSolo naming, because not all experiments need this. Because SmartSolo experiments have more input files than Fairfield experiments of comparable duration and number of stations, I expect running this tool will take a few minutes. If implemented, the tool might need to handle >10,000 files.

@akram-pic what do you think of this file renaming, etc.?

damhuonglan commented 1 year ago

@hrotman-pic Can you please explain more about the new name that you recommend for SmartSolo data files? In your example:

1 for ???
2105 for ???
453005500 for serial number
0001 for ???

hrotman-pic commented 1 year ago

Of course.

1 for array number
2105 for station ID
0001 for SEGD file number for that serial number and channel.

Instead of 0001, the start date-time (human-readable or epoch) makes sense too, because it would also help form a unique filename. However, the date is longer, and sometimes it's useful to see the whole filename in the DAS table entry.

Examples of the first two days for all channels for this node are:
453005500.0001.2020.07.02.15.19.44.000.E.segd --> SSolo_1_2105_453005500_0001_E.segd
453005500.0001.2020.07.02.15.19.44.000.N.segd --> SSolo_1_2105_453005500_0001_N.segd
453005500.0001.2020.07.02.15.19.44.000.Z.segd --> SSolo_1_2105_453005500_0001_Z.segd
453005500.0002.2020.07.03.00.00.00.000.E.segd --> SSolo_1_2105_453005500_0002_E.segd
453005500.0002.2020.07.03.00.00.00.000.N.segd --> SSolo_1_2105_453005500_0002_N.segd
453005500.0002.2020.07.03.00.00.00.000.Z.segd --> SSolo_1_2105_453005500_0002_Z.segd
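As a rough sketch of the renaming shown in these examples, something like the following could work. It is not the unsimpleton implementation: `ssolo_name` and `link_renamed` are hypothetical helpers, and the array and station are passed in as arguments on the assumption that they have already been read from the SEGD headers by other code.

```python
import os


def ssolo_name(original, array, station):
    """Build SSolo_<array>_<station>_<serial>_<file number>_<channel>.segd."""
    fields = os.path.basename(original).split(".")
    # Field 0 is the node serial, field 1 the file number, and the field
    # just before the "segd" extension is the channel (E, N or Z).
    serial, file_number, channel = fields[0], fields[1], fields[-2]
    return "SSolo_{}_{}_{}_{}_{}.segd".format(
        array, station, serial, file_number, channel)


def link_renamed(original, array, station, out_dir):
    """Hard-link original into out_dir under the new name; the source is untouched."""
    target = os.path.join(out_dir, ssolo_name(original, array, station))
    os.link(original, target)
    return target
```

Hard links avoid copying the raw data, but they require the output directory to be on the same filesystem as the input files.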

damhuonglan commented 1 year ago

Added code to pforma to read the header of each SmartSolo file and use that info to sort the files into mini-ph5 files. However, this approach takes a long time to process. When time is available, I will implement a tool that reads the headers and writes a map to a text file with the format:

<pathtofile>:<array>X<station>

pforma will then read from this map file instead of from the name list file. This way, the header-reading step is separated out.

damhuonglan commented 10 months ago

Improved the process by implementing a tool that creates a map file, mapping each filename to its corresponding array and station, with the format: <pathtofile>:<array>X<station>

pforma loads the map file to find the files to read and uses the corresponding <array>X<station> as the das name when adding das info.
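For illustration, reading a map file in this format might look like the sketch below. It is not the code from the PH5 repository: `parse_map_line` and `load_map` are hypothetical names, and skipping blank lines is an assumption about the file rather than documented behavior.

```python
def parse_map_line(line):
    """Split '<pathtofile>:<array>X<station>' into (path, array, station)."""
    # Split on the last colon so that a colon inside the path is preserved.
    path, das = line.strip().rsplit(":", 1)
    array, station = das.split("X", 1)
    return path, array, station


def load_map(lines):
    """Return {das name: [paths]} from an iterable of map-file lines."""
    by_das = {}
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        path, array, station = parse_map_line(line)
        by_das.setdefault("{}X{}".format(array, station), []).append(path)
    return by_das
```

Grouping the paths by das name up front means every file for a given <array>X<station> can be sent to the same mini-ph5 file, even when the files come from different nodes.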

hrotman-pic commented 3 months ago

Resolved by 513 and 520.

PIC-IRIS / PH5

[BUG] pforma is sorting SmartSolo data by file name, not das serial number #502