Closed jsturdy closed 5 years ago
What about using an inner join in a single query ?
query=('SELECT data.* FROM CMS_GEM_MUON_VIEW.{0} data'
'INNER JOIN (SELECT vfat3_barcode, MAX(run_number) AS run_number FROM CMS_GEM_MUON_VIEW.{0} GROUP BY vfat3_barcode) data_select'
'ON data.vfat3_barcode = data_select.vfat3_barcode AND data.run_number = data_select.run_number').format(view)
It is working with pandas
inside the python
interpreter.
[Works with the assumption that the pair (vfat3_barcode, run_number)
is unique.]
What about using an inner join in a single query ?
query=('SELECT data.* FROM CMS_GEM_MUON_VIEW.{0} data' 'INNER JOIN (SELECT vfat3_barcode, MAX(run_number) AS run_number FROM CMS_GEM_MUON_VIEW.{0} GROUP BY vfat3_barcode) data_select' 'ON data.vfat3_barcode = data_select.vfat3_barcode AND data.run_number = data_select.run_number').format(view)
It is working with
pandas
inside thepython
interpreter.[Works with the assumption that the pair
(vfat3_barcode, run_number)
is unique.]
No preference, your solution seems a bit faster (0.02s vs 0.08s in sql_developer
with a restriction to some 10 specific VFATs, and roughly equal times with no restriction), and is a bit cleaner
```python query=('SELECT data.* FROM CMS_GEM_MUON_VIEW.{0} data' 'INNER JOIN (SELECT vfat3_barcode, MAX(run_number) AS run_number FROM CMS_GEM_MUON_VIEW.{0} GROUP BY vfat3_barcode) data_select' 'ON data.vfat3_barcode = data_select.vfat3_barcode AND data.run_number = data_select.run_number').format(view)
> What about using an inner join in a single query ? > ```python > query=('SELECT data.* FROM CMS_GEM_MUON_VIEW.{0} data' > 'INNER JOIN (SELECT vfat3_barcode, MAX(run_number) AS run_number FROM CMS_GEM_MUON_VIEW.{0} GROUP BY vfat3_barcode) data_select' > 'ON data.vfat3_barcode = data_select.vfat3_barcode AND data.run_number = data_select.run_number').format(view) > ``` > > > It is working with `pandas` inside the `python` interpreter. > [Works with the assumption that the pair `(vfat3_barcode, run_number)` is unique.] No preference, your solution seems a bit faster (0.02s vs 0.08s in `sql_developer` with a restriction to some 10 specific VFATs, and roughly equal times with no restriction), and is a bit cleaner
So using @lpetre-ulb's query (needed to add two spaces at end of lines) we have a really fast query with the full DB:
In [2]: getGEMDBView("GEM_VFAT3_PROD_SUMMARY_V_RH",debug=True)
query = SELECT data.* FROM CMS_GEM_MUON_VIEW.GEM_VFAT3_PROD_SUMMARY_V_RH data INNER JOIN (SELECT vfat3_barcode, MAX(run_number) AS run_number FROM CMS_GEM_MUON_VIEW.GEM_VFAT3_PROD_SUMMARY_V_RH GROUP BY vfat3_barcode) data_select ON data.vfat3_barcode = data_select.vfat3_barcode AND data.run_number = data_select.run_number
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1993 entries, 0 to 1992
Data columns (total 33 columns):
vfat3_ser_num 1993 non-null object
vfat3_barcode 1993 non-null object
hw_id_version 1863 non-null float64
buffer_offset 1883 non-null float64
vref_adc 1883 non-null float64
v_bgr 1883 non-null float64
adc0m 1940 non-null float64
adc0b 1940 non-null float64
adc1m 1940 non-null float64
adc1b 1940 non-null float64
cal_dacm 1924 non-null float64
cal_dacb 1924 non-null float64
iref 1913 non-null float64
mean_thrshld 1865 non-null float64
mean_enc 1865 non-null float64
register_test 1883 non-null float64
ec_errors 1871 non-null float64
bc_errors 1871 non-null float64
crc_errors 1871 non-null float64
hit_errors 1871 non-null float64
noisy_channels 1865 non-null float64
dead_channels 1865 non-null float64
bist 1870 non-null float64
scan_chain 779 non-null float64
sleep_pwr_analog 1884 non-null float64
sleep_pwr_digital 1884 non-null float64
run_pwr_analog 1867 non-null float64
run_pwr_digital 1867 non-null float64
location 1350 non-null object
temperature 1792 non-null object
state 1872 non-null object
temperature_k2 899 non-null float64
run_number 1993 non-null int64
dtypes: float64(27), int64(1), object(5)
memory usage: 513.9+ KB
Read 1993 rows from view GEM_VFAT3_PROD_SUMMARY_V_RH
However it seems like only columns vfat3_ser_num
, vfat3_barcode
and run_number
are returned for every row. But this could be that the data was just never entered.
vfat3_ser_num vfat3_barcode hw_id_version buffer_offset vref_adc v_bgr adc0m adc0b ... sleep_pwr_digital run_pwr_analog run_pwr_digital location temperature state temperature_k2 run_number
1973 0x1b76 7030 196609.0 1.3125 3.0 342.500 1.91431 -323.352 ... 83.1294 227.5670 103.3640 CERN 14-5-28 509.374 yellow -105.7990 26
1974 0x1b77 7031 196609.0 1.2500 3.0 337.938 1.90202 -313.542 ... 86.5780 211.5550 106.3720 CERN 14-5-28 473.895 yellow -99.1382 26
1975 0x1b78 7032 196609.0 3.6875 3.0 346.375 1.93318 -327.015 ... 79.2409 199.6960 103.4160 CERN 14-5-28 465.59 green -96.6469 26
1976 0x47 71 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN None None red NaN 26
1977 0x48 72 196609.0 1.1250 3.0 344.188 1.92559 -322.946 ... 66.2382 277.5890 84.1675 None None None NaN 26
1978 0x49 73 196609.0 2.1250 3.0 342.500 1.91736 -316.938 ... 66.8892 301.4830 83.9739 None None None NaN 26
1979 0x4a 74 196609.0 2.4375 3.0 332.750 1.87028 -321.114 ... 67.3114 221.0040 84.0091 CERN 14-5-28 499.938 green -106.5100 26
1980 0x4b 75 196609.0 2.9375 3.0 345.812 1.94609 -329.394 ... 66.6604 218.9450 84.7129 None 445.152 green -92.6543 26
1981 0x8 8 NaN NaN NaN NaN 2.01358 -301.412 ... NaN NaN NaN None None green NaN 26
1982 0x22b8 8888 196609.0 1.8750 3.0 346.938 1.95359 -314.997 ... 80.3142 181.9600 100.0380 CERN 14-5-28 446.901 green -91.5159 26
1983 0x9 9 NaN NaN NaN NaN 2.03480 -329.537 ... NaN NaN NaN None None green NaN 26
1984 0x5a 90 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN None None None NaN 26
1985 0x5b 91 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN None None None NaN 26
Gonna try the same with @jsturdy's query.
I had to adapt @jsturdy's example slightly but the gist is:
dbName = os.getenv("GEM_ONLINE_DB_NAME")
dbConn = os.getenv("GEM_ONLINE_DB_CONN")
colquery="select COLUMN_NAME from ALL_TAB_COLUMNS where TABLE_NAME='{}'".format(view)
listOfColumns = pd.read_sql(colquery,con=(dbConn+dbName))
columns = ",".join([ str(colName) for colName in listOfColumns.column_name ])
query = "select * from (select {:s},max(RUN_NUMBER) over (partition by VFAT3_BARCODE) as LATEST_RUN_NUMBER from CMS_GEM_MUON_VIEW.{:s}) data where RUN_NUMBER = LATEST_RUN_NUMBER".format(columns,view)
if vfatList is not None:
query += getVFATFilter(vfatList)
pass
if debug:
print("query = {0}{1}{2}".format(colors.YELLOW,query,colors.ENDC))
pass
dfGEMView = pd.read_sql(query, con=(dbConn+dbName))
And this comes up with:
In [31]: getGEMDBView("GEM_VFAT3_PROD_SUMMARY_V_RH",debug=True)
query = SELECT data.* FROM CMS_GEM_MUON_VIEW.GEM_VFAT3_PROD_SUMMARY_V_RH data INNER JOIN (SELECT vfat3_barcode, MAX(run_number) AS run_number FROM CMS_GEM_MUON_VIEW.GEM_VFAT3_PROD_SUMMARY_V_RH GROUP BY vfat3_barcode) data_select ON dat
a.vfat3_barcode = data_select.vfat3_barcode AND data.run_number = data_select.run_number
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1993 entries, 0 to 1992
Data columns (total 33 columns):
vfat3_ser_num 1993 non-null object
vfat3_barcode 1993 non-null object
hw_id_version 1863 non-null float64
buffer_offset 1883 non-null float64
vref_adc 1883 non-null float64
v_bgr 1883 non-null float64
adc0m 1940 non-null float64
adc0b 1940 non-null float64
adc1m 1940 non-null float64
adc1b 1940 non-null float64
cal_dacm 1924 non-null float64
cal_dacb 1924 non-null float64
iref 1913 non-null float64
mean_thrshld 1865 non-null float64
mean_enc 1865 non-null float64
register_test 1883 non-null float64
ec_errors 1871 non-null float64
bc_errors 1871 non-null float64
crc_errors 1871 non-null float64
hit_errors 1871 non-null float64
noisy_channels 1865 non-null float64
dead_channels 1865 non-null float64
bist 1870 non-null float64
scan_chain 779 non-null float64
sleep_pwr_analog 1884 non-null float64
sleep_pwr_digital 1884 non-null float64
run_pwr_analog 1867 non-null float64
run_pwr_digital 1867 non-null float64
location 1350 non-null object
temperature 1792 non-null object
state 1872 non-null object
temperature_k2 899 non-null float64
run_number 1993 non-null int64
dtypes: float64(27), int64(1), object(5)
memory usage: 513.9+ KB
Read 1993 rows from view GEM_VFAT3_PROD_SUMMARY_V_RH
And looking at this dump it's the same as above. The records 1973 - 1985 in this case are:
1973 0x1b76 7030 196609.0 1.3125 3.0 342.500 1.91431 -323.352 ... 83.1294 227.5670 103.3640 CERN 14-5-28 509.374 yellow -105.7990 26
1974 0x1b77 7031 196609.0 1.2500 3.0 337.938 1.90202 -313.542 ... 86.5780 211.5550 106.3720 CERN 14-5-28 473.895 yellow -99.1382 26
1975 0x1b78 7032 196609.0 3.6875 3.0 346.375 1.93318 -327.015 ... 79.2409 199.6960 103.4160 CERN 14-5-28 465.59 green -96.6469 26
1976 0x47 71 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN None None red NaN 26
1977 0x48 72 196609.0 1.1250 3.0 344.188 1.92559 -322.946 ... 66.2382 277.5890 84.1675 None None None NaN 26
1978 0x49 73 196609.0 2.1250 3.0 342.500 1.91736 -316.938 ... 66.8892 301.4830 83.9739 None None None NaN 26
1979 0x4a 74 196609.0 2.4375 3.0 332.750 1.87028 -321.114 ... 67.3114 221.0040 84.0091 CERN 14-5-28 499.938 green -106.5100 26
1980 0x4b 75 196609.0 2.9375 3.0 345.812 1.94609 -329.394 ... 66.6604 218.9450 84.7129 None 445.152 green -92.6543 26
1981 0x8 8 NaN NaN NaN NaN 2.01358 -301.412 ... NaN NaN NaN None None green NaN 26
1982 0x22b8 8888 196609.0 1.8750 3.0 346.938 1.95359 -314.997 ... 80.3142 181.9600 100.0380 CERN 14-5-28 446.901 green -91.5159 26
1983 0x9 9 NaN NaN NaN NaN 2.03480 -329.537 ... NaN NaN NaN None None green NaN 26
1984 0x5a 90 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN None None None NaN 26
1985 0x5b 91 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN None None None NaN 26
So these two methods are indeed similar. But I agree that @lpetre-ulb's implementation is cleaner so I think I will prepare a PR with that.
Closed by #230
Brief summary of issue
The current DB query selects only VFATs with
RUN_NUMBER
set to the maximum value in the table, rather than selecting from all VFATs, only the record with the highestRUN_NUMBER
This results in missing VFATs when running any scripts, sometimes working, but usually only by luck.Types of issue
Expected Behavior
Any query should select uniquely one VFAT, and if multiple records exist for the same VFAT, the one with the highest
RUN_NUMBER
will be returnedPossible Solution (for bugs)
Changing the query to something like:
Context (for feature requests)
Tested with
python
interpreter and withsql_developer