JackieMium / my_blog

在 Issues 中建立的个人博客
GNU General Public License v3.0
31 stars 12 forks source link

跟着 mimic-code 探索 MIMIC 数据之 tutorials (一) #24

Open JackieMium opened 6 years ago

JackieMium commented 6 years ago

LongRoad

SQL 算是学完了,结果回去看 mimic-code 发现大多数脚本根本看不懂!想起来小学做数学习题:

  • 课本例题: 小明有 3 个苹果,吃了 1 个,请问小明还有几个苹果 ?

  • 课后习题:小华前天买了 5 个橘子,昨天吃了 1 个梨,请问小红今天还剩下几个苹果?

  • :卒.....

没有办法,我就把 mimic-code 翻来覆去地看,看看有没有什么我能看懂的。果然,MIMIC 很良心的,mimic-code/tutorials/ 里面就放了针对新人的几个简单的小课程,小课程搭配习题,答案也有,可以说是很好了。

一个个来看吧。


1. sql-intro

这个文档基本就是教我们 SQL 的了。基本上我就是泛泛地看了看。有几个值得记下来的:

How can we use temporary tables to help manage queries?

临时的表格可以用 WITH foo AS bar 这样的语法来存放。比如我们要得到从 patients 表格中得出年龄并做他用:

WITH patient_dates AS (
SELECT p.subject_id, p.dob, a.hadm_id, a.admittime,
    ( (cast(a.admittime as date) - cast(p.dob as date)) / 365.2 ) as age
FROM patients p
INNER JOIN admissions a
ON p.subject_id = a.subject_id
ORDER BY subject_id, hadm_id
)
SELECT *
FROM patient_dates;

另一个办法是使用 materialised views,即物化视图:

-- we begin by dropping any existing views with the same name
DROP MATERIALIZED VIEW IF EXISTS patient_dates_view;
CREATE MATERIALIZED VIEW patient_dates_view AS
SELECT p.subject_id, p.dob, a.hadm_id, a.admittime,
    ( (cast(a.admittime as date) - cast(p.dob as date)) / 365.2 ) as age
FROM patients p
INNER JOIN admissions a
ON p.subject_id = a.subject_id
ORDER BY subject_id, hadm_id;

CASE statement for if/else logic

CASE WHEN 是简单的逻辑判断语句。比如我们想对 icustays 中 ICU 住院时间长短 (los) 分组:

-- Use if/else logic to categorise length of stay
-- into 'short', 'medium', and 'long'
SELECT subject_id, hadm_id, icustay_id, los,
    CASE WHEN los < 2 THEN 'short'
         WHEN los >=2 AND los < 7 THEN 'medium'
         WHEN los >=7 THEN 'long'
         ELSE NULL END AS los_group
FROM icustays;

Window functions

Window functions 中文好像翻译为窗口函数,这个窗口其实是艾滋病感染潜伏期窗口那个意思,在 Bowtie2 之类的 RNA-Seq 数据比对之类的软件计算比对质量的时候也会用到这个这个概念。 不知道为什么 SQLBolt 竟然没有涉及到 Window functions,感觉很实用的功能。

Window functions 和 Aggregate 很像,但是 Aggregate 是聚合,会按照我们要求对相同的行进行合并,而 Window functions 则不同。用例子来看会很清楚,比如我们想要对同一个病人多次住 ICU 进行编号。这种情况下直接用 GROUP BY subject_id 会直接把同一个病人信息合并到一行,而我们想要的是每个病人每次入 ICU 的信息仍然单独是一行,顺序通过 admission_time 进行编号。这里的 就是 subject_id ,每个病人为一个处理单位,RANK() 生成顺序编号。 代码:

-- find the order of admissions to the ICU for a patient
SELECT subject_id, icustay_id, intime,
    RANK() OVER (PARTITION BY subject_id ORDER BY intime)
FROM icustays;

有了这样一个编号我们就可以很方便的筛选只住过一次 ICU 的病例了(这个在文献里经常看到):

-- select patients from icustays who've stayed in ICU for only once
WITH icustayorder AS (
SELECT subject_id, icustay_id, intime,
  RANK() OVER (PARTITION BY subject_id ORDER BY intime)
FROM icustays
)
SELECT *
FROM icustayorder
WHERE rank = 1;

Multiple temporary views

多个临时视图,这个在 mimic-code 里简直不要太常见。

services 表格包含了病人接受治疗的情况(比如是在外科还是内科这种):

-- find the care service provided to each hospital admission
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services;

但是这个表格里没有 icustay_id,我们只能通过 hadm_idJOIN

WITH serv as (
  SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
  FROM services
)
, icu as
(
  SELECT subject_id, hadm_id, icustay_id, intime, outtime
  FROM icustays
)
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id, icu.intime, icu.outtime
, serv.transfertime, serv.prev_service, serv.curr_service
FROM icu
INNER JOIN serv
ON icu.hadm_id = serv.hadm_id

但是,这个过程其实中间是有一些猫腻的。INNER JOIN 是取交集的:

Alt text

那么取完后的结果的行数肯定不多于之前的数据。但是我们看看我们的数据:

WITH serv as (
  SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
  FROM services
)
, icu as
(
  SELECT subject_id, hadm_id, icustay_id, intime, outtime
  FROM icustays
)
SELECT COUNT(*)
FROM icu
INNER JOIN serv
ON icu.hadm_id = serv.hadm_id

这个在我电脑上显示 78840 行,那我们再看 icustays 数据:

SELECT count(*)
FROM icustays;

61532行。哈哈,INNER JOIN 之后行数变多了,刺激!

下面很快给出了解释:

事实是,每个 hadm_id 可能对应了好几个 service 和好几个 icustay_id ,即一个病人院内转科和多次住 ICU 的情况。所以当通过 hadm_idJOIN 两个表的时候,在 hadm_id 相同而 icustay_idservices 不同时每种组合都会在结果里作为单独的一行。专业的解释:

More technically, the first query joined two tables on non-unique keys: there may be multiple hadm_id with the same value in the services table, and there may be multiple hadm_id with the same value in the admissions table. For example, if the services table has hadm_id = 100001 repeated N times, and the admissions table has hadm_id = 100001 repeated M times, then joining these two on hadm_id will result in a table with NxM rows: one for every pair. With MIMIC, it is generally very bad practice to join two tables on non-unique columns: at least one of the tables should have unique values for the column, otherwise you end up with duplicate rows and the query results can be confusing.

所以最后,我们可以通过在 services 里对相同的hadm_id 利用窗口函数排序,只留下第一个 service 记录,这样 hadm_id 也就变成了 unique key 了。

 WITH serv as (
   SELECT subject_id, hadm_id, transfertime, prev_service, curr_service,
    RANK() OVER (PARTITION BY hadm_id ORDER BY transfertime) as rank
   FROM services
   )
   , icu as
   (
   SELECT subject_id, hadm_id, icustay_id, intime, outtime
   FROM icustays
   )
   SELECT COUNT(*)
   FROM icu
   INNER JOIN serv
   ON icu.hadm_id = serv.hadm_id
   AND serv.rank = 1;

本来打算只写一点点做个笔记,没想到已经这么长了,那干脆分篇好了。

Ref


THE END