Paper preparation tasks - Githubissues

4pygmalion commented 7 months ago

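The evidence needed to demonstrate the strengths of the LLM-enhanced semantic similarity paper is listed below.

Required models

[x] : - Pretrained model (pretext-trained model): base model checkpoint -> Benny has secured it
[x] : - Model fine-tuned on 3billion data (fine-tuned model): checkpoint -> needs to be located in Tyler's Jupyter notebooks. We should probably schedule a time with Benny near Seolleung to look for it @4pygmalion (http://182.208.81.130:16003/#/experiments/25/runs/344a4bd15cd84895af607e2c36d2a4cd)

Required deliverables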

[x] : Figure1. LaRA. Method overview(https://github.com/4pygmalion/LaRa/blob/main/data/images/LaRa.png) @4pygmalion

[ ] : Table1. Demographic characteristics (patient information from the in-house dataset) @wjeong53

A table like the following:

| Variable                 | Eligible participant (N=***) |
|--------------------------|------------------------------|
| Gender (n, (%))          |                              |
|   Male                   |                              |
|   Female                 |                              |
| N phenotypes (mean [SD]) |                              |
| Phenotypes               |                              |
|   Nervous system         |                              |
|   Musculoskeletal system |                              |
|   Head or neck           |                              |
|   Eye                    |                              |
|   Cardiovascular system  |                              |
|   Others                 |                              |

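=> Purpose: This table does not carry much weight in the main text. It only needs to support the claim that our patient dataset was not drawn from highly unusual cases and that we analyzed typical rare disease patients. Medical papers often present patient demographic statistics as Result 1. Ours is an ML problem, but the subjects of the analysis are rare disease patients (a medical population), so I think it is fine to include it. Please let me know if you have an opinion on this.
=> Other: 3ASC also included a similar table in its paper. It may help to refer to Table 1 in the manuscript below.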

[ ] : Figure2. Disease prioritization performance in the in-house dataset @wjeong53
- dataset: in-house dataset
- figure type: please draw a boxplot or a lineplot.
- figure configuration: X-axis: k, Y-axis: Top-k recall, hue: method (LaRA, Pheno2disease, information content (Resnik-based))
- Expected figure => Purpose: 3billion has a very large number of rare disease patients, and this is a real-world dataset (RWD); evaluated on it, our method came out ahead. We need to argue that it is an RWD because, in practice, physicians enter only about 2 to 3 symptoms per patient. The online dataset below is also described as an RWD, but its patients have a very large number of phenotypes, around 20 to 30 each.
- Note: compare using the pretrained model fine-tuned on our dataset, and also include the pretrained-only model in the comparison
[ ] : Figure 3: Disease prioritization performance in a publicly available dataset @wjeong53
- dataset: cohort 1 of pheno2disease [3]
- Purpose: This dataset is an RWD, but it has so many phenotypes per patient that it is not a realistic dataset. Several papers use it as a benchmark, and our model performed well on it.
- figure configuration: X-axis: k, Y-axis: Top-k recall, hue: method (LaRA, Pheno2disease, information content (Resnik-based))
- Expected figure
[ ] : Figure 4: Individual case review (post hoc interpretation) @wjeong53
- Post hoc interpretation using attention
- I think the Still's disease case was a good one.
- method: @100jy could you describe the method in detail on this issue page?
- JSON loading and dumping, and patient phenotype vector generation: @4pygmalion, add a dataclass method so that a patient can be built from JSON using the factory pattern
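
The Top-k recall on the Y-axis of Figures 2 and 3 can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the function name `top_k_recall`, the dictionary layout, and the toy patient and disease IDs are all assumptions made for the example. A patient counts as a hit at k when the confirmed disease appears among the first k ranked candidates.

```python
from typing import Dict, List, Sequence


def top_k_recall(
    rankings: Dict[str, List[str]],
    truths: Dict[str, str],
    ks: Sequence[int],
) -> Dict[int, float]:
    """Fraction of patients whose confirmed disease is within the top k candidates."""
    if not truths:
        raise ValueError("no patients to evaluate")
    recalls: Dict[int, float] = {}
    for k in ks:
        hits = sum(
            1
            for patient_id, disease in truths.items()
            if disease in rankings.get(patient_id, [])[:k]
        )
        recalls[k] = hits / len(truths)
    return recalls


# Toy data, invented for illustration only.
# Each list is one method's ranked disease candidates for a patient, best first.
rankings = {
    "p1": ["D1", "D2", "D3"],
    "p2": ["D3", "D1", "D2"],
    "p3": ["D2", "D3", "D1"],
    "p4": ["D1", "D3", "D2"],
}
truths = {"p1": "D1", "p2": "D1", "p3": "D1", "p4": "D2"}

recalls = top_k_recall(rankings, truths, ks=[1, 2, 3])
print(recalls)
```

Running this once per method (LaRA, Pheno2disease, the Resnik-based baseline) gives one recall value per (method, k) pair, which is the long format that a lineplot with hue set to the method expects.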

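Considerations

We need to decide whether Table 1 should cover both the in-house dataset (d1) and the online available dataset (d2), or whether d2 should not be reported separately.
Whether to do causal-gene-level prioritization (the additional experiments would take extra effort. It would be nice to have, but it requires building synthetic patients based on causal genes and retraining)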

Authorship

Option 1: Benny, Heon (co-first authors), Tyler (second), Kyle (corresponding): is it right to include Kyle here?
Option 2: Benny (first author), Tyler (second), Heon (corresponding)
Option 3: Benny, Tyler (co-first authors), Heon (corresponding)

References

[1] 3ASC_v2.docx
[2] Pheno2disease: the paper for a method that claims to be SOTA in bioinformatics. https://academic.oup.com/bib/article/24/4/bbad172/7185480?login=false
[3] https://zenodo.org/records/3905420

100jy commented 7 months ago

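Interpreting symptom importance through attention weights

File reference: notebook file
In the transformer architecture, attention values are obtained as one tensor per layer, each with the shape below
- ["[num_heads, sequence_length, sequence_length]", ...]

Taking Charcot-Marie-Tooth disease as an example:

Taking the overall mean over (layer, num_heads, sequence_length) yields an example like the following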

Charcot-Marie-Tooth disease, type 4B2
[('Onion bulb formation', tensor(0.0672, device='cuda:0')),
('Hyporeflexia', tensor(0.0672, device='cuda:0')),
('Kyphoscoliosis', tensor(0.0671, device='cuda:0')),
('Areflexia', tensor(0.0670, device='cuda:0')),
('Decreased motor nerve conduction velocity',
tensor(0.0670, device='cuda:0')),
('Segmental peripheral demyelination/remyelination',
tensor(0.0668, device='cuda:0')),
('Pes cavus', tensor(0.0668, device='cuda:0')),
('Split hand', tensor(0.0667, device='cuda:0')),
('Difficulty walking', tensor(0.0665, device='cuda:0')),
('Talipes equinovarus', tensor(0.0665, device='cuda:0')),
('Steppage gait', tensor(0.0665, device='cuda:0')),
('Juvenile onset', tensor(0.0662, device='cuda:0')),
('Distal amyotrophy', tensor(0.0662, device='cuda:0')),
('Distal muscle weakness', tensor(0.0662, device='cuda:0')),
('Ulnar claw', tensor(0.0661, device='cuda:0'))]
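
The mean aggregation described above can be sketched in plain Python. This is an illustration under assumptions, not the notebook's code: the function name is hypothetical, each layer is assumed to be a nested list shaped [num_heads, sequence_length, sequence_length] with rows indexed by query position, and the sequence_length axis being averaged is assumed to be the query axis, so each score is the average attention a token receives. The numbers are toy values.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # [sequence_length][sequence_length], rows = query positions


def mean_attention_per_token(
    attentions: Sequence[Sequence[Matrix]],
    tokens: Sequence[str],
) -> List[Tuple[str, float]]:
    """Average attention over layers, heads and query positions, sorted high to low."""
    totals = [0.0] * len(tokens)
    n_rows = 0
    for layer in attentions:        # one entry per layer
        for head in layer:          # one [seq, seq] matrix per head
            for query_row in head:  # attention paid by one query position
                if len(query_row) != len(tokens):
                    raise ValueError("attention row length must match number of tokens")
                for key_idx, weight in enumerate(query_row):
                    totals[key_idx] += weight
                n_rows += 1
    if n_rows == 0:
        raise ValueError("no attention values given")
    scores = [total / n_rows for total in totals]
    return sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)


# Toy input: 2 layers x 2 heads over 2 tokens (values invented for illustration).
attentions = [
    [[[0.5, 0.5], [0.5, 0.5]], [[0.75, 0.25], [0.25, 0.75]]],
    [[[1.0, 0.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]],
]
ranked = mean_attention_per_token(attentions, ["A", "B"])
print(ranked)
```

Under this assumption the scores sum to 1 whenever each attention row is a softmax output, which would be consistent with the 15 phenotypes in the listing above each scoring close to 1/15 ≈ 0.067.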

These results differ depending on the aggregation method.
However, as I recall, 'Onion bulb formation' came out at the top regardless (this needs to be verified).
- Reference: https://www.ncbi.nlm.nih.gov/gtr/conditions/C1858278/

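In addition, a [paper](https://arxiv.org/pdf/1906.04341.pdf) on interpreting attention weights shows that each head attends to very different patterns
- In our model as well, the attention maps differ from layer to layer
- Nevertheless, many earlier studies (before 2019) interpret attention using mean or max aggregation
- Hence there are also papers proposing new interpretation methods (see reference)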
  
  Previous work analyzing how representations are formed by the Transformer’s multi-head attention mechanism focused on either the average or the maximum attention weights over all heads (Voita et al., 2018; Tang et al., 2018), but neither method explicitly takes into account the varying importance of different heads.
Additional references
- Attention weight patterns in a causal model (GPT-2)
  - This also points out that different heads and layers attend to different things
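
To make the point about aggregation concrete, the sketch below ranks tokens twice from the same attention maps, once combining heads by mean and once by max. It is a toy illustration, not our model's code: the helper names are hypothetical, the input layout (a list of layers, each a list of [sequence_length, sequence_length] head matrices with rows as query positions) is an assumption, and the values are invented so that one head disagrees with the other two.

```python
from statistics import fmean
from typing import Callable, List, Sequence

Head = List[List[float]]  # [sequence_length][sequence_length], rows = query positions


def received_attention(head: Head) -> List[float]:
    """Mean attention each token receives, averaged over query positions in one head."""
    n_tokens = len(head[0])
    return [sum(row[k] for row in head) / len(head) for k in range(n_tokens)]


def rank_tokens(
    attentions: Sequence[Sequence[Head]],
    tokens: Sequence[str],
    reduce: Callable[[List[float]], float],
) -> List[str]:
    """Rank tokens after combining per-head scores with `reduce` (e.g. fmean or max)."""
    per_head = [received_attention(head) for layer in attentions for head in layer]
    scores = [
        reduce([head_scores[k] for head_scores in per_head])
        for k in range(len(tokens))
    ]
    order = sorted(range(len(tokens)), key=lambda k: scores[k], reverse=True)
    return [tokens[k] for k in order]


# Toy input: one layer with three heads over two tokens (values invented).
attentions = [[
    [[0.7, 0.3], [0.7, 0.3]],  # head 0 favours token A
    [[0.7, 0.3], [0.7, 0.3]],  # head 1 favours token A
    [[0.2, 0.8], [0.2, 0.8]],  # head 2 strongly favours token B
]]
tokens = ["A", "B"]

by_mean = rank_tokens(attentions, tokens, fmean)  # A first: two of three heads prefer it
by_max = rank_tokens(attentions, tokens, max)     # B first: one head gives it the largest weight
print(by_mean, by_max)
```

With these values the mean ranking puts A first while the max ranking puts B first, so a single strongly focused head can flip the interpretation depending on how heads are combined.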

4pygmalion commented 7 months ago

Legacy code

train_simple_clr.py: creates the pretrained model
finetune.py: creates the fine-tuned model. Parameters for that experiment: http://182.208.81.130:16003/#/experiments/25/runs/89f8a261b7d34bb3be5dc294b538b7ff

4pygmalion commented 7 months ago

manuscript

https://1drv.ms/f/s!Aq49y5RDk75fgtNz-2kX5cPnY6hEcg?e=UNN5Id

Draft document Benny is working on: https://docs.google.com/document/d/1Wnn7cBUGG_c9atcVRYDn4vleHBqdEb3fR_23fw118H4/edit

100jy commented 7 months ago

Legacy code

train_simple_clr.py: creates the pretrained model

finetune.py: creates the fine-tuned model. Parameters for that experiment: http://182.208.81.130:16003/#/experiments/25/runs/89f8a261b7d34bb3be5dc294b538b7ff

Looks like I can still get in;;;;

4pygmalion commented 7 months ago

Legacy code
train_simple_clr.py: creates the pretrained model
finetune.py: creates the fine-tuned model. Parameters for that experiment: http://182.208.81.130:16003/#/experiments/25/runs/89f8a261b7d34bb3be5dc294b538b7ff

This will probably stay in this state for the next year or so, haha;;; Nobody seems to be using MLflow either since we left.

4pygmalion commented 6 months ago

[ ] Generate phen2disease JSON
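
A rough sketch of the JSON round trip discussed in this thread (building a patient from JSON through a dataclass factory method, producing a phenotype vector, and writing JSON back out). Everything here is an assumption for illustration: the class name `Patient`, the field names `patient_id` and `hpo_ids`, and the JSON layout are hypothetical and are not the input schema of phen2disease or the LaRa codebase.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Patient:
    patient_id: str
    hpo_ids: List[str] = field(default_factory=list)

    @classmethod
    def from_json(cls, text: str) -> "Patient":
        """Factory method: build a Patient from a JSON string (hypothetical schema)."""
        record = json.loads(text)
        return cls(
            patient_id=str(record["patient_id"]),
            hpo_ids=list(record.get("hpo_ids", [])),
        )

    def to_json(self) -> str:
        """Serialize back to JSON with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)

    def to_vector(self, vocabulary: List[str]) -> List[int]:
        """Multi-hot phenotype vector: 1 where the patient has the vocabulary term."""
        present = set(self.hpo_ids)
        return [1 if term in present else 0 for term in vocabulary]


# Example record with HPO-style identifiers (illustrative only).
raw = '{"patient_id": "P001", "hpo_ids": ["HP:0001761", "HP:0001284"]}'
patient = Patient.from_json(raw)
vector = patient.to_vector(["HP:0001284", "HP:0000001", "HP:0001761"])
print(patient.to_json(), vector)
```

Keeping construction in a classmethod means callers never parse JSON themselves, so a change in the file layout only touches `from_json`.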

4pygmalion / LaRa

Paper preparation tasks #2
