utterances-bot commented 2 years ago

Hail - (5) Matrix Table - Spark with Genomics

Matrix Table

https://a7420174.github.io/hail/Hail-5/

beomjinjang commented 2 years ago

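I have a question about converting a large VCF file to a MatrixTable (mt). I am running out of memory, so I would like to know how to split the data by chromosome, convert each part to an mt, and then combine the results.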

a7420174 commented 2 years ago

Hello, and thank you for the question. I also work with VCFs that are split by chromosome fairly often, so here is the code I use.

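import hail as hl

vcf_dir = '/path/to/vcfs/'  # placeholder: set this to the directory holding the per-chromosome VCFs
chr_list = ['chr' + str(x) for x in range(1, 23)] + ['chrX', 'chrY']
vcf_paths = [vcf_dir + 'ALL.' + chrom + '.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz' for chrom in chr_list]
# Map bare contig names ('1', 'X') to the 'chr'-prefixed names used by GRCh38
recode = {f"{i}": f"chr{i}" for i in (list(range(1, 23)) + ['X', 'Y'])}
# If your default reference is not GRCh38, also pass reference_genome='GRCh38' here
tables = [hl.import_vcf(vcf_path, force_bgz=True, contig_recoding=recode) for vcf_path in vcf_paths]
mt = hl.MatrixTable.union_rows(*tables)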

The code above is what I used to load the VCFs from the 1000 Genomes Project database with Hail. Put the MatrixTables in a list and combine them with the union_rows method.
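If memory is still tight, one possible variation is to write each chromosome to disk as its own MatrixTable first and only then read the pieces back and union them. The sketch below is only an illustration of that idea, not the method used in the post: the helper names (`chromosome_names`, `build_vcf_paths`, `build_contig_recoding`, `convert_per_chromosome`) and the `out_dir` argument are hypothetical, and passing `reference_genome="GRCh38"` is an assumption based on the file names.

```python
from typing import Dict, List

# File name pattern taken from the code above; '{chrom}' is filled in per chromosome.
VCF_TEMPLATE = "ALL.{chrom}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz"


def chromosome_names() -> List[str]:
    """Return 'chr1' .. 'chr22', 'chrX', 'chrY'."""
    return [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def build_vcf_paths(vcf_dir: str, chroms: List[str]) -> Dict[str, str]:
    """Map each chromosome name to the path of its VCF inside vcf_dir."""
    base = vcf_dir.rstrip("/")
    return {c: f"{base}/{VCF_TEMPLATE.format(chrom=c)}" for c in chroms}


def build_contig_recoding() -> Dict[str, str]:
    """Map bare contig names ('1', 'X') to 'chr'-prefixed names ('chr1', 'chrX')."""
    return {c[3:]: c for c in chromosome_names()}


def convert_per_chromosome(vcf_dir: str, out_dir: str):
    """Write one MatrixTable per chromosome, then return the union of all of them."""
    import hail as hl  # imported here so the helpers above work without Hail installed

    recode = build_contig_recoding()
    out_paths = []
    for chrom, vcf_path in build_vcf_paths(vcf_dir, chromosome_names()).items():
        mt = hl.import_vcf(
            vcf_path,
            force_bgz=True,
            contig_recoding=recode,
            reference_genome="GRCh38",  # assumption: the files are GRCh38, as their names suggest
        )
        out_path = f"{out_dir.rstrip('/')}/{chrom}.mt"
        mt.write(out_path, overwrite=True)  # materialize this chromosome on disk
        out_paths.append(out_path)

    # Reading a written MatrixTable is cheap; union_rows then stacks the variants.
    parts = [hl.read_matrix_table(p) for p in out_paths]
    return hl.MatrixTable.union_rows(*parts)
```

Calling `mt = convert_per_chromosome(vcf_dir, 'some/output/dir')` would then give a single MatrixTable. Note that union_rows expects every piece to have the same samples in the same order, which should hold when all the VCFs come from the same release.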

a7420174 / a7420174.github.io

hail/Hail-5/ #2
