billingross / genomics-england-challenge

Technical challenge for Genomics England bioinformatics engineer role
MIT License

Read large CSV in chunks #2

Open billingross opened 1 month ago

billingross commented 1 month ago

Reading a CSV from S3:

billingross commented 1 month ago

I want to use AWS S3 Select because then I don't have to move the source object, and it natively supports reading BZIP2-compressed CSVs in chunks. I could implement it in a Lambda function. It seems like the perfect solution (except that I have a TSV).
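
A minimal sketch of what such an S3 Select call might look like from Python with boto3. The function names, bucket, and key are placeholders and not from this repo. It assumes a TSV can be passed through the CSV input format by setting a tab `FieldDelimiter`.

```python
def build_select_request(bucket, key, expression, delimiter="\t", compression="GZIP"):
    """Build keyword arguments for S3 `select_object_content` on a delimited file."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            # A TSV is treated as CSV with a tab delimiter.
            "CSV": {"FileHeaderInfo": "USE", "FieldDelimiter": delimiter},
            "CompressionType": compression,  # NONE, GZIP or BZIP2
        },
        "OutputSerialization": {"JSON": {}},
    }


def iter_record_chunks(event_stream):
    """Yield the raw bytes of each Records event in a select response payload."""
    for event in event_stream:
        if "Records" in event:
            yield event["Records"]["Payload"]


def select_rows(bucket, key, expression, **kwargs):
    """Run the query and yield result bytes chunk by chunk (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    s3 = boto3.client("s3")
    request = build_select_request(bucket, key, expression, **kwargs)
    response = s3.select_object_content(**request)
    yield from iter_record_chunks(response["Payload"])
```

A record can be split across two Records events, so the yielded bytes should be joined (or buffered up to the last newline) before parsing.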

billingross commented 1 month ago

Never mind: .bgz is a "blocked gzip" file, a bioinformatics-specific format, so it is unusable here.

billingross commented 1 month ago

Just looked at the gnomAD data:

$ head gnomad.genomes.v4.1.allele_number_all_sites.tsv
locus   AN
chr1:10001      16
chr1:10002      78
chr1:10003      200
chr1:10004      948
chr1:10005      1774
chr1:10006      2374
chr1:10007      2410
chr1:10008      3494
chr1:10009      3908

This is not useful.
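
For reference, a file with this two-column layout could still be read in chunks with the standard library alone. This is a minimal sketch, assuming the columns are tab-separated as the .tsv extension suggests. `read_tsv_in_chunks` and `split_locus` are illustrative names.

```python
import csv
from itertools import islice


def read_tsv_in_chunks(handle, chunk_size=1000):
    """Yield lists of row dicts, at most `chunk_size` rows per list."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def split_locus(locus):
    """Split a locus such as 'chr1:10001' into ('chr1', 10001)."""
    chrom, pos = locus.split(":")
    return chrom, int(pos)
```

`handle` can be any iterable of text lines, for example an open file or the output of `gzip.open(path, "rt")`.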

billingross commented 1 month ago

RefSNP data is in BZIP2 compressed JSON format, which is supported by AWS S3 Select. I'm going to try that instead.

billingross commented 1 month ago

Example script for parsing RefSNP data: https://ftp.ncbi.nih.gov/snp/latest_release/JSON/rsjson_demo.py
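
A minimal local sketch in the same spirit, not a copy of the linked script. It assumes the RefSNP files hold one JSON object per line and that each object carries a `refsnp_id` field. `iter_refsnp` is an illustrative name.

```python
import bz2
import json


def iter_refsnp(path):
    """Yield one decoded object per line of a BZIP2-compressed JSON Lines file."""
    # bz2.open decompresses incrementally, so the whole file is never in memory.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```

Usage would look like `for rs in iter_refsnp("refsnp-sample.json.bz2"): print(rs["refsnp_id"])`, where the file name is a placeholder.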

billingross commented 1 month ago

Next: create a Lambda function to read the file.
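
A rough sketch of what that Lambda handler might look like, assuming the BZIP2-compressed JSON Lines input discussed above. The event fields (`bucket`, `key`, `expression`) and the `s3_client` parameter are hypothetical, and the default query is only an illustration.

```python
import json


def handler(event, context, s3_client=None):
    """Lambda entry point: run an S3 Select query on a BZIP2 JSON Lines object."""
    if s3_client is None:
        import boto3  # provided by the AWS Lambda Python runtime

        s3_client = boto3.client("s3")

    response = s3_client.select_object_content(
        Bucket=event["bucket"],  # hypothetical event fields
        Key=event["key"],
        ExpressionType="SQL",
        Expression=event.get(
            "expression", "SELECT s.refsnp_id FROM s3object s LIMIT 10"
        ),
        InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "BZIP2"},
        OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
    )

    # Records events carry raw bytes and a record may be split across events,
    # so join everything before decoding.
    body = b"".join(
        ev["Records"]["Payload"] for ev in response["Payload"] if "Records" in ev
    )
    rows = [json.loads(line) for line in body.decode("utf-8").splitlines() if line]
    return {"count": len(rows), "rows": rows}
```

Passing the client as an optional argument keeps the handler testable without AWS credentials; Lambda itself only supplies `event` and `context`. Collecting all rows in memory is only reasonable for small result sets such as the `LIMIT 10` default.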