Read large CSV in chunks

billingross / genomics-england-challenge

Technical challenge for Genomics England bioinformatics engineer role

MIT License

Read large CSV in chunks #2

Open billingross opened 1 month ago

billingross commented 1 month ago

Reading a CSV from S3:

https://medium.com/@kn.lakshmi948/reading-csv-file-from-amazon-s3-bucket-using-csv-module-in-python-2bd1ed48c0ca
It looks like boto3, the AWS SDK for Python, allows reading objects from S3 in chunks if the file is not compressed: https://stackoverflow.com/questions/51085539/streaming-in-chunking-csvs-from-s3-to-python
AWS S3 Select can read a BZIP2-compressed CSV file: https://stackoverflow.com/questions/41161006/reading-contents-of-a-gzip-file-from-a-aws-s3-in-python
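
A minimal sketch of the chunked-read idea for an uncompressed CSV, assuming boto3 is installed and credentials are configured. The function names, the chunk size, and the bucket/key arguments are hypothetical; `get_object` and `StreamingBody.iter_lines` are the boto3 calls relied on.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List


def iter_row_chunks(lines: Iterable[bytes], chunk_size: int = 1000,
                    delimiter: str = ",") -> Iterator[List[List[str]]]:
    """Yield lists of up to chunk_size parsed rows from an iterable of byte lines."""
    decoded = (line.decode("utf-8") for line in lines)
    reader = csv.reader(decoded, delimiter=delimiter)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def iter_s3_row_chunks(bucket: str, key: str,
                       chunk_size: int = 1000) -> Iterator[List[List[str]]]:
    """Stream an uncompressed CSV object from S3 without loading it all at once."""
    import boto3  # imported lazily so the parsing helper works without AWS access

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    # StreamingBody.iter_lines() yields one line of bytes at a time.
    yield from iter_row_chunks(body.iter_lines(), chunk_size)
```

Because the lines are split before parsing, quoted fields that contain embedded newlines would not survive this approach; that is an accepted limitation of the sketch.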

billingross commented 1 month ago

I want to use AWS S3 Select because then I don't have to move the source object, and it natively supports reading BZIP2-compressed CSVs in chunks. I could implement it in a Lambda function. It seems like the perfect solution (except that I have a TSV).
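
A rough sketch of what the S3 Select call could look like, assuming boto3 and a BZIP2-compressed, header-bearing delimited file. The helper names and the bucket, key, and SQL expression are hypothetical. On the TSV point: the CSV input serialization accepts a `FieldDelimiter`, so passing a tab may be enough, though that is untested here.

```python
from typing import Any, Dict, Iterable, Iterator


def build_select_request(bucket: str, key: str, expression: str,
                         delimiter: str = ",") -> Dict[str, Any]:
    """Build keyword arguments for S3 select_object_content on a BZIP2 CSV/TSV."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            # FileHeaderInfo=USE treats the first line as column names.
            "CSV": {"FileHeaderInfo": "USE", "FieldDelimiter": delimiter},
            "CompressionType": "BZIP2",
        },
        "OutputSerialization": {"CSV": {}},
    }


def iter_record_payloads(events: Iterable[Dict[str, Any]]) -> Iterator[bytes]:
    """Yield raw record bytes from the S3 Select event stream, skipping other events."""
    for event in events:
        if "Records" in event:
            # A payload may end mid-row; callers should buffer before splitting lines.
            yield event["Records"]["Payload"]


def select_rows(bucket: str, key: str, expression: str,
                delimiter: str = ",") -> Iterator[bytes]:
    """Run the query against S3 and stream back the matching record bytes."""
    import boto3  # lazy import: the helpers above need no AWS access

    request = build_select_request(bucket, key, expression, delimiter)
    response = boto3.client("s3").select_object_content(**request)
    yield from iter_record_payloads(response["Payload"])
```

A call such as `select_rows("my-bucket", "data.tsv.bz2", "SELECT * FROM s3object s LIMIT 5", delimiter="\t")` would be the TSV variant; the bucket and key are placeholders.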

billingross commented 1 month ago

[ ] Decompress the tsv.bgz gnomAD file
[ ] Read it and see if the data could be useful
[ ] Convert to CSV and BZIP2-compress
[ ] Try reading with AWS S3 Select

billingross commented 1 month ago

Never mind, .bgz is a "blocked gzip" file, a bioinformatics-specific format, so it is not usable here.
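
For local inspection, the blocked gzip format is built from concatenated gzip members, so a standard gzip reader can generally decompress it sequentially. A minimal sketch, assuming the file is on local disk and tab-separated; the function name and chunk size are hypothetical, and whether S3 Select accepts such a file is a separate question not tested here.

```python
import csv
import gzip
from itertools import islice
from typing import Iterator, List


def iter_bgz_tsv_chunks(path: str, chunk_size: int = 1000) -> Iterator[List[List[str]]]:
    """Yield lists of up to chunk_size rows from a (blocked) gzip TSV file."""
    # gzip.open reads through consecutive gzip members transparently.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk
```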

billingross commented 1 month ago

Just looked at the gnomAD data:

$ head gnomad.genomes.v4.1.allele_number_all_sites.tsv
locus   AN
chr1:10001      16
chr1:10002      78
chr1:10003      200
chr1:10004      948
chr1:10005      1774
chr1:10006      2374
chr1:10007      2410
chr1:10008      3494
chr1:10009      3908

This is not useful: the file only has a locus and an allele number (AN) per site.

billingross commented 1 month ago

RefSNP data is in BZIP2-compressed JSON format, which is supported by AWS S3 Select. I'm going to try that instead.

billingross commented 1 month ago

Example script for parsing RefSNP data: https://ftp.ncbi.nih.gov/snp/latest_release/JSON/rsjson_demo.py
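
Independently of that demo script, a minimal local sketch of reading such a file record by record, assuming one JSON object per line in the BZIP2 file. The function name is hypothetical, and the `refsnp_id` key in the usage note is an assumption about the record layout.

```python
import bz2
import json
from typing import Any, Dict, Iterator


def iter_refsnp_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a BZIP2-compressed file."""
    # bz2.open decompresses incrementally, so the whole file is never held in memory.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Usage would look like `for record in iter_refsnp_records(path): print(record.get("refsnp_id"))`, with `path` pointing at a downloaded RefSNP JSON file.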

billingross commented 1 month ago

Create a Lambda function to read the file.
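
One possible shape for that function, as a sketch only: it assumes the event carries hypothetical `bucket`, `key`, and optional `expression` fields, and that the object is BZIP2-compressed JSON with one record per line. The `s3_client` parameter is an illustrative addition for local testing; Lambda itself passes only `event` and `context`. Whether S3 Select copes with the size of individual RefSNP records is untested here.

```python
import json
from typing import Any, Dict, Optional


def lambda_handler(event: Dict[str, Any], context: Any,
                   s3_client: Optional[Any] = None) -> Dict[str, Any]:
    """Query a BZIP2-compressed JSON Lines object with S3 Select and return the rows."""
    if s3_client is None:
        import boto3  # available by default in the AWS Lambda Python runtime
        s3_client = boto3.client("s3")

    response = s3_client.select_object_content(
        Bucket=event["bucket"],
        Key=event["key"],
        ExpressionType="SQL",
        # The default query is a placeholder that returns the first few records.
        Expression=event.get("expression", "SELECT * FROM s3object s LIMIT 10"),
        InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "BZIP2"},
        OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
    )

    # Join all record payloads first, since one payload may end mid-record.
    payload = b"".join(
        e["Records"]["Payload"] for e in response["Payload"] if "Records" in e
    )
    records = [json.loads(line) for line in payload.decode("utf-8").splitlines() if line]
    return {"statusCode": 200, "count": len(records), "records": records}
```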