genome-in-a-bottle / giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle project.
trying to download chr22 subset with samtools #3

Closed splaisan closed 5 years ago

splaisan commented 6 years ago

I am trying to fetch only the chr22 reads from NIST_NA12878_HG001_HiSeq_300x to build material for a training course. The file has a .bai index next to it, and the command below runs. The 30x downsampled data available there is a bit too small for my purpose.

samtools view -b -h ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.hs37d5.300x.bam 22:0-50818468 > HG001.hs37d5.300x_chr22ss.bam

The result is not reproducible: depending on the attempt I get a file of 5.6 to 5.9 GB, which is too small to be the whole 300x chr22 subset (2% of 550 GB should be more like 11 GB). The records are OK and all from '22', but I fear they are only the first part of the real data. I suspect that a timeout occurs somewhere. I also tried piping curl into samtools, but that fails because the .bai index cannot be accessed that way.
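
Rather than judging completeness from file size, one option is to read the per-contig read counts from the index. The sketch below assumes that `samtools idxstats` can read the remote BAM and its .bai over FTP the same way `samtools view` does, which has not been verified for this link. `contig_share` is a hypothetical helper name. It parses idxstats output (contig, length, mapped, unmapped) from stdin and prints the read count for one contig followed by its share of all reads.

```shell
# Hypothetical helper: read `samtools idxstats` output on stdin and print
# "<reads on contig> <fraction of all reads>" for the contig named in $1.
contig_share() {
  awk -F'\t' -v c="$1" '
    { total += $3 + $4 }                 # mapped + unmapped, all contigs
    $1 == c { hit = $3 + $4 }            # reads placed on the requested contig
    END { if (total > 0) printf "%.0f %.4f\n", hit, hit / total }
  '
}

# Intended use (assumption: idxstats works on the remote file):
#   samtools idxstats "$URL" | contig_share 22
```

The count should be roughly comparable to `samtools view -c` run on the downloaded subset, and the fraction multiplied by 550 GB gives a rough size expectation.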

Can someone confirm whether there is a problem with samtools on that link, or whether this is OK? Is there any alternative to downloading the full 550 GB and subsetting locally?
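
A possible way to narrow down where a remote extraction goes wrong is to query the region in fixed-size windows and compare the per-window read counts between attempts. The sketch below only generates the window strings; `make_windows` is a hypothetical helper, and the 5 Mb step in the usage comment is an arbitrary choice. Note that `samtools view` returns every read overlapping a region, so reads spanning a window boundary are reported in both neighbouring windows. This makes the windows suitable for counting and comparing, but not for concatenating into a final BAM without removing the duplicates.

```shell
# Hypothetical helper: print samtools region strings covering
# contig:start-end in consecutive windows of at most `step` bases.
make_windows() {
  contig=$1; start=$2; end=$3; step=$4
  while [ "$start" -le "$end" ]; do
    stop=$((start + step - 1))
    if [ "$stop" -gt "$end" ]; then stop=$end; fi
    printf '%s:%s-%s\n' "$contig" "$start" "$stop"
    start=$((stop + 1))
  done
}

# Intended use, once per attempt, then diff the two logs:
#   for region in $(make_windows 22 1 "$CHR22_LEN" 5000000); do
#     printf '%s\t%s\n' "$region" "$(samtools view -c "$URL" "$region")"
#   done > attempt1.counts
```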

Thanks

gabeng commented 5 years ago

I am sure you have solved this issue by now. chr22p is not accessible to sequencing.

samtools view -c -b -h ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.hs37d5.300x.bam 22:1-15000000

returns zero.

chunlinxiao commented 5 years ago

Basically, the region 22:1-15000000 you specified falls within the assembly gaps.
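
To see where the data on a contig actually begins and ends, the first and last alignment positions can be pulled from the record stream. This is a minimal sketch; `pos_span` is a hypothetical helper that reads SAM text on stdin and relies only on POS being the fourth column and header lines starting with `@`. A first position well beyond the gap is expected, while a last position far short of the contig end would suggest a truncated extraction.

```shell
# Hypothetical helper: print "<first POS> <last POS>" for SAM records on stdin.
pos_span() {
  awk -F'\t' '
    /^@/ { next }                        # skip header lines
    first == "" { first = $4 }           # POS of the first record
    { last = $4 }                        # POS of the most recent record
    END { if (first != "") print first, last }
  '
}

# Intended use on a downloaded subset:
#   samtools view HG001.hs37d5.300x_chr22ss.bam | pos_span
```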

chunlinxiao commented 5 years ago

https://www.ncbi.nlm.nih.gov/nuccore/NC_000022.10

gabeng commented 5 years ago

I know. I was wondering if @splaisan knew. This is probably the reason why his file size estimate was off.

splaisan commented 5 years ago

Thanks Ben, but that is not the reason I do not get the right data size: there is indeed no sequence data in the short arm, which consists entirely of N's. The weird issue was that I did not consistently get the same output file between attempts, as if some packets were randomly lost in transit or something went wrong with the extraction over the internet. In the meantime I downloaded the full file and extracted the subset locally. Thanks for your kind help. Best
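
For anyone comparing attempts in the same situation, a quick check is to digest the decoded records of each output rather than comparing file sizes. The sketch below uses the POSIX `cksum` utility; `record_digest` is a hypothetical helper, and the file names in the usage comment are placeholders. Running `samtools view` without `-h` keeps header differences out of the comparison.

```shell
# Hypothetical helper: reduce a stream on stdin to "<checksum>:<byte count>".
record_digest() {
  cksum | awk '{ print $1 ":" $2 }'
}

# Intended use: identical digests mean identical records.
#   samtools view attempt1.bam | record_digest
#   samtools view attempt2.bam | record_digest
```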