Closed by splaisan 5 years ago
I am sure you have solved this issue by now. chr22p is not accessible to sequencing.
samtools view -c -b -h ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.hs37d5.300x.bam 22:1-15000000
returns zero.
Basically, the region 22:1-15000000 you specified falls within the gaps.
I know. I was wondering if @splaisan knew. This is probably the reason why his file size estimate was off.
Thanks Ben, that is not why I was getting the wrong data size, since there is indeed no sequence data in the short arm, which is made entirely of N's. The odd part was that I did not consistently get the same output file between attempts, as if packets were being lost at random in transit or something was going wrong with the extraction over the internet. In the meantime I downloaded the full file and extracted the region locally. Thanks for your kind help. Best
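One way to confirm that two attempts really differ is to compare the output files byte for byte before looking at their contents. The sketch below is only an illustration: `compare_attempts` is a hypothetical helper (not part of samtools), and the file names in the usage comment are assumed.

```shell
# Hypothetical helper: report whether two extraction attempts are byte-identical.
# Relies only on the POSIX `cmp` utility.
compare_attempts() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

# Example usage (file names are assumptions):
#   compare_attempts attempt1.bam attempt2.bam
#   samtools view -c attempt1.bam   # record count of the first attempt
#   samtools view -c attempt2.bam   # record count of the second attempt
```

If the files differ and the record counts differ too, the extraction itself is returning different amounts of data each time, rather than the same data written differently.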
I am trying to get only the chr22 reads from NIST_NA12878_HG001_HiSeq_300x to build training material (the file has a bai index next to it and the command below runs). The 30x downsampled data available there is a bit too small for my purpose.
samtools view -b -h ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/NHGRI_Illumina300X_novoalign_bams/HG001.hs37d5.300x.bam 22:0-50818468 > HG001.hs37d5.300x_chr22ss.bam
The result is not reproducible: I get a file of 5.6 to 5.9GB depending on the attempt, which is too small to be the whole 300x chr22 subset (2% of 550GB should be more like 11GB). The records are OK and all from '22', but I fear they are only the first part of the real data. I have the feeling that a timeout occurs here. I tried curl piped to samtools, but that fails because the bai index cannot be accessed that way.
Can someone confirm whether there is a problem with samtools on that link or whether this is OK? Is there any alternative to downloading the full 550GB and subsetting locally?
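A quick way to test the suspicion that only the first part of the data arrived is to look at the position of the last record in the subset. This is a minimal sketch, assuming the BAM is coordinate-sorted (it has an index, so it should be); `last_position` is a hypothetical helper, not a samtools command.

```shell
# Hypothetical helper: read SAM text on stdin, skip header lines (starting with '@'),
# and print "reference:position" for the last alignment record seen.
last_position() {
    awk -F '\t' '!/^@/ { chrom = $3; pos = $4 } END { if (chrom != "") print chrom ":" pos }'
}

# Example usage on the extracted subset:
#   samtools view HG001.hs37d5.300x_chr22ss.bam | last_position
```

If the reported position is far below the requested end of the region (50818468 in the command above), the stream was most likely cut short before all records were transferred.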
Thanks