Illumina / Cyrius

A tool to genotype CYP2D6 with WGS data

support remote URIs #6

Closed cariaso closed 3 years ago

cariaso commented 3 years ago

samtools 1.11 has some very useful features related to URLs and index files. One of the few mentions I see of them is at https://github.com/samtools/samtools/blob/4fe33221082adceedfdbf525ced54c1a0883998c/NEWS#L63, but that fails to capture the scope of the enhancements.

One of the simplest and most immediately valuable changes allows

wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239454/NA19239.final.cram
samtools view -H NA19239.final.cram

to be replaced by

samtools view -H ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239454/NA19239.final.cram

which is a HUGE benefit for large files. This works for ftp, http, https, s3 and gs style URLs, which is a big win when working inside the Amazon or Google clouds. Be aware that samtools doesn't yet benefit from IAM instance roles, so you may need to rely on ~/.aws/credentials or environment variables to supply credentials for non-public URLs. See http://www.htslib.org/doc/htslib-s3-plugin.html for more details.

For cram files this also allows the reference genome to be determined at runtime and loaded dynamically (or from cache for performance).

Another benefit is when the index file is at a non-obvious location. This can now be communicated in a backwards compatible way by replacing a simple url with one that includes a '##idx##' delimiter.

samtools view 'https://example.com/path1/NA19239.final.cram##idx##https://example.com/path2/NA19239.final.cram.crai' chr22:1000000-1023000

Using these improvements with Cyrius necessitates passing the index location to pysam. This pull request achieves all of this in a fully backwards compatible way.

All of these lines are now valid manifest entries

/home/alice/crams/NA19239.final.cram

which would assume the index is at /home/alice/crams/NA19239.final.cram.crai

or

/home/alice/crams/NA19239.final.cram##idx##/home/alice/indexes/NA19239.final.cram.crai

as well as remote url equivalents such as

ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239454/NA19239.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239454/NA19239.final.cram##idx##ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239454/NA19239.final.cram.crai

mixed remote and local

ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239454/NA19239.final.cram##idx##/home/alice/indexes/mylocal.cram.crai
s3://myprivatebucket/path/to/myfile.cram
s3://myprivatebucket/path/to/myfile.cram##idx##s3://myprivatebucket/index/path/myfile.cram.crai
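As a rough illustration of how manifest lines like the ones above could be taken apart, here is a minimal sketch. The names `ManifestEntry`, `parse_manifest_entry` and `IDX_DELIMITER` are hypothetical and are not taken from the Cyrius code; the sketch only assumes that the data location and the optional index location are separated by '##idx##'.

```python
from typing import NamedTuple, Optional

# Delimiter described above; everything after it is the index location.
IDX_DELIMITER = "##idx##"


class ManifestEntry(NamedTuple):
    """Hypothetical container for one manifest line."""
    data_path: str
    index_path: Optional[str]  # None means "let htslib use its default index location"


def parse_manifest_entry(line: str) -> ManifestEntry:
    """Split a manifest line into the alignment file and its optional index."""
    entry = line.strip()
    # partition() returns the whole string and two empty strings
    # when the delimiter is absent.
    data_path, _sep, index_path = entry.partition(IDX_DELIMITER)
    return ManifestEntry(data_path, index_path or None)


if __name__ == "__main__":
    for line in (
        "/home/alice/crams/NA19239.final.cram",
        "s3://myprivatebucket/path/to/myfile.cram##idx##s3://myprivatebucket/index/path/myfile.cram.crai",
    ):
        print(parse_manifest_entry(line))
    # With pysam installed, the two parts could then be handed over roughly as
    # pysam.AlignmentFile(entry.data_path, index_filename=entry.index_path)
```

Returning None when no index is given leaves the default lookup (for example the .crai next to the CRAM) to htslib rather than guessing the suffix here.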

The weakest aspect of this pull request is the use of a fairly simple check to determine if the file is local or remote. Currently this patch just checks for the existence of '://', but could easily be expanded to use techniques from https://stackoverflow.com/questions/22238090/validating-urls-in-python/22238205 or similar.
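One possible way to tighten the '://' check, shown only as a sketch and not as what the patch does: parse the scheme with the standard library and compare it against the schemes mentioned earlier in this description. The name `is_remote` and the `REMOTE_SCHEMES` set are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Schemes listed earlier in this description; an assumption, adjust as needed.
REMOTE_SCHEMES = frozenset({"ftp", "http", "https", "s3", "gs"})


def is_remote(path: str) -> bool:
    """Return True if path looks like a URL with one of the known remote schemes."""
    # urlparse() lowercases the scheme; plain filesystem paths yield "".
    return urlparse(path).scheme in REMOTE_SCHEMES


if __name__ == "__main__":
    print(is_remote("s3://myprivatebucket/path/to/myfile.cram"))
    print(is_remote("/home/alice/crams/NA19239.final.cram"))
```

Using an explicit list of schemes means a Windows-style path such as C:\data\x.cram, or a file:// URL, is treated as local. An entry containing '##idx##' would need to be split first, with each half checked separately.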

xiao-chen-xc commented 3 years ago

Hi @cariaso thanks for your input. I wasn't able to get it to work so far, for a remote bam like ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239454/NA19239.final.cram . Does this require a custom build of pysam?

cariaso commented 3 years ago

I did most of my own testing with a different cram file in a private bucket, which worked fine.

I seem to recall that a test with that ftp file might also have not worked for me, but I dismissed it as probably being a problem with that file.

I've done most of my work with this in a Dockerfile, which looks as shown below (although I think the autoheader and autoconf may not be needed).

This should clarify the steps for building samtools and pysam. Whether there is an issue with that particular cram remains to be seen.

FROM amazonlinux

ENV PYCURL_SSL_LIBRARY=openssl
ENV PYTHON_INSTALL_LAYOUT=
RUN yum -y install gcc git file git python3 python3-devel openssl-devel zlib-devel
RUN yum -y install make tar wget openssl-devel autoheader autoconf
RUN yum -y install zlib zlib-devel bzip2 bzip2-devel curl libcurl libcurl-devel xz xz-devel
RUN wget -q https://github.com/samtools/samtools/releases/download/1.11/samtools-1.11.tar.bz2 -O - | tar jxf -
RUN bash -c 'cd /samtools-1.11 && ./configure --without-curses && make all all-htslib && make install install-htslib'
RUN pip3 install --no-binary :all: pysam


xiao-chen-xc commented 3 years ago

I was only able to get the remote URL to work for files on Amazon S3, not the ftp ones. In addition, I was only able to get it to work when not using multiprocessing (I had to make a few code changes to achieve that, included in #8). Will revisit and improve in the future.

cariaso commented 3 years ago

I personally only have use for the s3, so no skin off my back. Just wanted to advertise what else should be possible. Thrilled to see this adopted. Glad to see a bit of my pull made it into the patch.

xiao-chen-xc commented 3 years ago

Thank you for your input @cariaso!