Closed cariaso closed 3 years ago
Hi @cariaso, thanks for your input. So far I haven't been able to get it to work with a remote file such as ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239454/NA19239.final.cram . Does this require a custom build of pysam?
I did most of my own testing with a different cram file in a private bucket, which worked fine.
I seem to recall that a test with that ftp file may also have failed for me, but I dismissed it as probably being a problem with that file.
I've done most of my work with this in a Dockerfile, shown below (although I think autoheader and autoconf may not be needed).
This should clarify the steps for building samtools and pysam. Whether there is an issue with that particular cram remains to be seen.
FROM amazonlinux
ENV PYCURL_SSL_LIBRARY=openssl
ENV PYTHON_INSTALL_LAYOUT=
RUN yum -y install gcc git file git python3 python3-devel openssl-devel zlib-devel
RUN yum -y install make tar wget openssl-devel autoheader autoconf
RUN yum -y install zlib zlib-devel bzip2 bzip2-devel curl libcurl libcurl-devel xz xz-devel
RUN wget -q https://github.com/samtools/samtools/releases/download/1.11/samtools-1.11.tar.bz2 -O - | tar jxf -
RUN bash -c 'cd /samtools-1.11 && ./configure --without-curses && make all all-htslib && make install install-htslib'
RUN pip3 install --no-binary :all: pysam
Mike Cariaso http://www.cariaso.com
I was only able to get the remote URL to work for files on Amazon S3, not the ftp ones. In addition, it only worked without multiprocessing (I had to make a few code changes to achieve that, included in #8). Will revisit and improve in the future.
I personally only have use for S3, so no skin off my back. Just wanted to advertise what else should be possible. Thrilled to see this adopted. Glad to see a bit of my pull request made it into the patch.
Thank you for your input @cariaso!
samtools 1.11 has some very useful features related to URLs and index files. One of the few mentions I see of them is at https://github.com/samtools/samtools/blob/4fe33221082adceedfdbf525ced54c1a0883998c/NEWS#L63 , but that fails to capture the scope of the enhancements.
One of the simplest and most immediately valuable changes allows
to be replaced by
which for large files is a HUGE benefit. This also works for ftp, http, https, s3 and gs style URLs, which is a big win when working inside the Amazon or Google clouds. Do be aware that samtools doesn't yet benefit from IAM instance roles, so you may need to rely on ~/.aws/credentials or environment variables to specify credentials for non-public URLs. See http://www.htslib.org/doc/htslib-s3-plugin.html for more details.
For cram files this also allows the reference genome to be determined at runtime and loaded dynamically (or from cache for performance).
Another benefit is when the index file is at a non-obvious location. This can now be communicated in a backwards compatible way by replacing a simple url with one that includes a '##idx##' delimiter.
samtools view 'https://example.com/path1/NA19239.final.cram##idx##https://example.com/path2/NA19239.final.cram.crai' chr22:1000000-1023000
Using these improvements with Cyrius necessitates passing the index location to pysam. This pull request achieves all of this in a fully backwards compatible way.
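For anyone following along, here is a rough sketch of how a manifest entry containing the '##idx##' delimiter could be split and handed to pysam. It is not the code from this pull request: the helper names `split_index` and `open_alignment` are made up for illustration, and it assumes a pysam build whose `AlignmentFile` accepts the `index_filename` and `reference_filename` keyword arguments.

```python
IDX_DELIMITER = "##idx##"


def split_index(entry):
    """Split a manifest entry into (data_path, index_path).

    index_path is None when the entry has no '##idx##' part, in which
    case the index is expected at the default location next to the data.
    """
    entry = entry.strip()
    if IDX_DELIMITER in entry:
        data, index = entry.split(IDX_DELIMITER, 1)
        return data, index
    return entry, None


def open_alignment(entry, reference_fasta=None):
    """Hypothetical helper: open a BAM/CRAM with an optional explicit index."""
    # Imported here so split_index stays usable without pysam installed.
    import pysam

    data, index = split_index(entry)
    return pysam.AlignmentFile(
        data,
        index_filename=index,                # None falls back to the default index
        reference_filename=reference_fasta,  # optional reference FASTA for CRAM
    )
```

Entries without the delimiter pass through unchanged, which is what keeps old manifests working.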
All of these lines are now valid manifest entries
which would assume the index is at /home/alice/crams/NA19239.final.cram.crai
or
as well as remote url equivalents such as
mixed remote and local
The weakest aspect of this pull request is the use of a fairly simple check to determine if the file is local or remote. Currently this patch just checks for the existence of '://', but could easily be expanded to use techniques from https://stackoverflow.com/questions/22238090/validating-urls-in-python/22238205 or similar.
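As one possible direction, the '://' check could be swapped for a scheme check built on the standard library. This is only a sketch, not what the patch currently does: `is_remote` and `REMOTE_SCHEMES` are illustrative names, and the scheme list is simply the set of URL styles mentioned above.

```python
from urllib.parse import urlparse

# URL styles mentioned earlier in this thread; adjust as needed.
REMOTE_SCHEMES = {"ftp", "http", "https", "s3", "gs"}


def is_remote(path):
    """Return True when path starts with one of the known remote URL schemes."""
    scheme = urlparse(path).scheme.lower()
    return scheme in REMOTE_SCHEMES
```

Compared with a bare '://' test, this treats file:// URLs and Windows drive letters as local, and only accepts schemes that are explicitly listed.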