dnil opened this issue 2 months ago
@robertaboukhalil I believe CRAM is pretty difficult for us to support, did I recall that correctly? Was it something about how samtools assumes certain things about the reference that are hard to replicate in a web environment?
I had trouble with CRAM support in the past because samtools kept trying to download the reference from ebi.ac.uk despite my telling it to use a local FASTA file. That was a couple of years ago with samtools 1.10, so it's worth another try with a newer version.
@dnil Can you share a small CRAM file with a reference FASTA we can test with?
Sure, I could downsample or make a small extract, I guess, but GitHub doesn't allow exchanging very large files. Please suggest an upload destination/method and I'll try to get you something! 😊
Or perhaps easier just to grab a NIST or 1000G one?
E.g. here:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/GBR/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram
According to their alignment README it should be on GRCh38DH, i.e. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
Thanks @dnil, that's really useful!
I tried it with those files and I see a similar issue: it first downloads ~2 MB from the CRAM file but then tries to download the entire FASTA file from the URL. This is even though it only needs a small part of the FASTA, the FASTA is indexed, and I explicitly specify the index path with -t or ##idx## 🤔
We might have better luck with something like cram.js?
<script src="https://biowasm.com/cdn/v3/aioli.js"></script>
<script type="module">
const CLI = await new Aioli([{ tool: "samtools", version: "1.17" }]);
const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
]);
const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
// const output = await CLI.exec(`samtools view -T ${path_fa}##idx##${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output);
</script>
Thank you for trying! Cool wasm system! I wonder if samtools was trying to build a REF_CACHE, and whether one could feed it one over https? It kind of sounds like it, but there are no direct examples on https://www.htslib.org/workflow/cram.html.
From this page: "If no REF_PATH is defined, both REF_PATH and REF_CACHE will be automatically set (see above), but if REF_PATH is defined and REF_CACHE not then no local cache is used."
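If I read the workflow page right, each CRAM slice stores an M5 tag (the MD5 of the reference sequence), and REF_PATH/REF_CACHE entries are templates expanded against that digest. A minimal sketch of that expansion, assuming the usual %2s/%2s/%s layout (the digest below is made up):
// Sketch only: expand an htslib-style REF_PATH/REF_CACHE template against an M5 digest.
// %Ns consumes N hex characters of the MD5; a bare %s takes whatever is left.
function expandRefTemplate(template, md5) {
  let used = 0;
  return template.replace(/%(\d*)s/g, (_, n) => {
    const take = n ? parseInt(n, 10) : md5.length - used;
    const piece = md5.slice(used, used + take);
    used += take;
    return piece;
  });
}

// Made-up digest, purely for illustration
console.log(expandRefTemplate("/refs/cache/%2s/%2s/%s", "0123456789abcdef0123456789abcdef"));
// -> /refs/cache/01/23/456789abcdef0123456789abcdef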
I tried setting REF_PATH but it's still ignoring it and downloading the whole file 🤷‍♂️
Yes, I confirm the same on the js console/Safari Sources. Too bad!
When testing the same command as above, also with the same URLs, with samtools 1.21 or 1.18 from a local shell, they finish far too quickly to be downloading any large files (screenshots below). I'm sure there are all kinds of intricacies in how samtools/htslib decides whether to range-request or not. I feel I know too little about the Aioli mounts and htslib to understand the issue at this point. 😞
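For what it's worth, the lazy behaviour we want from the mounts boils down to plain HTTP range requests, so a quick sanity check that the host serves partial content (expect a 206 status) can be done outside samtools entirely:
// Not Aioli internals, just a sanity check that the FASTA host honours Range requests.
const faUrl = "https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa";
const res = await fetch(faUrl, { headers: { Range: "bytes=0-1023" } });
console.log(res.status, res.headers.get("content-range")); // 206 + "bytes 0-1023/..."
console.log((await res.text()).slice(0, 40)); // should start with the first FASTA header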
A faint hope is that this behaviour was specific to 1.17? Is wasm-compiling a newer samtools version a lot of work? From the release notes, htslib did at least some reworking of CRAM access when they fixed the CRAM bugs in the versions leading up to 1.18. Those bugs alone are probably reason enough to use a later version if CRAM access is intended.
time samtools view -T https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa -t https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram chr17:7,667,421-7,688,490
Unfortunately using the same example with the biowasm samtools 1.21 candidate did not solve the issue out of the box.
I am, however, still curious to test the same thing with a local mirror holding the REF_PATH md5s. As far as I understand, that is roughly how it is supposed to work for remote references (e.g. https://github.com/samtools/samtools/issues/837), so it would be a more proper test. We would not really see attempts to download from EBI (http://www.ebi.ac.uk/ena/cram/md5) anyway; if I understand this correctly, we should not be able to reach it from within the assembled module without "mounting" it.
I made a silly test, mounting a REF_PATH directory containing the md5s for the GRCh38 reference, but again I don't know enough about Aioli/Emscripten to set it properly. I naively tried CLI.preRun = () => { ENV.REF_PATH = ref_path }; between mounting and running the command, which didn't work. How does one do it? Thank you kindly for the support!
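For reference, the plain-Emscripten pattern I was trying to imitate looks roughly like the snippet below (per the Emscripten docs on environment variables); whether and how Aioli exposes the Module object or the ENV global from the main thread is exactly what I don't know:
// Plain Emscripten, not Aioli-specific: preRun hooks may edit ENV before main() runs.
// The REF_PATH value here is a hypothetical mounted cache layout.
Module.preRun = Module.preRun || [];
Module.preRun.push(() => {
  ENV.REF_PATH = "/shared/refs/cache/%2s/%2s/%s";
});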
This is what I tried previously:
const CLI = await new Aioli([
"samtools/1.17",
"coreutils/env/8.32"
]);
const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
]);
console.log(await CLI.exec("env REF_PATH=/data/shared"))
console.log(await CLI.exec("env"))
const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output);
but it still downloads the whole file
I see! Nice with the env tool, but how did we populate /data/shared with the output from the samtools Perl script? That should be roughly 3 GB of reference genome data? One way or another we are trying to emulate the behaviour of http://www.ebi.ac.uk/ena/cram/md5. The common way to avoid hammering them when running large-scale jobs is to first run the samtools script seq_cache_populate.pl locally, which splits the FASTA reference into convenient hash-keyed chunks, similar to the ENA service but by default in subfolders instead of a full URL lookup.
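To make that layout concrete, here is a rough node-flavoured sketch of what seq_cache_populate.pl produces (one file per sequence, keyed by the MD5 of the uppercased, whitespace-stripped sequence); it naively reads the whole FASTA into memory, so for the real 3 GB reference you would stream it or just use the Perl script:
// Rough sketch of seq_cache_populate.pl's output layout: <root>/xx/yy/<remaining md5>.
import { createHash } from "node:crypto";
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

function populateCache(fastaPath, root) {
  const records = readFileSync(fastaPath, "utf8").split(">").slice(1);
  for (const record of records) {
    const [header, ...lines] = record.split("\n");
    const seq = lines.join("").toUpperCase().replace(/\s+/g, "");
    const md5 = createHash("md5").update(seq).digest("hex");
    const dir = join(root, md5.slice(0, 2), md5.slice(2, 4));
    mkdirSync(dir, { recursive: true });
    writeFileSync(join(dir, md5.slice(4)), seq);
    console.log(header.split(/\s+/)[0], "->", join(dir, md5.slice(4)));
  }
}

populateCache("GRCh38_full_analysis_set_plus_decoy_hla.fa", "./refs/cache");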
Just mounting the URL as a directory is probably too naive, but I tested it anyway (below). Is there a way to mount dynamic endpoint URLs, or to crawl a URL and mount every encountered endpoint as a file?
<script src="https://biowasm.com/cdn/v3/aioli.js"></script>
<script type="module">
const CLI = await new Aioli(["coreutils/env/8.32", "coreutils/ls/8.32", {
tool: "samtools",
version: "1.21",
urlPrefix: "http://localhost:80/biowasm/build/samtools/1.21",
}]);
const [path_cram, path_crai, path_fa, path_fai, ref_path] = await CLI.mount([
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
"http://localhost/refs/cache"
]);
console.log(await CLI.exec(`env REF_PATH=${ref_path}/%2s/%2s/%s`))
console.log(await CLI.exec("env"))
console.log(await CLI.exec(`ls ${ref_path}/`))
console.log(await CLI.exec(`ls ${ref_path}/00`))
const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output);
document.getElementById("output").innerHTML = output;
</script>
<h4>Output of <code>samtools view cram file</code>:</h4>
<pre id="output">Loading...</pre>
Hmm, why do we need to populate that folder? According to this page, "if REF_PATH is defined and REF_CACHE not then no local cache is used". By setting REF_PATH, we're disabling the cache, so that shouldn't be the problem, right? The issue I'm seeing is that samtools is reading through the entire FASTA file for some reason, instead of using a small subset of it 🤔
Oh oops... the problem is that I'm using the wrong .fai URL 🤦
This works just fine:
// Initialize Aioli with samtools and coreutils env
const CLI = await new Aioli([
"samtools/1.17",
"coreutils/env/8.32"
], { debug: true });
const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai",
]);
console.log(await CLI.exec("env REF_PATH=/shared/data/"))
console.log(await CLI.exec("env"))
const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output);
And it only downloads a small subset of the files!
I guess because I used the wrong FAI URL, samtools was rebuilding the index by downloading the whole FASTA file 🤦 🤦 🤦
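For context, with a valid .fai samtools can seek directly: each index line stores the sequence name, length, byte offset, bases per line, and bytes per line, so a position maps to a byte offset without scanning the file. Roughly (values below are made up):
// How a .fai entry turns a 0-based position into a byte offset in the FASTA.
function faiByteOffset(entry, pos0) {
  return entry.offset
    + Math.floor(pos0 / entry.linebases) * entry.linewidth
    + (pos0 % entry.linebases);
}

// Hypothetical index entry, not the real values for chr17
const chr17 = { length: 83257441, offset: 2000000000, linebases: 60, linewidth: 61 };
console.log(faiByteOffset(chr17, 7667420)); // start of chr17:7,667,421 (1-based)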
This is sweet; it should be enough for a start. I think it will be useful to test and merge the samtools 1.21 PR on biowasm anyway, to get the htslib CRAM fixes made after 1.17.
I imagined that for a "production" version one would also want to solve the REF_CACHE issue, but as long as the user does not inspect too many loci it should be fine; the references are certainly smaller than the read piles. If it becomes an issue, it should be solvable with some form of locally shared copy of that perl-script-generated cache, or a remote cache that is not the default EBI one.
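If we do go that route, my understanding is that REF_PATH takes a colon-separated list and that entries can be URLs with the same %s expansion, so a self-hosted md5-keyed cache could in principle stand in for the EBI endpoint; hosts and paths below are made up:
// Hypothetical REF_PATH mixing an (Aioli-mounted) local cache with a self-hosted web cache.
const REF_PATH = [
  "/shared/refs/cache/%2s/%2s/%s",        // local seq_cache_populate.pl-style layout
  "https://refcache.example.org/md5/%s",  // made-up replacement for the EBI md5 service
].join(":");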
So, how do we move forward? I could make a PR to start it off perhaps, but I have only skimmed the codebase so I am likely to need some help sooner or later.
Yes, I'll look at merging the samtools 1.21 version; there's an unexpected build error on previous versions, so I'll look into that more closely.
I imagined for a "production" version one would want to solve the REF_CACHE issue as well, but as long as the user does not inspect too many loci it should be fine
The browser should already cache requests when downloading subsets of FASTA files, so that shouldn't be an issue.
So, how do we move forward? I could make a PR to start it off perhaps
We don't have a lot of bandwidth right now to work on Ribbon, though we might later in the year. So feel free to work on it but can't guarantee we'll be able to help and review it in a timely manner :)
Hi, very nice tool - well done! Any plans for supporting CRAM files? It looked like you are using samtools for handling alignments, but one would then presumably need to pass along a reference genome file/URL. Cheers!