MariaNattestad / Ribbon

A genome browser designed for complex structural variants and long reads.
https://genomeribbon.com
MIT License

cram file support? #123

Open dnil opened 1 week ago

dnil commented 1 week ago

Hi, very nice tool - well done! Any plans for supporting cram files? It looked like you are using samtools for handling alignments, but one would then presumably need to pass along a reference genome file/URL. Cheers!

MariaNattestad commented 1 week ago

@robertaboukhalil I believe CRAM is pretty difficult for us to support, am I recalling that correctly? Was it something about how samtools assumes certain things about the reference that are hard to replicate in a web environment?

robertaboukhalil commented 1 week ago

I had trouble with CRAM support in the past because samtools kept trying to download the reference from ebi.ac.uk despite me trying to tell it to use a local FASTA file. That was a couple of years ago with samtools 1.10, so it's worth another try with a newer version.

robertaboukhalil commented 1 week ago

@dnil Can you share a small CRAM file with a reference FASTA we can test with?

dnil commented 1 week ago

Sure, I could downsample or make a small extract I guess, but GitHub doesn't allow exchanging very large files. Please suggest an upload destination/mode and I'll try to give you something! 😊

Or perhaps easier just to grab a NIST or 1000G one? E.g. here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/GBR/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram

According to their alignment README, it should be aligned to GRCh38DH, i.e. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

robertaboukhalil commented 1 week ago

Thanks @dnil, that's really useful!

I tried it with those files and I see a similar issue: it first downloads ~2MB from the CRAM file but then tries to download the entire FASTA file from the URL. This is even though it only needs a small part of the FASTA, the FASTA is indexed, and I explicitly specify the index path with -t or ##idx## 🤔
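
To make the "only needs a small part of the FASTA" point concrete: a .fai row records, for each sequence, the byte offset of its first base plus the number of bases and bytes per line, so the byte range for a region is plain arithmetic. Below is a minimal sketch, assuming the standard five-column row written by `samtools faidx`. The function names and the index row are made up for illustration, and this is not a claim about what samtools does internally.

```javascript
// Parse one .fai row: name, length, offset, linebases, linewidth (tab-separated).
function parseFaiLine(line) {
    const [name, length, offset, lineBases, lineWidth] = line.split("\t");
    return {
        name,
        length: Number(length),
        offset: Number(offset),       // byte position of the first base
        lineBases: Number(lineBases), // bases per FASTA line
        lineWidth: Number(lineWidth), // bytes per FASTA line, newline included
    };
}

// Byte position in the FASTA file of a 0-based base position.
function byteOffset(entry, pos0) {
    return entry.offset
        + Math.floor(pos0 / entry.lineBases) * entry.lineWidth
        + (pos0 % entry.lineBases);
}

// Inclusive byte range covering the 1-based, inclusive region start..end.
function faiByteRange(entry, start, end) {
    if (start < 1 || end < start || end > entry.length) {
        throw new RangeError(`region ${start}-${end} is outside ${entry.name}`);
    }
    return { first: byteOffset(entry, start - 1), last: byteOffset(entry, end - 1) };
}

// Hypothetical index row, NOT the real GRCh38 values.
const entry = parseFaiLine("chrDemo\t1000\t9\t60\t61");
const { first, last } = faiByteRange(entry, 121, 190);
console.log(`Range: bytes=${first}-${last}`); // prints "Range: bytes=131-201"
```

The resulting pair could go into an HTTP `Range` header, provided the server honors range requests; the 71 bytes returned for 70 bases include one newline that would need stripping.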

We might have better luck with something like cram.js?

<script src="https://biowasm.com/cdn/v3/aioli.js"></script>
<script type="module">
    const CLI = await new Aioli([{ tool: "samtools", version: "1.17" }]);
    const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
        "https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
        "https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
        "https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
        "https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
    ]);

    const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
    // const output = await CLI.exec(`samtools view -T ${path_fa}##idx##${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
    console.log(output);
</script>

dnil commented 1 week ago

Thank you for trying! Cool wasm system! I wonder if samtools was trying to make a 'REF_CACHE'? And if one could feed it one over https? It kind of sounds like it, but no direct examples on https://www.htslib.org/workflow/cram.html.

robertaboukhalil commented 1 week ago

From this page: "If no REF_PATH is defined, both REF_PATH and REF_CACHE will be automatically set (see above), but if REF_PATH is defined and REF_CACHE not then no local cache is used."

I tried setting REF_PATH but it's still ignoring it and downloading the whole file 🤷‍♂️
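
For reference, the htslib documentation describes REF_PATH and REF_CACHE entries as templates keyed by the reference sequence's MD5 checksum: `%s` is replaced by the MD5, and `%2s` consumes two characters of it. The sketch below illustrates that substitution only. `expandRefTemplate` is a hypothetical helper, the MD5 is a made-up value, and the fallback for a template without a placeholder is an assumption rather than confirmed htslib behavior.

```javascript
// Expand a REF_PATH / REF_CACHE style template for one MD5 checksum.
// "%Ns" consumes N characters of the MD5; "%s" consumes whatever is left.
function expandRefTemplate(template, md5) {
    let rest = md5;
    let sawPlaceholder = false;
    const expanded = template.replace(/%(\d*)s/g, (_match, digits) => {
        sawPlaceholder = true;
        const take = digits === "" ? rest.length : Number(digits);
        const piece = rest.slice(0, take);
        rest = rest.slice(take);
        return piece;
    });
    // Assumption: a plain directory with no placeholder gets "/<md5>" appended.
    return sawPlaceholder ? expanded : `${template}/${md5}`;
}

// Made-up MD5, for illustration only.
const md5 = "0123456789abcdef0123456789abcdef";

console.log(expandRefTemplate("/local/ref_cache/%2s/%2s/%s", md5));
// prints "/local/ref_cache/01/23/456789abcdef0123456789abcdef"

console.log(expandRefTemplate("https://www.ebi.ac.uk/ena/cram/md5/%s", md5));
// prints "https://www.ebi.ac.uk/ena/cram/md5/0123456789abcdef0123456789abcdef"
```

If a lookup like this is what drives the reference fetch, a REF_PATH entry would need to resolve per-sequence MD5s rather than point at one multi-sequence FASTA, which may be worth checking against the observed behavior.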