NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

[Idea] Experiment with indexing transcriptomics data on NeMO #73

Open rly opened 4 months ago

rly commented 4 months ago

Transcriptomics data on the NeMO archive are often stored as ASCII text files (FASTQ, FASTA, MEX) that are sometimes tarballed and sometimes gzipped. I have also found tarballed BAM files (binary).

You can index the files in a tarball by byte range using the tar headers: each member is preceded by a header that records its name and size. Supposedly you can also index gzipped files and decompress byte ranges from them.
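
To make the tarball part concrete, here is a minimal sketch using Python's standard `tarfile` module. It assumes an uncompressed tarball; the function name `build_tar_index`, the member names, and the payloads are made up for illustration and are not taken from NeMO or LINDI. It records where each member's data starts and how long it is, then reads one member back by slicing the raw bytes without going through `tarfile` again.

```python
import io
import json
import tarfile


def build_tar_index(fileobj):
    """Map each regular file in an uncompressed tar to its data offset and size."""
    index = {}
    # mode="r:" opens the archive strictly without compression
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = {
                    "offset": member.offset_data,  # first byte of the member's data
                    "size": member.size,           # data length in bytes
                }
    return index


# Build a tiny in-memory tarball so the example is self-contained (hypothetical contents).
payloads = [
    ("reads.fastq", b"@r1\nACGT\n+\nIIII\n"),
    ("genes.tsv", b"g1\tGeneA\n"),
]
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in payloads:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
index = build_tar_index(buf)

# Read one member straight from the raw bytes using only the index.
raw = buf.getvalue()
entry = index["genes.tsv"]
chunk = raw[entry["offset"]:entry["offset"] + entry["size"]]

print(json.dumps(index, indent=2))
print(chunk)
```

This only works directly when the tarball itself is not compressed; a gzipped tarball would need a separate index into the compressed stream first.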

Example BICCN data:
https://data.nemoarchive.org/biccn/grant/u01_lein/lein/transcriptome/sncell/10x_v3/
https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v2/human/processed/CellRanger5/
https://data.nemoarchive.org/biccn/grant/u19_huang/arlotta/transcriptome/sncell/10x_v2/mouse/processed/align/

Some of these data files can be very large, and a user may want to access only particular elements of a file without downloading all of it. I wonder if we can use LINDI to create an efficient JSON index of specific data elements within a NeMO-hosted dataset, for both streaming and local access. This is just an idea for now, as we brainstorm for the grant proposal.
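
As a rough illustration of what one entry in such an index could look like, the sketch below turns a (URL, offset, size) record into an HTTP range request. The entry layout, the `example.org` URL, and the helper name `range_request` are all hypothetical; this is not LINDI's actual file format, and it assumes the hosting server honors `Range` headers. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def range_request(url, offset, size):
    """Build (but do not send) a request for `size` bytes starting at `offset`."""
    last = offset + size - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Hypothetical index entry: where one data element lives inside a remote file.
entry = {"url": "https://example.org/archive.tar", "offset": 1536, "size": 9}

req = range_request(**entry)
print(json.dumps(entry))
print(req.get_header("Range"))
```

Passing `req` to `urllib.request.urlopen` would fetch just those bytes from a server that supports range requests, which is what would let a client stream one element without downloading the whole file.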

BDBags can be used to index and download particular files of a dataset, but I don't know whether this works within a tarball or within a FASTQ file.