Transcriptomics data on the NeMO archive are often stored as ASCII text files (FASTQ, FASTA, MEX) that are sometimes tarballed and sometimes gzipped. I have also found tarballed BAM files (binary).
You can index the files in a tarball by byte range using the header that precedes each member in the archive, which records the member's name and size. Supposedly you can also index gzipped files and decompress byte ranges of those as well.
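To make the tarball part concrete, here is a minimal sketch using Python's standard `tarfile` module. It assumes an uncompressed tar (for a gzipped tarball the offsets would refer to the decompressed stream, so they could not be used directly as byte ranges on the stored file). The helper names and the in-memory demo archive are hypothetical and are not taken from NeMO.

```python
import io
import tarfile


def index_tar_members(fileobj):
    """Map each regular file in an uncompressed tar to (data offset, size)."""
    index = {}
    # mode "r:" refuses compressed archives, so the offsets are real file offsets
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member_range(fileobj, index, name, start=0, length=None):
    """Read part of one member by seeking, without unpacking the archive."""
    offset, size = index[name]
    start = min(start, size)
    if length is None or start + length > size:
        length = size - start  # clamp the read to the end of the member
    fileobj.seek(offset + start)
    return fileobj.read(length)


def build_demo_tar():
    """Build a tiny in-memory tar standing in for a real archive (made-up data)."""
    payloads = [
        ("reads.fastq", b"@r1\nACGT\n+\nIIII\n"),
        ("barcodes.tsv", b"AAAC\nAAAG\n"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in payloads:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    archive = build_demo_tar()
    idx = index_tar_members(archive)
    # Pull only the sequence line of the first read
    print(read_member_range(archive, idx, "reads.fastq", start=4, length=4))
```

Building the index still requires one pass over the member headers, but after that each member (or a slice of it) can be fetched with a single seek and read.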
Example BICCN data:
https://data.nemoarchive.org/biccn/grant/u01_lein/lein/transcriptome/sncell/10x_v3/
https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v2/human/processed/CellRanger5/
https://data.nemoarchive.org/biccn/grant/u19_huang/arlotta/transcriptome/sncell/10x_v2/mouse/processed/align/
Some of these data files can be very large, and a user may want to access only particular elements of a file without having to download all of it. I wonder if we can use LINDI to create an efficient JSON index of specific data elements within a NeMO-hosted dataset for streaming and local access. This is just an idea for now, as we brainstorm for the grant proposal.
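As a rough illustration of what such a JSON index could hold, the sketch below stores a byte offset and size per data element and turns an entry into an HTTP `Range` header. The JSON layout, the function names, and the example URL are all hypothetical; this is not LINDI's actual format, and it assumes the hosting server honors range requests.

```python
import json


def make_index(url, members):
    """Build a JSON-serialisable index from {name: (offset, size)}."""
    return {
        "source": url,
        "members": {
            name: {"offset": offset, "size": size}
            for name, (offset, size) in members.items()
        },
    }


def range_header(index, name):
    """Return the HTTP Range header that selects one indexed element."""
    entry = index["members"][name]
    if entry["size"] == 0:
        raise ValueError(f"{name} is empty; there is no byte range to request")
    first = entry["offset"]
    last = first + entry["size"] - 1  # HTTP byte ranges are inclusive
    return {"Range": f"bytes={first}-{last}"}


if __name__ == "__main__":
    # Placeholder URL and offsets, for illustration only
    index = make_index("https://example.org/sample.tar", {"reads.fastq": (512, 16)})
    text = json.dumps(index, indent=2)  # this text is what would be stored or shared
    restored = json.loads(text)
    print(range_header(restored, "reads.fastq"))  # {'Range': 'bytes=512-527'}
```

The index itself stays small because it stores only locations, not data, so it could be distributed separately from the large files it points into.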
BDBags can be used to index and download particular files of a dataset, but I don't know whether this works within a tarball or within a FASTQ file.