alexdobin / STAR

RNA-seq aligner
MIT License

HDF5/each individual transcript molecule info in STARSolo output #1148

Open mbatiuk opened 3 years ago

mbatiuk commented 3 years ago

Hi,

First of all, thanks for making STARSolo.

Is there any way to get HDF5 output from the STARSolo pipeline after mapping/counting 10x droplet data, or any other way to get information on each individual transcript molecule?

HDF5 is one of the standard outputs in 10x cellranger, here is the description of the format: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices

An HDF5 file is needed by certain downstream tools, such as swappedDrops in DropletUtils. Here is the description:

https://github.com/MarioniLab/DropletUtils/issues/59

For example, swappedDrops removes counts that could be artificially generated by sample barcode swapping when single-cell libraries are sequenced on Illumina patterned flow cells. This is a known problem in which reads from one sample in a multiplexed sequencing run can appear as reads from another sample: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6039488/

swappedDrops needs molecule-level information: the UMI, assigned gene and assigned cell for each individual transcript molecule. cellranger provides this in its HDF5 file.

However, the developers of swappedDrops have indicated that HDF5 is not a strict requirement; other file types that provide the molecule info will do.

alexdobin commented 3 years ago

Hi @mbatiuk

Creating HDF5 in the exact 10X format will be somewhat complicated, as their file format description is not very detailed. On the other hand, creating a simple text file that contains the UMI sequence, gene and CB would be relatively easy. In principle, this information can be extracted from the BAM file (GX, CB, UB tags).

Cheers Alex
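
As a rough illustration of the BAM-based extraction mentioned above, here is a minimal Python sketch (not part of STAR, and not the author's method). It assumes the BAM was produced with the CB, UB and GX tags requested via `--outSAMattributes`, that it has been converted to SAM text (for example with `samtools view`), and that "-" is the placeholder written when a tag has no value. The function names `molecule_counts` and `write_table` are hypothetical.

```python
import sys
from collections import Counter

# Assumption: placeholder value written when a read has no CB/UB/GX assignment.
MISSING = "-"


def molecule_counts(sam_lines):
    """Count primary alignments per (cell barcode, UMI, gene) triple.

    sam_lines is any iterable of SAM text lines, e.g. the output of
    `samtools view` on the STARsolo BAM file.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete SAM record
            continue
        if int(fields[1]) & 0x100:  # skip secondary alignments of multimappers
            continue
        # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
        tags = {}
        for field in fields[11:]:
            parts = field.split(":", 2)
            if len(parts) == 3:
                tags[parts[0]] = parts[2]
        key = (
            tags.get("CB", MISSING),
            tags.get("UB", MISSING),
            tags.get("GX", MISSING),
        )
        if MISSING in key or "" in key:  # read not assigned to a cell/UMI/gene
            continue
        counts[key] += 1
    return counts


def write_table(counts, out=sys.stdout):
    """Write one tab-separated row per molecule: CB, UB, GX, read count."""
    out.write("CB\tUB\tGX\treads\n")
    for (cb, ub, gx), n in sorted(counts.items()):
        out.write(f"{cb}\t{ub}\t{gx}\t{n}\n")
```

One possible way to use it is to pipe `samtools view Aligned.sortedByCoord.out.bam` into a small script that calls `write_table(molecule_counts(sys.stdin))`. The resulting table has one row per (cell barcode, UMI, gene) combination with its read count, which is the kind of molecule-level information described earlier in this thread; whether a given downstream tool accepts it directly is not something this sketch addresses.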

ghost commented 2 years ago

Hi @alexdobin, thank you so much for developing this tool and for your continuous support. I am trying to use the matrix files generated by STARsolo to perform clustering. However, as far as I understand, Seurat requires an HDF5 file as input, which is not available in the outputs of STARsolo. Any guidance for a beginner on how to start performing clustering would be highly appreciated. I would prefer a recommendation for a command-line tool that builds on STARsolo outputs, but any information would be great. Thank you so much. Cheers, Yousry

alexdobin commented 2 years ago

Hi Yousry,

I think there is a function to read STARsolo results into Seurat: https://rdrr.io/cran/Seurat/man/ReadSTARsolo.html

Cheers Alex