bioturing / hera-t

Hera-T, a fast and accurate tool for estimating gene abundances in single cell data generated by the 10X-Chromium protocol
https://bioturing.com/herat

Reference building for single-nuclei RNA-seq assay #18

Open harsh-shukla opened 4 years ago

harsh-shukla commented 4 years ago

Hi, I have been meaning to try this tool for some time now. We do a lot of single-nuclei RNA-seq, and as a result we end up capturing a lot of reads from pre-mRNA. To deal with that, 10X has a slightly different method (a hack) for creating references, in which the entire transcript is considered for mapping and not just the exons. Please refer to https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/references#premrna

When I tried to use the pre-mRNA GTF generated by the steps detailed in the above link (we previously used this GTF to run the Cell Ranger pipeline), I ran into some errors.

Fortunately, after wrestling with a few errors initially, I did get hera-t index to run.
I finally ended up writing a small Python script that creates a GTF in which each transcript has a single exon (that is what the 10X hack essentially does).
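
For anyone attempting the same workaround, a script of this kind might look like the minimal sketch below. It is not the script mentioned above: the function name `single_exon_gtf` and the `keep_transcript_row` option are hypothetical, and it assumes a standard nine-column, tab-separated GTF in which every transcript has its own `transcript` row. Whether hera-t index also needs the `gene` or `transcript` rows is not established in this thread, so the sketch drops `gene` rows and makes the `transcript` row optional.

```python
def single_exon_gtf(lines, keep_transcript_row=True):
    """Yield GTF lines in which each transcript is covered by one exon.

    Hypothetical helper: every `transcript` row is re-emitted as an `exon`
    row with the same coordinates and attributes. The original exon, CDS,
    UTR and gene rows are dropped.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Pass header/comment lines through unchanged.
            yield line
            continue
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue
        if keep_transcript_row:
            yield line
        fields[2] = "exon"  # same span, relabelled as a single exon
        yield "\t".join(fields)


# Tiny in-memory example (made-up coordinates and IDs).
sample = [
    "#!genome-build example\n",
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tsrc\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
for out_line in single_exon_gtf(sample):
    print(out_line)
```

To process a real file, the same generator can be fed an open file handle, for example `single_exon_gtf(open("genes.gtf"))`, where `genes.gtf` is a placeholder name.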

hera-t index now runs and starts writing entire transcript sequences to the _indexname.fasta file, but somewhere along the way it throws a segmentation fault.

P.S.: I think the segmentation fault stems from a memory issue. Across several runs the fault always occurred at the same transcript, so I deleted that transcript record. The segmentation fault still persists, but it now happens at the next comparatively long transcript record, while the few short transcripts in between are written to the FASTA file.
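
One way to check whether the crash really tracks transcript length is to rank the transcripts in the GTF by genomic span and compare the top of that list with the records at which indexing stops. The sketch below is only an illustration: `transcript_lengths` is a hypothetical helper, and it assumes nine-column GTF rows whose attribute column carries a `transcript_id "..."` entry.

```python
import re

# Matches the transcript_id attribute in column 9 of a GTF row.
_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_lengths(lines):
    """Return (transcript_id, span_in_bases) pairs, longest first."""
    lengths = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue
        match = _TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        start, end = int(fields[3]), int(fields[4])
        lengths.append((match.group(1), end - start + 1))
    lengths.sort(key=lambda item: item[1], reverse=True)
    return lengths


# Made-up rows for illustration only.
rows = [
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tsrc\ttranscript\t1000\t1050\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n',
    'chr2\tsrc\ttranscript\t1\t5000\t.\t-\t.\tgene_id "G3"; transcript_id "T3";\n',
]
for transcript_id, length in transcript_lengths(rows):
    print(transcript_id, length)
```

If the transcripts at which the fault occurs sit consistently near the top of this ranking, that would support the length-related explanation; it would not by itself identify the cause.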

Any solution to this issue?

Best, Harsh

thangtq139 commented 4 years ago

Hi @harsh-shukla,

It is likely that the current version of Hera-T cannot index a "transcriptome" that contains pre-mRNA sequences, because they index the transcriptome using a hash table, which may be limited by the total size of the reference sequences. Hopefully, in your case, the cause is instead some other bug in the indexing algorithm that the developers can fix ASAP.
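
To get a rough sense of how much larger a pre-mRNA reference is than a spliced one, the total number of bases can be compared directly from the original (exon-annotated) GTF. This is a sketch under assumptions: `reference_sizes` is a hypothetical helper, it expects nine-column GTF rows with both `transcript` and `exon` features, and it says nothing about where any actual limit in Hera-T lies.

```python
def reference_sizes(lines):
    """Return (spliced_bases, unspliced_bases) summed over all transcripts.

    spliced_bases   : sum of exon lengths (exons-only reference)
    unspliced_bases : sum of transcript spans (pre-mRNA style reference)
    """
    spliced = 0
    unspliced = 0
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        # GTF coordinates are 1-based and inclusive.
        length = int(fields[4]) - int(fields[3]) + 1
        if fields[2] == "exon":
            spliced += length
        elif fields[2] == "transcript":
            unspliced += length
    return spliced, unspliced


# Made-up annotation: one transcript with two short exons and a long intron.
annotation = [
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tsrc\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
spliced_bases, unspliced_bases = reference_sizes(annotation)
print("spliced:", spliced_bases, "unspliced:", unspliced_bases)
```

Run on a real annotation, the ratio of the two totals indicates how much more sequence the index has to hold once introns are included.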

I see that they use the BWT to index the genome, so hopefully, in upcoming releases, Hera-T will support a better way of counting UMIs in intronic regions.

Cheers, Thang

harsh-shukla commented 4 years ago

Hi @thangtq139

Thanks for all the info. I am still waiting to hear back from the authors. Hopefully it's a minor bug.

Best, Harsh