COMBINE-lab / alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.
https://alevin-fry.readthedocs.io
BSD 3-Clause "New" or "Revised" License
169 stars 15 forks source link

Feature request - Rhapsody whole genome sequences #106

Open stela2502 opened 1 year ago

stela2502 commented 1 year ago

Hi, with the help of Rob I managed to get a Rust BD Rhapsody analysis tool to work (https://github.com/stela2502/Rustody). It supports all versions of BD's beads at the moment, but not the (new) whole genome part.

Rhapsody data consists of a variety of data: Antibody tags that are coded in R2 as well as Sample Tags coded there, too. In a new version they also have a TCR and BCR sequencing options. I have coded the targeted sequence matches using 32bp kmers from the provided fasta files ( only ~ 500 fasta sequences). But I am not very interested to blow this up to whole genome analysis. Hence my question: Is there any way you could implement the BD cell ids in alevin fry? Could be as simple as to use my cellids class. Or could I use your library to map all sequences my tool can not handle? The second option would of cause be my favorite. Like only use your genome representation and the mapper. It would be absolutely epic if you could help me here!

Thank you!

rob-p commented 1 year ago

Hi, with the help of Rob I managed to get a Rust BD Rhapsody analysis tool to work (https://github.com/stela2502/Rustody).

Awesome!

Regarding the actual feature request, could I ask for a bit more clarification? I'm not sure what the requested feature is exactly, or what the requirement would be for alevin-fry to do. Specifically, I think it would be most useful if you could give an idea of what the provided input would be, and what the requested output would be in either of these cases to help us understand the request better. Also, I'm looping in @DongzeHE so he's in the know ;P.

--Rob

stela2502 commented 1 year ago

Hi Rob, what I would dream up would be a way to use likely salmon here https://github.com/stela2502/Rustody/blob/8befa2e774caba0f5037b57ceb23bac7d18bac8d/src/bin/quantify_rhapsody.rs#L348. I have seen that salmon is actually a cpp library. The point in my program is where I have tried to match the R2 read to any gene the tool knows of. As there was no hit it now should look genome wide. And I would like to NOT implement a genome wide search :-D As I know of no Rust library that could help me here I tried to think further with storing my gene data as some kind of Index file, but failed horribly. I am not even able to read the data I wrote before :-(. The test here https://github.com/stela2502/Rustody/blob/4f36750ceaf8068c90813a94182f5f6a0f381d0e/this/src/geneids.rs#L580 fails. I seams that (1) the km (u64) ids from the file are not the same that I wrote and (2) I am also unable to read any gene name back (not utf8 formated). Although my Linux system shows the gene names just fine both with a zcat and vim. I tried yesterday and even asked ChatGPG but could not fix that. Possible you spot my error immediately? If you can please help me. Otherwise I pause for some time now. Would be cool to get a genome wide mapper in Rust, but I do not have the time for something like that at the moment. I have never done that either so it would be quite a mission for me. If you (or anybody else) have interest in that I would be very happy for any help I can get.

stela2502 commented 1 year ago

So to sum the long one up once more: I need to somehow map the R2 read to a genome wide index and do not have the time to implement that as I fail at the most basic stuff. I am no trained informatics guy after al :-( And I would like to utilize whatever alevin-fry uses. I fear this will be complicated as the mapper is coded in cpp (if I am not mistaken). Hence I also think about implementing a genome wide mapping functionality. But I fear that is too complicated for me.

jashapiro commented 1 year ago

I just found this issue when investigating whether we would easily be able to adapt our pipeline to accommodate BD Rhapsody data. It looks like the R1 sequence structure (as described in their doc (pdf) could be accommodated by salmon --bc-geometry for the "Original" version, but the "Enhanced 3'" files have variable positions (the "diversity insert") that I am not sure how we would specify using that flag, if it is possible at all.

rob-p commented 1 year ago

Hi @jashapiro,

Is your use-case for single-cell transcriptomics? We are currently working on a "general" solution to such problems — with increasingly complicated barcoding mechanisms. Currently the simpleaf -> alevin-fry pipeline has somewhat more generic support due to it's ability to specify the geometry with the fragment geometry description language. However, there are even more involved solutions necessary in some cases. Our general purpose approach isn't ready yet, but, in the meantime, might it be possible to use a tool like Interstellar to transform the data into an appropriately "normalized" format prior to processing with existing single-cell tools?

stela2502 commented 1 year ago

Hi, at the end I just (re-)implemented a whole genome enabled mapper. That thing now uses a u16 representation of a 8bp fragment of the read to identify a most likely region in a u16::MAX long vector of 8pb-32bp downstream mappers. This does work on the targeted approach and should be able to scale it up to whole genome. You can look at it here: https://github.com/stela2502/Rustody/blob/new_mapper/this/src/fast_mapper.rs. For the cell barcodes I simply use a partial match and get the highest probability for a sequence to be linking to one cell. Not a full length match as that would also generate some issues with sequencing errors. This allows then for some fuzziness in the matching regions, too. From a first glance at interstellar - are you sure it does implement a way to convert from a variable to a fixed format?

stela2502 commented 1 year ago

I normally see up to 80% PCR duplicates in the data. So I am not sure if thinking about catching each and every read is even worth it. I would not assume that the final counts change in a meaningful way.

colindaven commented 1 year ago

@jashapiro I would be interested in any approach for dealing with variable bases or "diversity inserts" in modern (2023) BD Rhapsody data. The library structure is detailed here as well: https://teichlab.github.io/scg_lib_structs/methods_html/BD_Rhapsody.html

The official CWL pipeline has not been too satisfactory for us.

rob-p commented 1 year ago

cc @noahcape & @Daniel-Liu-c0deb0t: Could this modern BD Rhapsody data be a usecase for seqproc+ANTISEQUENCE? Can we see what would be required to perform this transformation?