BUStools / bustools

Tools for working with BUS files
https://bustools.github.io/
BSD 2-Clause "Simplified" License

Not an issue but a question #78

Open stevenjblair opened 2 years ago

stevenjblair commented 2 years ago

Hello,

First of all, I love bustools.

With that said, I ended up with bustools 0.41.0 on my laptop, and it is lovely. I have no idea how it got there: all of my other machines have an older version that gets the job done but is not the multitool of BUS goodness that I have at hand on my Mac. It runs without Anaconda or a module loader. I would love the tarball if you have it handy.

Anyhow, my question is: do you have documentation for bustools v0.41? I would love to see a little info on some of these new (to me) functions, like umicorrect, clusterhist, and linker, and whether you have a manual like the one for version 0.3x here: https://bustools.github.io/manual

Best regards and keep up the great work! Steve

redst4r commented 2 years ago

Same here. I use bustools a lot in everyday work, but I'm a little unclear on some of the new stuff in v0.41. Some updated documentation would be extremely helpful; I tried to dig through the source to understand some of the newer features, but no luck...

In particular, I'm trying to figure out what bustools count --umi-gene is supposed to do!

Yenaled commented 1 year ago

Hi Steve and @redst4r ,

We apologize for the delay in releasing updated documentation. We are a bit behind on things, and most of the features you describe are for exploratory and advanced use cases and analyses that aren't part of the typical scRNA-seq workflow. The new features were used to produce the analyses in this paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02386-z

You might find umicorrect useful: it tries to account for sequencing errors in UMIs (e.g. if one read's UMI is "slightly off" from another read's UMI due to a sequencing error, we'll correct the UMIs so that they're the same sequence).
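To make the idea of UMI correction concrete, here is a minimal Python sketch. It is not the algorithm bustools umicorrect actually uses (that is not described in this thread); it only illustrates one common approach, assuming that "slightly off" means a Hamming distance of at most 1 and that the more abundant UMI is the true one. The names `hamming`, `correct_umis`, and `max_dist` are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_umis(umis, max_dist=1):
    """Collapse each UMI onto a more abundant UMI within max_dist mismatches.

    Assumption for illustration only: the most frequently observed UMI in a
    neighbourhood is the true sequence, and rarer neighbours are errors.
    """
    counts = Counter(umis)
    # Visit UMIs from most to least abundant; ties are broken alphabetically
    # so the result is deterministic.
    ranked = sorted(counts, key=lambda u: (-counts[u], u))
    kept = []      # UMIs accepted as "true" sequences so far
    mapping = {}   # observed UMI -> corrected UMI
    for umi in ranked:
        parent = next((k for k in kept if hamming(umi, k) <= max_dist), None)
        if parent is None:
            kept.append(umi)
            mapping[umi] = umi
        else:
            mapping[umi] = parent
    return [mapping[u] for u in umis]


# UMIs observed for one barcode; "ACGA" differs from "ACGT" at one position.
observed = ["ACGT", "ACGT", "ACGT", "ACGA", "TTTT"]
corrected = correct_umis(observed)
print(corrected)
```

In this toy input the single ACGA read is folded into ACGT, while TTTT is left alone because it differs from ACGT at three positions.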

--umi-gene is extremely useful, especially when you have short UMIs. Let's say your UMIs are only 4 base pairs long: that means there are only 4^4 (=256) possible UMI sequences. That's not enough to ensure "uniqueness". Indeed, two distinct molecules might end up with the same UMI sequence (e.g. the sequence TCCG might be assigned to molecule A, an RNA molecule that originated from gene X, as well as to molecule B, an RNA molecule that originated from gene Y). If you don't include --umi-gene, the TCCG UMI will not be counted at all (it'll just be tossed out, because bustools count will be unable to figure out why that TCCG UMI sequence belongs to both gene X and gene Y). With --umi-gene, bustools count is able to recognize that we should count that UMI twice: once for gene X and once for gene Y.
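The TCCG example can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not the bustools count implementation: it assumes each read has already been assigned to exactly one gene, whereas real BUS records carry equivalence classes. The function `count_molecules`, its `umi_gene` parameter, and the record layout are hypothetical.

```python
from collections import defaultdict


def count_molecules(records, umi_gene=False):
    """Count molecules per (barcode, gene) from (barcode, umi, gene) records.

    umi_gene=False: a UMI seen with more than one gene in the same cell is
                    discarded as ambiguous.
    umi_gene=True:  the same UMI is counted once for every gene it is seen
                    with, on the assumption that these are distinct molecules
                    that happened to receive the same short UMI.
    """
    genes_by_umi = defaultdict(set)
    for barcode, umi, gene in records:
        genes_by_umi[(barcode, umi)].add(gene)

    counts = defaultdict(int)
    for (barcode, _umi), genes in genes_by_umi.items():
        if len(genes) == 1 or umi_gene:
            for gene in genes:
                counts[(barcode, gene)] += 1
    return dict(counts)


# With 4-base UMIs there are only 4**4 possible sequences, so collisions
# between molecules from different genes are likely.
print(4 ** 4)

records = [
    ("AAAC", "TCCG", "geneX"),  # molecule A
    ("AAAC", "TCCG", "geneY"),  # molecule B, same UMI by chance
    ("AAAC", "GGTA", "geneX"),  # unambiguous UMI
]
print(count_molecules(records))                 # TCCG is dropped
print(count_molecules(records, umi_gene=True))  # TCCG counted for both genes
```

Without the flag, only the GGTA molecule contributes a count; with it, the TCCG UMI adds one count to each of gene X and gene Y.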