Confusion about PacBio IDs #611

TinyTasy closed 10 months ago

TinyTasy commented 11 months ago

Operating system Linux

Conda environment

Describe the bug

Hello developers, I'm running into some confusion with the PB IDs. I used scMAS-ISO-seq on some mouse samples, processed everything with skera and the classical isoseq3 workflow, and finally created my Seurat gene and isoform count matrices with pigeon.

When looking more closely at my data, though, I noticed something I am not sure about. As far as I understand it, in the format PB.XX.YY, XX denotes a certain gene and YY the corresponding isoforms.

In the classification_filtered_lite.txt file, I have the ID PB.28775.YY with multiple isoforms. Nevertheless, this PB ID has two associated genes: Dcaf4 up to PB.28775.45, and Rbm25 starting from PB.28775.46. I see this for multiple genes, where two or more genes share the same PB.XX ID.

I had assumed that the PB.XX ID should always be the same for a gene. So my question is whether this is expected behaviour or whether I did something wrong. The correct nomenclature is important for the rest of my data analysis, so I would really appreciate your help, especially as I am quite new to long-read data analysis in general. In which step are the PB.XX.YY IDs actually associated with the gene names?

I know that issue #603 already handles a similar question, where you say that the PacBio ID does not correspond to the genome coordinates but is based only on the order in which the collapsed files are processed. Nevertheless, I would appreciate some more information.

To Reproduce None, as this is a general question after isoseq collapse.

Expected behavior Every gene has its own PB.XX ID, with the YY part only denoting isoforms.

jmattick commented 10 months ago

Hi @TinyTasy. This is expected behavior. The PB.XX represents a window that can contain any number of neighboring genes. The pbids will be associated with gene names during the pigeon classify step.
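
For anyone who wants to check how many PB.XX windows in their own data span more than one gene, here is a minimal Python sketch. It assumes the classification file is tab-separated and has columns named `isoform` (holding the PB.XX.YY ID) and `associated_gene`; those column names are an assumption, so check the header of your file. The function names and the `GeneA` row are made up for illustration, and the example table is not real output.

```python
import csv
import io
from collections import defaultdict


def genes_per_locus(handle, id_col="isoform", gene_col="associated_gene"):
    """Map each PB.XX prefix to the set of gene names seen for its isoforms.

    The column names are assumptions about the classification file header.
    """
    locus_genes = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        # "PB.28775.46" -> "PB.28775": drop only the trailing isoform index (YY)
        locus = row[id_col].rsplit(".", 1)[0]
        locus_genes[locus].add(row[gene_col])
    return dict(locus_genes)


def shared_loci(locus_genes):
    """Keep only the PB.XX prefixes that are associated with more than one gene."""
    return {
        locus: sorted(genes)
        for locus, genes in locus_genes.items()
        if len(genes) > 1
    }


# Tiny made-up table mirroring the case described in this issue.
example = io.StringIO(
    "isoform\tassociated_gene\n"
    "PB.28775.45\tDcaf4\n"
    "PB.28775.46\tRbm25\n"
    "PB.1.1\tGeneA\n"
)
result = shared_loci(genes_per_locus(example))
print(result)  # {'PB.28775': ['Dcaf4', 'Rbm25']}
```

To run it on real data, pass an open file handle for classification_filtered_lite.txt instead of the `io.StringIO` example. For gene-level analysis, grouping by the `associated_gene` value rather than by the PB.XX prefix avoids merging neighboring genes that share a window.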