PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License

Confusion about PacBio IDs #611

Closed: TinyTasy closed 10 months ago

TinyTasy commented 11 months ago

Operating system Linux

Conda environment

_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
avro-python3              1.9.0                    py37_0    bioconda
bam2fastx                 3.0.0                h9ee0642_0    bioconda
biopython                 1.78             py37h7f8727e_0  
blas                      1.0                         mkl  
bottleneck                1.3.5            py37h7deecbd_0  
brotlipy                  0.7.0           py37h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.18.1               h7f8727e_0  
ca-certificates           2023.01.10           h06a4308_0  
certifi                   2022.12.7        py37h06a4308_0  
cffi                      1.15.1           py37h5eee18b_3  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
cryptography              39.0.1           py37h9ce1e76_0  
curl                      7.87.0               h5eee18b_0  
fftw                      3.3.9                h27cfd23_1  
gffread                   0.12.7               h9a82719_0    bioconda
htslib                    1.9                  ha228f0b_7    bioconda
idna                      3.4              py37h06a4308_0  
intel-openmp              2021.4.0          h06a4308_3561  
iso8601                   1.0.2              pyhd3eb1b0_0  
joblib                    1.1.1            py37h06a4308_0  
krb5                      1.19.4               h568e23c_0  
ld_impl_linux-64          2.38                 h1181459_1  
libcurl                   7.87.0               h91b91d3_0  
libdeflate                1.0                  h14c3975_1    bioconda
libedit                   3.1.20221030         h5eee18b_0  
libev                     4.33                 h7f8727e_1  
libffi                    3.4.2                h6a678d5_6  
libgcc-ng                 11.2.0               h1234567_1  
libgfortran-ng            11.2.0               h00389a5_1  
libgfortran5              11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libnghttp2                1.46.0               hce63b2e_0  
libssh2                   1.10.0               h8f2d780_0  
libstdcxx-ng              11.2.0               h1234567_1  
lima                      2.7.1                h9ee0642_0    bioconda
mkl                       2021.4.0           h06a4308_640  
mkl-service               2.4.0            py37h7f8727e_0  
mkl_fft                   1.3.1            py37hd3c417c_0  
mkl_random                1.2.2            py37h51133e4_0  
ncurses                   6.4                  h6a678d5_0  
numexpr                   2.8.4            py37he184ba9_0  
numpy                     1.21.5           py37h6c91a56_3  
numpy-base                1.21.5           py37ha15fc14_3  
openssl                   1.1.1t               h7f8727e_0  
packaging                 22.0             py37h06a4308_0  
pandas                    1.3.5            py37h8c16a72_0  
patsy                     0.5.3            py37h06a4308_0  
pbbam                     1.3.0                h8e3dc82_0    bioconda
pbcommand                 2.1.1                      py_2    bioconda
pbcore                    2.1.2                      py_2    bioconda
pbcoretools               0.8.1                      py_1    bioconda
pbmm2                     1.10.0               h9ee0642_0    bioconda
pbskera                   0.1.0                hdfd78af_0    bioconda
pbtk                      3.1.0                h9ee0642_0    bioconda
pip                       22.3.1           py37h06a4308_0  
pycparser                 2.21               pyhd3eb1b0_0  
pyopenssl                 23.0.0           py37h06a4308_0  
pysam                     0.15.3           py37hda2845c_1    bioconda
pysocks                   1.7.1                    py37_1  
python                    3.7.16               h7a1cb2a_0  
python-dateutil           2.8.2              pyhd3eb1b0_0  
pytz                      2022.7           py37h06a4308_0  
readline                  8.2                  h5eee18b_0  
requests                  2.28.1           py37h06a4308_0  
samtools                  1.6                  hb116620_7    bioconda
scikit-learn              1.0.2            py37h51133e4_1  
scipy                     1.7.3            py37h6c91a56_2  
setuptools                65.6.3           py37h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
sqlite                    3.40.1               h5082296_0  
statsmodels               0.13.5           py37h7deecbd_1  
suppa                     2.3                        py_2    bioconda
threadpoolctl             2.2.0              pyh0d69192_0  
tk                        8.6.12               h1ccaba5_0  
urllib3                   1.26.14          py37h06a4308_0  
wheel                     0.38.4           py37h06a4308_0  
xz                        5.2.10               h5eee18b_1  
zlib                      1.2.13               h5eee18b_0  

Describe the bug

Hello developers, I'm running into some confusion with the PB IDs. I used scMAS-ISO-seq on some mouse samples, processed everything with skera and the classical isoseq3 workflow, and finally created my Seurat gene and isoform count matrices with pigeon.

When looking more closely at my data, though, I noticed something I am not sure about. As far as I understand it, in the format PB.XX.YY, XX denotes a certain gene and YY the corresponding isoforms.

In the classification_filtered_lite.txt file, the PB ID PB.28775.YY has multiple isoforms. Nevertheless, this PB ID has two associated genes: Dcaf4 up to PB.28775.45 and Rbm25 starting from PB.28775.46. I see this for multiple genes, where two or more genes share the same PB.XX ID.

I had assumed that the PB.XX ID should always be the same for a gene. So my question is whether this is expected behaviour or whether I did something wrong. The correct nomenclature is important for the rest of my data analysis, so I really appreciate your help, especially as I am quite new to LR-data analysis in general. In which step are the PB.XX.YY IDs actually associated with the gene names?

I know that Issue #603 already handles a similar question, where you say that the PacBio ID does not correlate with the genome coordinates but is based only on the order in which the collapsed files are processed. Nevertheless, I would appreciate some more information.

To Reproduce None, as this is a general question after isoseq collapse.

Expected behavior Every gene has its own PB.XX ID, with the YY part only denoting isoforms.

jmattick commented 10 months ago

Hi @TinyTasy. This is expected behavior. The PB.XX represents a window that can contain any number of neighboring genes. The pbids will be associated with gene names during the pigeon classify step.
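
To check how often one PB.XX window spans several genes, the classification table can be grouped by the PB.XX prefix. The following is a minimal Python sketch, not part of pigeon. It assumes the classification file is tab-separated with an `isoform` column holding the PB.XX.YY ID and an `associated_gene` column holding the gene name; adjust the column names if your file differs. The helper names `split_pbid` and `genes_per_locus` are hypothetical, and the third row of the toy table is made up for illustration.

```python
import csv
import io
from collections import defaultdict


def split_pbid(pbid):
    """Split 'PB.28775.45' into the window ID 'PB.28775' and isoform index 45."""
    prefix, locus, isoform = pbid.split(".")
    return f"{prefix}.{locus}", int(isoform)


def genes_per_locus(handle, id_col="isoform", gene_col="associated_gene"):
    """Map each PB.XX window to the set of gene names seen among its isoforms.

    The column names are assumptions about the classification file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    genes = defaultdict(set)
    for row in reader:
        locus, _ = split_pbid(row[id_col])
        genes[locus].add(row[gene_col])
    return genes


# Toy table: the first two rows mirror the case from the question,
# the third row is an invented single-gene window.
example = io.StringIO(
    "isoform\tassociated_gene\n"
    "PB.28775.45\tDcaf4\n"
    "PB.28775.46\tRbm25\n"
    "PB.100.1\tActb\n"
)

loci = genes_per_locus(example)
# Keep only windows that contain more than one gene.
shared = {locus: sorted(names) for locus, names in loci.items() if len(names) > 1}
print(shared)  # {'PB.28775': ['Dcaf4', 'Rbm25']}
```

For a real file, pass `open("classification_filtered_lite.txt", newline="")` instead of the in-memory example. Since a window can hold several genes, downstream gene-level grouping should use the gene name column rather than the PB.XX prefix.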