PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License
pigeon classify does not produce an output #589

Closed: idarolti closed this issue 1 year ago

idarolti commented 1 year ago

Operating system: Linux

Package name: pigeon --classify command

pigeon 1.0.0 (commit -v1.0.0)
Using:
  pbbam    : 2.2.0 (commit v2.2.0-1-g8c081f6)
  pbcopper : 2.1.0 (commit v2.1.0)
  boost    : 1.77
  htslib   : 1.14
  zlib     : 1.2.11

Conda environment: For reference, I had to download the pbpigeon and isoseq3 binaries directly because I kept getting segfaults when installing them with conda (similar to #568), which is why they do not appear in the conda environment list below.

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             4.5                       1_gnu
brotlipy                  0.7.0           py39h27cfd23_1003
ca-certificates           2021.7.5             h06a4308_1
certifi                   2021.5.30        py39h06a4308_0
cffi                      1.14.6           py39h400218f_0
chardet                   4.0.0           py39h06a4308_1003
conda                     4.10.3           py39h06a4308_0
conda-package-handling    1.7.3            py39h27cfd23_1
cryptography              3.4.7            py39hd23ed53_0
idna                      2.10               pyhd3eb1b0_0
ld_impl_linux-64          2.35.1               h7274673_9
libffi                    3.3                  he6710b0_2
libgcc-ng                 9.3.0               h5101ec6_17
libgomp                   9.3.0               h5101ec6_17
libstdcxx-ng              9.3.0               hd4cf53a_17
ncurses                   6.2                  he6710b0_1
openssl                   1.1.1k               h27cfd23_0
pip                       21.1.3           py39h06a4308_0
pycosat                   0.6.3            py39h27cfd23_0
pycparser                 2.20                       py_2
pyopenssl                 20.0.1             pyhd3eb1b0_1
pysocks                   1.7.1            py39h06a4308_0
python                    3.9.5                h12debd9_4
readline                  8.1                  h27cfd23_0
requests                  2.25.1             pyhd3eb1b0_0
ruamel_yaml               0.15.100         py39h27cfd23_0
setuptools                52.0.0           py39h06a4308_0
six                       1.16.0             pyhd3eb1b0_0
sqlite                    3.36.0               hc218d9a_0
tk                        8.6.10               hbc83047_0
tqdm                      4.61.2             pyhd3eb1b0_1
tzdata                    2021a                h52ac0ba_0
urllib3                   1.26.6             pyhd3eb1b0_1
wheel                     0.36.2             pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
yaml                      0.2.5                h7b6447c_0
zlib                      1.2.11               h7b6447c_3

Describe the bug: When running the command pigeon classify <sorted.gff> <annotations.gtf> <reference.fa> --num-threads 12 --log-level INFO, no output is produced even after running for more than 24 hours.

Error message: No error messages or warnings are produced.

To Reproduce: I followed this workflow (https://isoseq.how/classification/workflow.html) for classifying isoforms. The reference genome annotation was originally in GFF format, so I used AGAT to convert it to GTF. I then sorted and indexed the reference annotation GTF file and sorted the input transcript GFF file (both steps completed with no error messages). However, when running pigeon classify, no output is produced.

To test the command on a smaller dataset, I subset the input transcript GFF and reference annotation GTF files to just one gene and ran pigeon classify, which worked without issues. However, if I increase the number of genes by even a few, it gets stuck, with no output or log messages produced.
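For anyone who wants to repeat this kind of test, one way to build matching subsets of both files is to filter them by genomic region and then widen the region step by step until the hang reappears. The sketch below is only an illustration, not the exact commands used for this report; the function name subset_by_region and the example coordinates in the comment are hypothetical.

```python
def subset_by_region(lines, chrom, start, end):
    """Yield GFF/GTF records lying entirely within chrom:start-end.

    Coordinates are 1-based and inclusive, as in GFF/GTF.
    Comment lines (starting with '#') are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        cols = line.rstrip("\n").split("\t")
        # Skip lines that are too short or on another chromosome.
        if len(cols) < 5 or cols[0] != chrom:
            continue
        try:
            rec_start, rec_end = int(cols[3]), int(cols[4])
        except ValueError:
            continue  # skip records with non-numeric coordinates
        if rec_start >= start and rec_end <= end:
            yield line


# Hypothetical usage: apply the same region to both input files.
# with open("sorted.gff") as src, open("subset.gff", "w") as dst:
#     dst.writelines(subset_by_region(src, "chr1", 1, 500_000))
```

Using the same region for the transcript GFF and the annotation GTF keeps the two subsets consistent, which a per-file gene_id filter would not guarantee because the two files use different gene identifiers.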

Expected behavior: I expect pigeon to produce its output files, or at least to print an error message.

greensii commented 1 year ago

i had this same problem. see my reply here and see if that helps!

Magdoll commented 1 year ago

Hi @idarolti ,

Pigeon is pretty particular about GTF formats, so in addition to the kind response from @greensii above, you might also want to check the following constraints:

The pigeon GTF format requirements are below:

A tab-delimited, 9-column file per the GFF/GTF File Format:
Column 1 must be the chromosome.
Column 2 is ignored.
Column 3 will only be processed if it is gene, transcript, or exon. All other types are ignored?
Columns 4 & 5 are start/end.
Columns 6 & 8 are ignored?
Column 7 is the strand, which must be + or - (does it give an error if it is neither +/-)?
Column 9 is the attribute field, AKA a free-text string, but to be properly processed it must contain at a minimum the following, separated by semicolons. Ex: gene_id "ENSG0001"; transcript_id "ENST000A"; gene_name "TP53";
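As a rough pre-check before rerunning pigeon classify, the constraints above can be tested with a small script. This is only a sketch based on the list above, not pigeon's actual parser. The function names check_gtf_line and check_gtf_file are hypothetical, and the set of required attribute keys (gene_id, transcript_id, gene_name) is an assumption taken from the example attribute string, as is the choice not to require transcript_id on gene rows.

```python
import re

PROCESSED_FEATURES = {"gene", "transcript", "exon"}
# Assumption: required keys are copied from the example attribute string.
REQUIRED_KEYS = ("gene_id", "transcript_id", "gene_name")
# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def check_gtf_line(line, lineno=0):
    """Return a list of problems for one GTF line (empty if none found)."""
    if not line.strip() or line.startswith("#"):
        return []  # blank lines and comments are skipped
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"line {lineno}: expected 9 tab-delimited columns, "
                f"found {len(cols)}"]
    chrom, _source, feature, start, end, _score, strand, _frame, attrs = cols
    if feature not in PROCESSED_FEATURES:
        return []  # other feature types are reportedly ignored
    problems = []
    if not chrom:
        problems.append(f"line {lineno}: empty chromosome column")
    if not (start.isdecimal() and end.isdecimal()):
        problems.append(f"line {lineno}: non-integer start/end")
    elif int(start) > int(end):
        problems.append(f"line {lineno}: start is greater than end")
    if strand not in ("+", "-"):
        problems.append(f"line {lineno}: strand is {strand!r}, expected + or -")
    keys = {key for key, _value in ATTR_RE.findall(attrs)}
    for key in REQUIRED_KEYS:
        if key == "transcript_id" and feature == "gene":
            continue  # assumption: gene rows need no transcript_id
        if key not in keys:
            problems.append(f"line {lineno}: missing attribute {key}")
    return problems


def check_gtf_file(path):
    """Collect problems for every line of the GTF file at `path`."""
    problems = []
    with open(path) as handle:
        for number, text in enumerate(handle, start=1):
            problems.extend(check_gtf_line(text, number))
    return problems
```

Calling check_gtf_file on the converted annotation and printing each returned message should point to the first offending lines. A file converted from GFF that uses spaces instead of tabs, or a strand of "." on exon rows, would show up immediately.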

Let us know how this goes. Thanks! -Liz