PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.

Pigeon classify fails #645

Closed: sojichld closed this issue 7 months ago

sojichld commented 7 months ago

Operating system macOS (but I am working through an HPC cluster, so Red Hat Linux)

Package name pigeon 1.2.0

Conda environment

# packages in environment at /users/aademilu/.conda/envs/isoseq_env:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
amply                     0.1.6              pyhd8ed1ab_0    conda-forge
bcbio-gff                 0.7.0                    pypi_0    pypi
biopython                 1.81                     pypi_0    pypi
blas                      1.0                         mkl    conda-forge
bx-python                 0.10.0                   pypi_0    pypi
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.20.1               hd590300_0    conda-forge
ca-certificates           2023.11.17           hbcca054_0    conda-forge
certifi                   2023.11.17         pyhd8ed1ab_0    conda-forge
cogent                    8.0.0                    pypi_0    pypi
coin-or-cbc               2.10.10              h9002f0b_0    conda-forge
coin-or-cgl               0.60.7               h516709c_0    conda-forge
coin-or-clp               1.17.8               h1ee7a9c_0    conda-forge
coin-or-osi               0.108.8              ha2443b9_0    conda-forge
coin-or-utils             2.11.9               hee58242_0    conda-forge
coincbc                   2.10.10           0_metapackage    conda-forge
cupcake                   29.0.0                   pypi_0    pypi
cycler                    0.11.0                   pypi_0    pypi
cython                    0.29.32          py37hd23a5d3_0    conda-forge
docutils                  0.19             py37h89c1867_0    conda-forge
fonttools                 4.38.0                   pypi_0    pypi
htslib                    1.18                 h81da01d_0    bioconda
icu                       72.1                 hcb278e6_0    conda-forge
imageio                   2.31.2                   pypi_0    pypi
intel-openmp              2021.4.0          h06a4308_3561    anaconda
isoseq                    4.0.0                h9ee0642_0    bioconda
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.5                    pypi_0    pypi
krb5                      1.21.2               h659d440_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libblas                   3.9.0            12_linux64_mkl    conda-forge
libcblas                  3.9.0            12_linux64_mkl    conda-forge
libcurl                   8.2.1                hca28451_0    conda-forge
libdeflate                1.19                 hd590300_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.1.0               he5830b7_0    conda-forge
libgfortran-ng            13.2.0               h69a702a_0    conda-forge
libgfortran5              13.2.0               ha4646dd_0    conda-forge
libgomp                   13.1.0               he5830b7_0    conda-forge
libhwloc                  2.9.2           nocuda_h7313eea_1008    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
liblapack                 3.9.0            12_linux64_mkl    conda-forge
liblapacke                3.9.0            12_linux64_mkl    conda-forge
libnghttp2                1.52.0               h61bc06f_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libsqlite                 3.42.0               h2797004_0    conda-forge
libssh2                   1.11.0               h0841786_0    conda-forge
libstdcxx-ng              13.1.0               hfd8a6a1_0    conda-forge
libxml2                   2.11.5               h0d562d8_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
lima                      2.7.1                h9ee0642_0    bioconda
llvm-openmp               16.0.6               h4dfa4b3_0    conda-forge
matplotlib                3.5.3                    pypi_0    pypi
mkl                       2021.4.0           h8d4b97c_729    conda-forge
mkl-service               2.4.0            py37h402132d_0    conda-forge
mkl_fft                   1.3.1            py37h3e078e5_1    conda-forge
mkl_random                1.2.2            py37h219a48f_0    conda-forge
ncurses                   6.4                  hcb278e6_0    conda-forge
networkx                  2.6.3                    pypi_0    pypi
numpy                     1.20.3           py37h038b26d_0    conda-forge
numpy-base                1.21.5           py37ha15fc14_3    anaconda
openssl                   3.2.1                hd590300_0    conda-forge
packaging                 23.2                     pypi_0    pypi
parasail                  1.3.4                    pypi_0    pypi
pbbam                     2.4.0                h8db2425_0    bioconda
pbccs                     6.4.0                h9ee0642_0    bioconda
pbcopper                  2.3.0                hfce7173_0    bioconda
pbmm2                     1.13.0               h9ee0642_0    bioconda
pbpigeon                  1.2.0                h4ac6f70_0    bioconda
pbtk                      3.1.0                h9ee0642_0    bioconda
pillow                    9.5.0                    pypi_0    pypi
pip                       23.2.1             pyhd8ed1ab_0    conda-forge
pulp                      2.6.0            py37h89c1867_1    conda-forge
pyparsing                 3.1.1              pyhd8ed1ab_0    conda-forge
pysam                     0.21.0                   pypi_0    pypi
python                    3.7.12          hf930737_100_cpython    conda-forge
python-dateutil           2.8.2                    pypi_0    pypi
python_abi                3.7                     3_cp37m    conda-forge
pywavelets                1.3.0                    pypi_0    pypi
readline                  8.2                  h8228510_1    conda-forge
rocm-smi                  5.6.0                h59595ed_1    conda-forge
scikit-image              0.19.3                   pypi_0    pypi
scipy                     1.7.3                    pypi_0    pypi
setuptools                68.0.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.42.0               h2c6b66d_0    conda-forge
tbb                       2021.10.0            h00ab1b0_0    conda-forge
tifffile                  2021.11.2                pypi_0    pypi
tk                        8.6.12               h27826a3_0    conda-forge
typing-extensions         4.7.1                    pypi_0    pypi
wheel                     0.41.1             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.2                hfc55251_7    conda-forge

Describe the bug

I am trying to use pigeon classify. Although my file appears to be formatted like the examples on isoseq.how, pigeon reports a formatting error.

When I check my file for a missing tab or a similar problem, the tabs appear to be in the right places. [image]

The image above shows a vim forward-slash search for the tab character.
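The tab check can also be done without an editor. The sketch below assumes a plain-text GTF; the helper name `find_bad_column_counts` is hypothetical and not part of pigeon. It reports every non-comment line that does not split into exactly nine tab-separated columns.

```python
import sys
from typing import Iterable, List, Tuple


def find_bad_column_counts(lines: Iterable[str], expected: int = 9) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for records without `expected` tab-separated columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\r\n")
        # Blank lines and '#' comment/header lines carry no columns, so skip them.
        if not record or record.startswith("#"):
            continue
        count = len(record.split("\t"))
        if count != expected:
            bad.append((number, count))
    return bad


# Optional command-line use (hypothetical script name): python check_columns.py input.gtf
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for number, count in find_bad_column_counts(handle):
            print(f"line {number}: {count} columns")
```

An empty result only confirms the column layout. It says nothing about which attributes are present in column 9.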

Error message

| 20240202 11:42:58.030 | FATAL | pigeon classify ERROR: error loading reference annotations for reference: CM061257.1
GFF/GTF file error, improperly formatted record
  reason : missing gene_name attribute
  record : CM061257.1   02.11.2023.17_33_46.HLtadBra3_v1.gp     transcript      104022  137695  .       +       .       gene_id "ENST00000310340.PIGG.4"; transcript_id "ENST00000310340.PIGG.4";

This suggests that pigeon expects a gene_name attribute. However, this format looks the same to me as the compatible file shown on isoseq.how:

chr1    ENSEMBL transcript      17369   17436   .       -       .       gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";

So I would not expect my file to fail.

To Reproduce

I ran pigeon classify on such a file. Let me know which files I should provide for reproduction.

Expected behavior

pigeon classify runs to completion and produces its output.

sojichld commented 7 months ago

Found the error.

armintoepfer commented 7 months ago

It would be helpful if you documented what your problem was and how you fixed it.

sojichld commented 7 months ago

My file was missing the gene_name attribute. The example posted on isoseq.how begins like my file, but it does include a gene_name attribute in column 9, just a few entries further along. The GTF that I was using did not have this attribute, and I overlooked that it is listed as one of the three required fields.
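A check like the one below can locate such records before running pigeon. It is a minimal sketch with a hypothetical helper name. It assumes that the three required attributes are gene_id, transcript_id, and gene_name, which is inferred from the error message and the failing record rather than confirmed from the pigeon documentation.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Assumption: these are the required attribute keys (inferred from the error above).
REQUIRED = ("gene_id", "transcript_id", "gene_name")

# Matches quoted key/value pairs in column 9, e.g. gene_id "ENSG00000278267.1"
ATTRIBUTE = re.compile(r'(\w+)\s+"([^"]*)"')


def missing_attributes(
    lines: Iterable[str], required: Sequence[str] = REQUIRED
) -> List[Tuple[int, List[str]]]:
    """Return (line_number, [absent keys]) for records lacking any required attribute."""
    missing = []
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\r\n")
        if not record or record.startswith("#"):
            continue  # skip blank lines and comments
        columns = record.split("\t")
        if len(columns) < 9:
            continue  # malformed record; column layout is a separate problem
        keys = {key for key, _ in ATTRIBUTE.findall(columns[8])}
        absent = [name for name in required if name not in keys]
        if absent:
            missing.append((number, absent))
    return missing
```

Run against the record from the error message, this reports gene_name as absent, while the isoseq.how example line passes.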

I used the following awk script to add a gene_name attribute to column 9 of each line of my file, with the same value as that line's transcript_id attribute, and I was able to proceed after that:

awk 'BEGIN{FS=OFS="\t"} {if (split($9, arr, "transcript_id \"") > 1) {split(arr[2], id, "\""); $9 = $9 " gene_name \"" id[1] "\"";} print;}' input.gtf > output.gtf
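For readers who prefer Python, the same idea can be sketched as below. This is not the script used in this issue, and the function name is hypothetical. It differs from the awk one-liner in two ways that are my own choices: it leaves records that already have a gene_name untouched, and it ends the added attribute with a semicolon, as in the isoseq.how example line.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')
# gene_name used as an attribute key: at the start of column 9 or after a ';'
HAS_GENE_NAME = re.compile(r'(^|;\s*)gene_name\s')


def add_gene_name(line: str) -> str:
    """Append gene_name "<transcript_id>"; to column 9 when gene_name is absent."""
    record = line.rstrip("\r\n")
    ending = line[len(record):]  # preserve the original line ending
    columns = record.split("\t")
    if record.startswith("#") or len(columns) < 9:
        return line  # comments and malformed records pass through unchanged
    attributes = columns[8]
    match = TRANSCRIPT_ID.search(attributes)
    if match is None or HAS_GENE_NAME.search(attributes):
        return line  # nothing to copy from, or gene_name already present
    attributes = attributes.rstrip()
    if not attributes.endswith(";"):
        attributes += ";"
    columns[8] = f'{attributes} gene_name "{match.group(1)}";'
    return "\t".join(columns) + ending
```

One way to apply it is to read input.gtf line by line and write `add_gene_name(line)` for each line to output.gtf. Because lines that already carry a gene_name are returned unchanged, running it twice has no further effect.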