PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License
Pigeon classify fails #645

sojichld closed this issue 7 months ago

sojichld commented 7 months ago

Operating system: macOS locally, but I am running pigeon on an HPC cluster (Red Hat Linux)

Package name: pigeon 1.2.0

Conda environment

Describe the bug: I am trying to use pigeon classify. Although my file appears to be formatted like the given examples, pigeon reports a formatting error.

When I inspect my file for a missing tab or something similar, the tabs appear to be correctly placed. [image]

The image above shows a vim search (forward slash) on the tab character.

Error message

| 20240202 11:42:58.030 | FATAL | pigeon classify ERROR: error loading reference annotations for reference: CM061257.1
GFF/GTF file error, improperly formatted record
  reason : missing gene_name attribute
  record : CM061257.1     transcript      104022  137695  .       +       .       gene_id "ENST00000310340.PIGG.4"; transcript_id "ENST00000310340.PIGG.4";

This suggests to me that pigeon expects to see a "gene_name" attribute. However, this format looks to me the same as the compatible file in

chr1    ENSEMBL transcript      17369   17436   .       -       .       gene_id "ENSG00000278267.1"; transcript_id "ENST00000619216.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR6859-1-201"; level 3; tag "basic"; transcript_support_level "NA";

So I wouldn't expect this to fail or run into this issue.

To Reproduce: I ran the classify module on such a file. Let me know which files to provide for reproduction.

Expected behavior: pigeon classify produces its output.

sojichld commented 7 months ago

found error

armintoepfer commented 7 months ago

It would be helpful if you documented what your problem was and how you fixed it.

sojichld commented 7 months ago

My file was missing the gene_name attribute. The posted example, while it begins similarly to my file, does have a gene_name attribute in column 9, just a few entries farther along. The GTF that I was using did not have this attribute, and I overlooked that it is mentioned as one of the three required fields.
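For anyone hitting the same error, a quick pre-check of the GTF can show which records would be rejected before running pigeon. The following is a minimal sketch, not part of pigeon itself. It assumes the three required attributes are gene_id, transcript_id, and gene_name (only gene_name is confirmed by the error message in this thread), and the names `REQUIRED`, `missing_attributes`, and `report` are hypothetical.

```python
import re

# Assumption: these are the three attributes pigeon requires. The thread only
# confirms gene_name explicitly; adjust the tuple if the docs say otherwise.
REQUIRED = ("gene_id", "transcript_id", "gene_name")

# Matches quoted attributes such as: gene_name "MIR6859-1"
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_attributes(col9):
    """Return a dict of key -> value for the quoted attributes in column 9."""
    return dict(ATTR_RE.findall(col9))


def missing_attributes(line, required=REQUIRED):
    """Return the required attribute names that one GTF record lacks."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    attrs = parse_attributes(fields[8])
    return [name for name in required if name not in attrs]


def report(path):
    """Print the line number and missing attributes for each incomplete record."""
    with open(path) as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            missing = missing_attributes(line)
            if missing:
                print(f"line {number}: missing {', '.join(missing)}")
```

Calling `report("input.gtf")` lists every record that lacks one of the assumed required attributes, which makes a missing gene_name easier to spot than searching for tabs by eye.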

I used the following awk script to add a gene_name attribute to each line of my file, with the same value as the transcript_id attribute (both live in column 9, i.e. $9). I was able to proceed after that:

awk 'BEGIN{FS=OFS="\t"} {if (split($9, arr, "transcript_id \"") > 1) {split(arr[2], id, "\""); $9 = $9 " gene_name \"" id[1] "\"";} print;}' input.gtf > output.gtf
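The same fix can be sketched in Python for files where some records already carry gene_name. This is an illustrative variant, not the script used above: unlike the awk one-liner, it leaves records that already have gene_name untouched and ends the appended attribute with a semicolon. The function names `add_gene_name` and `rewrite` are hypothetical.

```python
import re

TRANSCRIPT_ID_RE = re.compile(r'transcript_id\s+"([^"]*)"')
GENE_NAME_RE = re.compile(r'\bgene_name\s+"')


def add_gene_name(line):
    """Append gene_name (copied from transcript_id) to one GTF line if absent."""
    stripped = line.rstrip("\n")
    if not stripped or stripped.startswith("#"):
        return line  # keep comments and blank lines as they are
    fields = stripped.split("\t")
    if len(fields) < 9 or GENE_NAME_RE.search(fields[8]):
        return line  # malformed line, or gene_name already present
    match = TRANSCRIPT_ID_RE.search(fields[8])
    if match is None:
        return line  # no transcript_id to copy from
    attrs = fields[8].rstrip()
    if not attrs.endswith(";"):
        attrs += ";"
    fields[8] = f'{attrs} gene_name "{match.group(1)}";'
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline


def rewrite(src, dst):
    """Copy src to dst, adding gene_name where it is missing."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(add_gene_name(line))
```

Running `rewrite("input.gtf", "output.gtf")` mirrors the awk invocation. As with the awk version, the gene name is just a copy of the transcript ID, so it satisfies the parser without carrying any real gene symbol.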