Closed vkkodali closed 4 years ago
Hi @vkkodali ,
SQANTI2 actually expects GFF3 format. You can convert your input using gffread as shown below:
gffread -T test.gtf > test.gff3
And you can see the difference after the conversion. It basically takes out the "gene" records.
NC_000001.11 TALON transcript 14404 20079 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 14404 14829 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 14970 15038 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 15796 15947 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 16607 16765 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 16858 17055 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 17233 17742 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 17915 18061 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 18268 18369 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 18501 18554 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 18913 20079 . - . transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON transcript 14404 20274 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 14404 14829 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 14970 15038 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 15796 15947 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 16607 16765 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 16858 17055 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 17233 17742 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 17915 18061 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11 TALON exon 18268 20274 . - . transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
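To make the effect of dropping the "gene" records concrete, here is a minimal Python sketch. It is not how gffread works internally; the function name drop_gene_records is hypothetical, and it assumes tab-separated lines with the feature type in the third column.

```python
def drop_gene_records(lines):
    """Return annotation lines whose feature type (column 3) is not 'gene'.

    Illustration only: assumes tab-separated GTF/GFF lines.
    Blank lines and comment lines starting with '#' are skipped.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 3 and fields[2] == "gene":
            continue  # drop gene-level records
        kept.append(line)
    return kept


if __name__ == "__main__":
    example = [
        'chr1\tTALON\tgene\t14404\t20274\t.\t-\t.\tgene_id "G1";',
        'chr1\tTALON\ttranscript\t14404\t20079\t.\t-\t.\tgene_id "G1"; transcript_id "T1";',
        'chr1\tTALON\texon\t14404\t14829\t.\t-\t.\tgene_id "G1"; transcript_id "T1";',
    ]
    for record in drop_gene_records(example):
        print(record)
```

The transcript and exon lines pass through unchanged; only the gene line is removed.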
Hi Liz,
I still hit the same error, the failing assert raw[2] == 'transcript', after converting my GTF to GFF3 using gffread -T test.gtf > test.gff3
gffread was installed via conda and the version is 0.11.7.
Here is my GTF file:
1 iFLAS transcript 44297 49138 . + . gene_id "transcript/30091"; transcript_id "transcript/30091";
1 iFLAS exon 44297 44947 . + . gene_id "transcript/30091"; transcript_id "transcript/30091"; exon_number "1"; exon_id "transcript/30091.1";
1 iFLAS CDS 44297 44947 . + 0 gene_id "transcript/30091"; transcript_id "transcript/30091"; exon_number "1"; exon_id "transcript/30091.1";
1 iFLAS start_codon 44297 44299 . + 0 gene_id "transcript/30091"; transcript_id "transcript/30091"; exon_number "1"; exon_id "transcript/30091.1";
1 iFLAS transcript 44297 49139 . + . gene_id "transcript/31099"; transcript_id "transcript/31099";
1 iFLAS exon 44297 44947 . + . gene_id "transcript/31099"; transcript_id "transcript/31099"; exon_number "1"; exon_id "transcript/31099.1";
1 iFLAS CDS 44297 44947 . + 0 gene_id "transcript/31099"; transcript_id "transcript/31099"; exon_number "1"; exon_id "transcript/31099.1";
1 iFLAS start_codon 44297 44299 . + 0 gene_id "transcript/31099"; transcript_id "transcript/31099"; exon_number "1"; exon_id "transcript/31099.1";
And this is my GFF3 file after conversion:
1 iFLAS transcript 44297 49138 . + . transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS exon 44297 44947 . + . transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS exon 45666 45803 . + . transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS exon 45888 46133 . + . transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS exon 46229 46342 . + . transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS exon 46451 46633 . + . transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS exon 47045 47262 . + . transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS exon 47650 49138 . + . transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS CDS 44297 44947 . + 0 transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS CDS 45666 45803 . + 0 transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS CDS 45888 46133 . + 0 transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS CDS 46229 46342 . + 0 transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS CDS 46451 46633 . + 0 transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS CDS 47045 47262 . + 0 transcript_id "transcript/30091"; gene_id "transcript/30091";
1 iFLAS CDS 47650 49138 . + 1 transcript_id "transcript/30091"; gene_id "transcript/30091";
It seems that the assert raw[2] == 'transcript' line only passes records whose feature field is transcript, so the exon lines fail the check and the error is raised.
What do you think?
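One way to check which feature types a file actually contains is to tally the third column. The sketch below is only a diagnostic aid, not part of SQANTI2 or Cupcake; count_feature_types is a hypothetical name, and tab-separated input is assumed.

```python
from collections import Counter


def count_feature_types(lines):
    """Tally the feature type (column 3) of each annotation line.

    Assumes tab-separated GTF/GFF lines; skips blanks and '#' comments.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    example = [
        '1\tiFLAS\ttranscript\t44297\t49138\t.\t+\t.\ttranscript_id "transcript/30091"; gene_id "transcript/30091";',
        '1\tiFLAS\texon\t44297\t44947\t.\t+\t.\ttranscript_id "transcript/30091"; gene_id "transcript/30091";',
        '1\tiFLAS\tCDS\t44297\t44947\t.\t+\t0\ttranscript_id "transcript/30091"; gene_id "transcript/30091";',
    ]
    for feature, n in sorted(count_feature_types(example).items()):
        print(feature, n)
```

To run it on a real file, pass an open file handle, for example count_feature_types(open("test.gff3")).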
@CrazyHsu, oh, actually, I may have updated Cupcake to a new version (v11.0.0) that deals with this. It was related to the attribute order in the last column, i.e. whether transcript_id or gene_id is listed first. Can you please update Cupcake (which is used to read the GFF3 file) and report back?
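For illustration, here is one way to read the last column so that the order of transcript_id and gene_id does not matter. This is a sketch under assumptions, not Cupcake's actual code: parse_gtf_attributes is a hypothetical helper, and it assumes GTF-style key "value"; pairs.

```python
import re

# Matches GTF-style attribute pairs such as: gene_id "ENSG00000227232.5";
_ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_gtf_attributes(attr_field):
    """Parse the 9th GTF column into a dict, independent of key order."""
    return dict(_ATTR_RE.findall(attr_field))


if __name__ == "__main__":
    a = parse_gtf_attributes('transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5";')
    b = parse_gtf_attributes('gene_id "ENSG00000227232.5"; transcript_id "TALONT000214958";')
    # Both orderings give the same mapping, so lookups by key are safe.
    print(a == b)
    print(a["transcript_id"], a["gene_id"])
```

Looking up attributes by key rather than by position avoids depending on which ID a tool happens to write first.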
Yes, Liz, I have tried Cupcake (v11.0.0) and it worked as expected. But that is in a Python 3 environment. Can I use SQANTI2 with Cupcake (Py2_v8.7.x)?
Hi @CrazyHsu, the latest SQANTI2 versions are all Python 3 only. I recommend switching to Py3 completely, as I have stopped supporting Py2.
OK, Liz, I will switch to Py3. Thanks for your quick reply!
When I try to run sqanti2_qc.py on a GTF generated using the TALON pipeline, I run into this error:
Here are the parameters I am using:
A test.gtf.zip file is attached.