Magdoll / SQANTI2

SQANTI2 is now replaced by SQANTI3. Please go to: https://github.com/ConesaLab/SQANTI3
Other
38 stars 15 forks source link

sqanti2_qc.py fails with assertion error #58

Closed vkkodali closed 4 years ago

vkkodali commented 4 years ago

When I try to run sqanti2_qc.py on a gtf generated using TALON pipeline, I am running into this error:

**** Running SQANTI2...
Traceback (most recent call last):
  File "/opt/SQANTI2/sqanti_qc2.py", line 2198, in <module>
    main()
  File "/opt/SQANTI2/sqanti_qc2.py", line 2193, in main
    split_dirs = split_input_run(args)
  File "/opt/SQANTI2/sqanti_qc2.py", line 1971, in split_input_run
    recs = [r for r in collapseGFFReader(args.isoforms)]
  File "/opt/SQANTI2/sqanti_qc2.py", line 1971, in <listcomp>
    recs = [r for r in collapseGFFReader(args.isoforms)]
  File "/usr/local/lib/python3.7/dist-packages/cupcake-10.0.1-py3.7-linux-x86_64.egg/cupcake/io/GFF.py", line 405, in __next__
    return self.read()            
  File "/usr/local/lib/python3.7/dist-packages/cupcake-10.0.1-py3.7-linux-x86_64.egg/cupcake/io/GFF.py", line 562, in read
    assert raw[2] == 'transcript'
AssertionError

Here are the parameters I am using:

Version     7.3.2
Input       test.gtf
Annotation  gencode.v33.annotation.ncbi_seqids.gtf
Genome      GCF_000001405.39_GRCh38.p13_genomic.fna
Aligner     minimap2
FLCount     NA
Expression  NA
Junction    intropolis.v1.hg19_with_liftover_to_hg38.tsv.min_count_10.modified
CagePeak    hg38.cage_peak_phase1and2combined_coord.bed
PolyA       human.polyA.list.txt
PolyAPeak   hg38.cage_peak_phase1and2combined_coord.bed
IsFusion    False

A test.gtf.zip file is attached.

Magdoll commented 4 years ago

Hi @vkkodali ,

SQANTI2 actually expects GFF3 format. You can convert your input using gffread below:

gffread -T test.gtf > test.gff3

And you can see the difference after the conversion. It basically takes out the "gene" records.

NC_000001.11    TALON   transcript  14404   20079   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    14404   14829   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    14970   15038   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    15796   15947   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    16607   16765   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    16858   17055   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    17233   17742   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    17915   18061   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    18268   18369   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    18501   18554   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    18913   20079   .   -   .   transcript_id "TALONT000214958"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   transcript  14404   20274   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    14404   14829   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    14970   15038   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    15796   15947   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    16607   16765   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    16858   17055   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    17233   17742   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    17915   18061   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
NC_000001.11    TALON   exon    18268   20274   .   -   .   transcript_id "TALONT000214910"; gene_id "ENSG00000227232.5"; gene_name "WASH7P";
CrazyHsu commented 4 years ago

Hi Liz I still came out the same issue that assert raw[2] == 'transcript' after converted my gtf to gff3 using gffread -T test.gtf > test.gff3, the gffread was installed via conda and the version is 0.11.7 here is my gtf format:

1       iFLAS   transcript      44297   49138   .       +       .       gene_id "transcript/30091"; transcript_id "transcript/30091"; 
1       iFLAS   exon    44297   44947   .       +       .       gene_id "transcript/30091"; transcript_id "transcript/30091"; exon_number "1"; exon_id "transcript/30091.1";
1       iFLAS   CDS     44297   44947   .       +       0       gene_id "transcript/30091"; transcript_id "transcript/30091"; exon_number "1"; exon_id "transcript/30091.1";
1       iFLAS   start_codon     44297   44299   .       +       0       gene_id "transcript/30091"; transcript_id "transcript/30091"; exon_number "1"; exon_id "transcript/30091.1";
1       iFLAS   transcript      44297   49139   .       +       .       gene_id "transcript/31099"; transcript_id "transcript/31099"; 
1       iFLAS   exon    44297   44947   .       +       .       gene_id "transcript/31099"; transcript_id "transcript/31099"; exon_number "1"; exon_id "transcript/31099.1";
1       iFLAS   CDS     44297   44947   .       +       0       gene_id "transcript/31099"; transcript_id "transcript/31099"; exon_number "1"; exon_id "transcript/31099.1";
1       iFLAS   start_codon     44297   44299   .       +       0       gene_id "transcript/31099"; transcript_id "transcript/31099"; exon_number "1"; exon_id "transcript/31099.1";

and this is my gff3 file after converting:

1       iFLAS   transcript      44297   49138   .       +       .       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   exon    44297   44947   .       +       .       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   exon    45666   45803   .       +       .       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   exon    45888   46133   .       +       .       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   exon    46229   46342   .       +       .       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   exon    46451   46633   .       +       .       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   exon    47045   47262   .       +       .       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   exon    47650   49138   .       +       .       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   CDS     44297   44947   .       +       0       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   CDS     45666   45803   .       +       0       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   CDS     45888   46133   .       +       0       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   CDS     46229   46342   .       +       0       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   CDS     46451   46633   .       +       0       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   CDS     47045   47262   .       +       0       transcript_id "transcript/30091"; gene_id "transcript/30091";
1       iFLAS   CDS     47650   49138   .       +       1       transcript_id "transcript/30091"; gene_id "transcript/30091";

It seems like that the assert raw[2] == 'transcript' line only pass the line which the feature filed is transcript, so the exon line can't pass the criterion, and then the error is thrown out. How do you think about it?

Magdoll commented 4 years ago

@CrazyHsu , Oh, actually, I may have updated Cupcake to a new version (v11.0.0) that deals with this. It was related to the last column order of whether transcript_id or gene_id is listed first. Can you please update Cupcake (which is used to read the GFF3 file) and report back?

CrazyHsu commented 4 years ago

Yes, Liz, I have tried Cupcake(v11.0.0), it worked as expected. But it is under the python3 environment, can i use SQANTI2 with Cupcake (Py2_v8.7.x)?

Magdoll commented 4 years ago

Hi @CrazyHsu SQANTI2 latest versions are all only for Python 3. I do recommend switching to Py3 completely as I have stopped supporting Py2.

CrazyHsu commented 4 years ago

OK, Liz, i will turn to Py3, thanks for your quick reply!