ablab / IsoQuant

Transcript discovery and quantification with long RNA reads (Nanopores and PacBio)
Other
129 stars 11 forks source link

unexpected keyword argument 'indexname' #188

Closed francicco closed 6 days ago

francicco commented 2 months ago

Hi,

I was testing the new release of IsoQuant and I found this error:

2024-05-11 15:15:01,206 - INFO - Running IsoQuant version 3.4.1
2024-05-11 15:15:01,527 - INFO - Novel unspliced transcripts will not be reported, set --report_novel_unspliced true to discover them
2024-05-11 15:15:01,527 - INFO -  === IsoQuant pipeline started === 
2024-05-11 15:15:01,527 - INFO - gffutils version: 0.10.1
2024-05-11 15:15:01,528 - INFO - pysam version: 0.22.1
2024-05-11 15:15:01,528 - INFO - pyfaidx version: 0.6.3.1
2024-05-11 15:15:01,528 - INFO - Checking input gene annotation
2024-05-11 15:15:07,260 - INFO - Gene annotation seems to be correct
2024-05-11 15:15:07,267 - INFO - Converting gene annotation file to .db format (takes a while)...
2024-05-11 15:16:01,265 - INFO - Gene database written to /user/work/tk19812/HeliconiniiProject/scRNA-IsoSeq/test/IsoQuant2.4.Diul.PCGs/Diul.v1.2.annotation.CAT.db
2024-05-11 15:16:01,267 - INFO - Provide this database next time to avoid excessive conversion
2024-05-11 15:16:01,268 - INFO - Loading gene database from /user/work/tk19812/HeliconiniiProject/scRNA-IsoSeq/test/IsoQuant2.4.Diul.PCGs/Diul.v1.2.annotation.CAT.db
2024-05-11 15:16:01,269 - INFO - Loading reference genome from /user/work/tk19812/HeliconiniiProject/CompleteAssemblies/AssembledGenomes/Diul.assembly.v1.1.fasta
2024-05-11 15:16:01,270 - CRITICAL - IsoQuant failed with the following error, please, submit this issue to https://github.com/ablab/IsoQuant/issuesTraceback (most recent call last):
  File "/user/work/tk19812/scWorkshop/miniforge3/envs/isoquant3.4/bin/isoquant.py", line 808, in <module>
    main(sys.argv[1:])
  File "/user/work/tk19812/scWorkshop/miniforge3/envs/isoquant3.4/bin/isoquant.py", line 802, in main
    run_pipeline(args)
  File "/user/work/tk19812/scWorkshop/miniforge3/envs/isoquant3.4/bin/isoquant.py", line 754, in run_pipeline
    dataset_processor = DatasetProcessor(args)
  File "/user/work/tk19812/scWorkshop/miniforge3/envs/isoquant3.4/share/isoquant-3.4.1-0/src/dataset_processor.py", line 403, in __init__
    self.reference_record_dict = Fasta(self.args.reference, indexname=args.fai_file_name)
TypeError: __init__() got an unexpected keyword argument 'indexname'

Cheers F

andrewprzh commented 2 months ago

@francicco

Seems like an issue with pyfaidx, could you update it to 0.7 or any higher version?

Best Andrey

francicco commented 2 months ago

Hi Andrey,

It looks like I was able to fix the bug. Now it looks like my gft/gff are not correctly identified do you have a gtf file that I can use as template to see how mine should be formatted?

I tried different ways to convert my gff3 file but none worked.

Cheers F

francicco commented 2 months ago

is this one representative for the new Isoquant?

andrewprzh commented 2 months ago

@francicco

The GTF processing has not changed since the previous version. I have only added some preliminary checks that can be turned off with --no_gtf_check. The GTF you linked is acceptable, yes.

Do you get any error messages or warnings? Do you have a log for the failed run?

Best Andrey

francicco commented 2 months ago

Hi @andrewprzh,

I converted the gff3 with a script of mine and it works. The actual job in the queue. it will take some time I guess. Just to double-check, what I'm doing is giving ONT reads mapped with minimap2 as BAM file to isoquant together with a reference annotation that I'd like to fix and update.

Is that fine? Or do I need to collapse my reads somehow?

Thanks a lot F

andrewprzh commented 2 months ago

@francicco

Everything seems right, no pre-processing/collapsing is needed, IsoQuant does everything on its own. Moreover, you can give IsoQuant FASTQ files and it will run minimap2 with optimal parameters automatically.

Let me know if you encounter any issues during the run.

Best Andrey

francicco commented 2 months ago

Ok, one job started. I'm getting this info:

2024-05-15 02:59:58,224 - INFO - primary: 14240592
2024-05-15 02:59:58,224 - INFO - secondary: 875089
2024-05-15 02:59:58,225 - INFO - supplementary: 663575
2024-05-15 02:59:58,225 - INFO - unaligned: 5245182
2024-05-15 02:59:58,225 - INFO - Resolving multimappers
2024-05-15 03:00:09,269 - INFO - Finishing read assignment, total assignments 14077454, polyA percentage 61.0
2024-05-15 03:00:17,692 - INFO - Read assignments files saved to IsoQuant2.4.Hmel.PCGs/HmelIsoSeq/aux/HmelIsoSeq.save*. 
2024-05-15 03:00:17,692 - INFO - To keep these intermediate files for debug purposes use --keep_tmp flag
2024-05-15 03:00:17,710 - INFO - Total alignments used for analysis: 14077454, polyA tail detected in 8593063 (61.0%)
2024-05-15 03:00:17,711 - INFO - Processing assigned reads HmelIsoSeq
2024-05-15 03:00:17,711 - INFO - Transcript models construction is turned on
2024-05-15 03:00:17,715 - INFO - Transcript construction options:
2024-05-15 03:00:17,716 - INFO -   Novel monoexonic transcripts will be reported: no
2024-05-15 03:00:17,716 - INFO -   PolyA tails are required for multi-exon transcripts to be reported: no
2024-05-15 03:00:17,716 - INFO -   PolyA tails are required for 2-exon transcripts to be reported: no
2024-05-15 03:00:17,716 - INFO -   PolyA tails are required for known monoexon transcripts to be reported: no
2024-05-15 03:00:17,716 - INFO -   PolyA tails are required for novel monoexon transcripts to be reported: yes
2024-05-15 03:00:17,716 - INFO -   Splice site reporting level: only_stranded

Are these good stats? lots of NOs, should I turn some flags on? Cheers F

francicco commented 2 months ago

Hi @andrewprzh,

One of the two alignments just finished. I wanted to run some stats on HmelIsoSeq.extended_annotation.gtf, which I think is the final annotation, with new, old, and fixed genes and transcripts. Unfortunately, I'm getting a lot of errors giving it to multiple tools for conversion/parsing, including SQANTI which returns errors. I think it has some malformed problems.

I'm going to write a script trying to fix it. I'll let you know. Cheers F

andrewprzh commented 2 months ago

Dear @francicco

Are these good stats? lots of NOs, should I turn some flags on?

IsoQuant decided to set these flags due to polyA tail percentage. If you'd like your IsoQuant to report novel transcript models only with confirmed polyA sites you can set --polya_requirement always.

Could you send me the errors you get from these tools and the GTF file itself, I'm eager to fix the problem if there is one. I regularly check IsoQuant output with gffcompare and have not detected any problems.

Best Andrey

francicco commented 2 months ago

Hi @andrewprzh,

So, if I try with gffcompare for example I'm getting this:

gffcompare -r $MyRefgff3 extended_annotation.gtf
  29275 reference transcripts loaded.
  2 duplicate reference transcripts discarded.
Error: no valid ID found for GFF record

I noticed that one of the problem is a double " nex to transcript_id, such as in a line like this one:

Hmel200004o     annotation      transcript      1       10141   .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; Canonical "True"; exons "10"; score "5.64"; 
Hmel200004o     annotation      start_codon     1       3       .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "1"; exon_id "Hmel200004o.110";
Hmel200004o     annotation      CDS     1       58      .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "2"; exon_id "Hmel200004o.89";
Hmel200004o     annotation      exon    1       58      .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "3"; exon_id "Hmel200004o.89";
Hmel200004o     annotation      CDS     1729    1818    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "4"; exon_id "Hmel200004o.123";
Hmel200004o     annotation      exon    1729    1818    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "5"; exon_id "Hmel200004o.123";
Hmel200004o     annotation      CDS     2919    3063    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "6"; exon_id "Hmel200004o.103";
Hmel200004o     annotation      exon    2919    3063    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "7"; exon_id "Hmel200004o.103";
Hmel200004o     annotation      CDS     3349    3440    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "8"; exon_id "Hmel200004o.93";
Hmel200004o     annotation      exon    3349    3440    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "9"; exon_id "Hmel200004o.93";
Hmel200004o     annotation      CDS     3811    3964    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "10"; exon_id "Hmel200004o.94";
Hmel200004o     annotation      exon    3811    3964    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "11"; exon_id "Hmel200004o.94";
Hmel200004o     annotation      CDS     4330    4525    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "12"; exon_id "Hmel200004o.95";
Hmel200004o     annotation      exon    4330    4525    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "13"; exon_id "Hmel200004o.95";
Hmel200004o     annotation      CDS     4803    4895    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "14"; exon_id "Hmel200004o.96";
Hmel200004o     annotation      exon    4803    4895    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "15"; exon_id "Hmel200004o.96";
Hmel200004o     annotation      CDS     5422    5550    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "16"; exon_id "Hmel200004o.97";
Hmel200004o     annotation      exon    5422    5550    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "17"; exon_id "Hmel200004o.97";
Hmel200004o     annotation      CDS     6560    6749    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "18"; exon_id "Hmel200004o.98";
Hmel200004o     annotation      exon    6560    6749    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "19"; exon_id "Hmel200004o.98";
Hmel200004o     annotation      stop_codon      6747    6749    .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "20"; exon_id "Hmel200004o.111";
Hmel200004o     annotation      UTR     8334    10141   .       +       .       gene_id "transcript:Hmel.Hmel200004oG1"; transcript_id ""transcript:Hmel.Hmel200004oG1.3"; exon "21"; exon_id "Hmel200004o.112";

Once I fix it with a simple sed command now gffcompare runs. But I suspect there might be more! I'll have to investigate further.

Cheers F

francicco commented 2 months ago

Hi,

I think I found what bothers Mikado on top of the double ". It seems like it is the exons "X" tag. Cheers F

andrewprzh commented 6 days ago

Thanks a lot for letting me know and sorry for the late reply.

Will close this issue for now.