Xinglab / isoCirc

isoCirc
GNU General Public License v3.0
10 stars 4 forks source link

gtfToGenePred fails on gene with CDS but no valid frames #9

Closed Asa-Budnick closed 1 year ago

Asa-Budnick commented 1 year ago

Hello! I am using the isocirc method and am running into an error that seems associated with the GTF file:

Here is the relevant portion of my output: == 22:57:18-Jan-03-2023 == [Mapping] Mapping consensus sequence to genome done! == 22:57:18-Jan-03-2023 == [Classifying] Classifying consensus alignment ... == 22:57:18-Jan-03-2023 == [classify_bam_core] Processing /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/cons.fa.sam ... == 22:57:21-Jan-03-2023 == [classify_bam_core] 100000 BAM records done ... == 22:57:22-Jan-03-2023 == [classify_bam_core] Processing /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/cons.fa.sam done. == 22:57:22-Jan-03-2023 == [Classifying] Classifying consensus alignment done! == 22:57:27-Jan-03-2023 == [gtfToGenePred] gtfToGenePred -genePredExt -ignoreGroupsWithoutExons /gpfs_common/share01/hwsedero/abudnic/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.gene_pred Error: exonFrames field is being added, but I found a gene (LotjaGi1g1v0068450.1) with CDS but no valid frames. This can happen if program is invoked with -genePredExt but no valid frames are given in the file. If the 8th field of GFF/GTF file is always a placeholder, then don't use -genePredExt. == 22:57:29-Jan-03-2023 == [gtfToGenePred] Error: gtfToGenePred -genePredExt -ignoreGroupsWithoutExons /gpfs_common/share01/hwsedero/abudnic/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.gene_pred

I found a similar issue here that the developer seems to have solved. Please let me know if you have any suggestions. Thanks :D

yangao07 commented 1 year ago

This may be solved by removing "-genePredExt" from the gtfToGenePred command. Since I don't have your original GTF annotation file, can you try this command and let me know if it works?

gtfToGenePred -ignoreGroupsWithoutExons /gpfs_common/share01/hwsedero/abudnic/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.gene_pred

If it works, I will make the change to the isocirc codes and push it to the repo/pypi.

Thanks, Yan

Asa-Budnick commented 1 year ago

Hello Yan,

Thank you for your reply and I'm sorry for the delayed response. The code above does work to generate an output and would love for it to be implemented into the isocirc repo if that's possible. Thank you!

yangao07 commented 1 year ago

Pushed it to the repo.

Asa-Budnick commented 1 year ago

Hello Yan,

I attempted to run the updated repo with the test data as well as with my own data and got this index error for both. Could you give this a look as well?

Thank you!

== 13:07:03-Feb-07-2023 == [gtfToGenePred] gtfToGenePred -ignoreGroupsWithoutExons /gpfs_common/share01/hwsedero/abudnic/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.gene_pred == 13:07:06-Feb-07-2023 == [genePredToBed] genePredToBed /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.gene_pred /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.bed == 13:07:06-Feb-07-2023 == [get_transcript_from_bed12] Loading transcript from /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.gene_pred ... Traceback (most recent call last): File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 219, in main() File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 216, in main isocirc_core(args) File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 132, in isocirc_core hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa, File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 771, in hcBSJ_fullIso all_trans = pg.get_transcript_from_gene_pred(all_anno_gene_pred, high_bam) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_gff.py", line 392, in get_transcript_from_gene_pred gene_id = ele[pred_header['geneID']] IndexError: list index out of range

yangao07 commented 1 year ago

So sorry that this is a careless bug. Just fixed it (v1.0.6).

Asa-Budnick commented 1 year ago

Hello Yan, Thanks for your ongoing efforts,

I've run the updated v1.0.6 again and ran into a different but similar issue. It may also be associated with a lack of index files which apparently did not get generated for the high.bam and low.bam. Could you take a look?

== 00:48:01-Feb-09-2023 == [classify_bam_core] 3700000 BAM records done ... == 00:48:03-Feb-09-2023 == [classify_bam_core] 3800000 BAM records done ... == 00:48:06-Feb-09-2023 == [classify_bam_core] 3900000 BAM records done ... == 00:48:08-Feb-09-2023 == [classify_bam_core] 4000000 BAM records done ... == 00:48:09-Feb-09-2023 == [classify_bam_core] Processing /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/cons.fa.sam done. == 00:48:09-Feb-09-2023 == [Classifying] Classifying consensus alignment done! == 00:48:48-Feb-09-2023 == [gtfToGenePred] gtfToGenePred -ignoreGroupsWithoutExons /gpfs_common/share01/hwsedero/abudnic/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.gene_pred == 00:48:50-Feb-09-2023 == [genePredToBed] genePredToBed /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.gene_pred /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.bed == 00:48:50-Feb-09-2023 == [get_transcript_from_gtf] Loading transcript from /gpfs_common/share01/hwsedero/abudnic/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf ... == 00:48:52-Feb-09-2023 == [get_transcript_from_gtf] Loading transcript from /gpfs_common/share01/hwsedero/abudnic/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf done! == 00:48:52-Feb-09-2023 == [get_splice_site_from_bed12] Loading splice site from /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.bed ... [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/high.bam' == 00:48:53-Feb-09-2023 == [get_splice_site_from_bed12] Loading splice site from /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.bed done! == 00:48:53-Feb-09-2023 == [get_splice_junction_from_bed12] Loading splice junction from /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.bed ... [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/high.bam' == 00:48:53-Feb-09-2023 == [get_splice_junction_from_bed12] Loading splice junction from /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.bed done! == 00:48:53-Feb-09-2023 == [get_exon_from_bed12] Loading exon from /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.bed ... [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/high.bam' == 00:48:53-Feb-09-2023 == [get_exon_from_bed12] Loading exon from /gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/Lotusjaponicus_Gifu_v1.3_predictedProteins.agat.convert.gtf.bed done! == 00:48:53-Feb-09-2023 == [get_back_splice_junction_from_bed] Loading splice junction from /gpfs_common/share01/hwsedero/abudnic/isoCirc/test_data/chr16_circRNA_toy.bed ... [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/high.bam' == 00:48:53-Feb-09-2023 == [get_back_splice_junction_from_bed] Loading splice junction from /gpfs_common/share01/hwsedero/abudnic/isoCirc/test_data/chr16_circRNA_toy.bed done! [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/high.bam' [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/low.bam' == 00:49:04-Feb-09-2023 == [read_wise_eval] Generating read-wise evaluation result ... Traceback (most recent call last): File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 219, in <module> main() File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 216, in main isocirc_core(args) File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 132, in isocirc_core hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa, File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 797, in hcBSJ_fullIso eval_core(processed_cnt, r, cons_info_dict, ref_fa, cons_fa, all_site, all_exon, all_sj, circ_sj, sj_xid, key_sj_xid, site_dis, end_dis, all_out) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 742, in eval_core is_known_bsj, is_cano_bsj, dis_to_cano_bsj, bsj_motif, align_bsj = pg.is_known_cano_bsj(bsj, circ_sj, ref_seq, cons_fa[r.query_name][:].seq.upper(), int(eval_out['startCoor0based']), int(eval_out['endCoor']), r.is_reverse, r.cigartuples, int(eval_out['refMapLen']), int(eval_out['consMapLen']), int(eval_out['consLen']), end_dis, force_strand, bsj_dis_to_known_ss) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_gff.py", line 771, in is_known_cano_bsj score, alignBSJ1 = get_cano_bsj_align(up_dis1, down_dis1, strand1, ref_seq, read_seq, start, end, end_dis, is_reverse, cigartuples, ref_map_len, cons_len) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_gff.py", line 673, in get_cano_bsj_align return pb.pairwise_align(bsj_ref_seq, bsj_read_seq, 'g', True) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_bam.py", line 103, in pairwise_align return res.score, get_cigar_from_pairwise_res(r.format()) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_bam.py", line 63, in get_cigar_from_pairwise_res target_str, query_str, aln_str = ele[2], ele[9], ele[5] IndexError: list index out of range

yangao07 commented 1 year ago

I actually also noticed this error, this is caused by the biopython version conflict. The current code works with biopython < v1.79. I guess you have v1.79 or 1.80. Let me fix this later.

Asa-Budnick commented 1 year ago

Hello Yan, I rolled back my biopython to 1.78 and again to 1.77 after getting the same errors. Here is the error:

== 01:44:18-Feb-12-2023 == [get_back_splice_junction_from_bed] Loading splice junction from /gpfs_common/share01/hwsedero/abudnic/isoCirc/test_data/chr16_circRNA_toy.bed ... [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/high.bam' == 01:44:18-Feb-12-2023 == [get_back_splice_junction_from_bed] Loading splice junction from /gpfs_common/share01/hwsedero/abudnic/isoCirc/test_data/chr16_circRNA_toy.bed done! [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/high.bam' [E::idx_find_and_load] Could not retrieve index file for '/gpfs_common/share01/hwsedero/abudnic/221220_Lj_Leaf_Isocirc/isocirc_output/low.bam' == 01:44:35-Feb-12-2023 == [read_wise_eval] Generating read-wise evaluation result ... /usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/Bio/Align/init.py:1277: BiopythonDeprecationWarning: alignment.format has been deprecated, and we intend to remove it in a future release of Biopython. Instead of alignment.format(), please use format(alignment) or an f-string. warnings.warn( Traceback (most recent call last): File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 219, in main() File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 216, in main isocirc_core(args) File "/usr/local/usrapps/hwsedero/CIRIlong/bin/isocirc", line 132, in isocirc_core hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa, File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 797, in hcBSJ_fullIso eval_core(processed_cnt, r, cons_info_dict, ref_fa, cons_fa, all_site, all_exon, all_sj, circ_sj, sj_xid, key_sj_xid, site_dis, end_dis, all_out) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 742, in eval_core is_known_bsj, is_cano_bsj, dis_to_cano_bsj, bsj_motif, align_bsj = pg.is_known_cano_bsj(bsj, circ_sj, ref_seq, cons_fa[r.query_name][:].seq.upper(), int(eval_out['startCoor0based']), int(eval_out['endCoor']), r.is_reverse, r.cigartuples, int(eval_out['refMapLen']), int(eval_out['consMapLen']), int(eval_out['consLen']), end_dis, force_strand, bsj_dis_to_known_ss) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_gff.py", line 771, in is_known_cano_bsj score, alignBSJ1 = get_cano_bsj_align(up_dis1, down_dis1, strand1, ref_seq, read_seq, start, end, end_dis, is_reverse, cigartuples, ref_map_len, cons_len) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_gff.py", line 673, in get_cano_bsj_align return pb.pairwise_align(bsj_ref_seq, bsj_read_seq, 'g', True) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_bam.py", line 103, in pairwise_align return res.score, get_cigar_from_pairwise_res(r.format()) File "/usr/local/usrapps/hwsedero/CIRIlong/lib/python3.9/site-packages/isocirc/parse_bam.py", line 63, in get_cigar_from_pairwise_res target_str, query_str, aln_str = ele[2], ele[9], ele[5] IndexError: list index out of range

Please let me know if you have troubleshooting recommendations, thanks!