annotation issue? - Githubissues

KateK commented 3 years ago

Hello,

I faced this kind of problem:

== 16:16:34-Apr-14-2021 == [gtf2gene] gtf2gene nanopore_circ/isocirc.bed.exon.gtf AT.gtf nanopore_circ/isocirc.bed.ovlp.gene.out
Traceback (most recent call last):
  File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 219, in <module>
    main()
  File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 216, in main
    isocirc_core(args)
  File "/opt/exp_soft/local/generic/python/3.9.2/bin/isocirc", line 132, in isocirc_core
    hf.hcBSJ_fullIso(high_bam, low_bam, long_len_fn, cons_info, cons_fa,
  File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 826, in hcBSJ_fullIso
    itst_out_dict = intersect_with_bed(out_dir, circRNA_bed, all_anno, all_anno_bed, itst_anno_dict, flank_len, bedtools)
  File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 414, in intersect_with_bed
    get_ovlp_gene_name_id(ovlp_gene_name_id, gene_id_dict, gene_name_dict, gene_strand_dict)
  File "/opt/exp_soft/local/generic/python/3.9.2/lib/python3.9/site-packages/isocirc/hcBSJ_fullIso.py", line 214, in get_ovlp_gene_name_id
    strand_dict[ele[0]] = ele[3] if strand_dict[ele[0]] == 'NA' else strand_dict[ele[0]] + ',' + ele[3]
IndexError: list index out of range

I assume that this is a problem with gtf file (maybe chromosome format ?). Could You please tell me how to manage? Also can I run the piplene from this point without repeating previous steps if they are good?

yangao07 commented 3 years ago

Can you show me a few lines of these three files: nanopore_circ/isocirc.bed.exon.gtf, AT.gtf, nanopore_circ/isocirc.bed.ovlp.gene.out?

KateK commented 3 years ago

nanopore_circ/isocirc.bed.exon.gtf

Pt  isocirc exon    15483   15772   .   +   .   gene_id "isocirc
0"; transcript_id "isocirc0"; exon_number "1"; exon_id "isocirc0.1";
Pt  isocirc exon    21057   21240   .   -   .   gene_id "isocirc
1"; transcript_id "isocirc1"; exon_number "1"; exon_id "isocirc1.1";
Pt  isocirc exon    23983   24183   .   -   .   gene_id "isocirc
2"; transcript_id "isocirc2"; exon_number "1"; exon_id "isocirc2.1";
Pt  isocirc exon    34822   34978   .   -   .   gene_id "isocirc
3"; transcript_id "isocirc3"; exon_number "1"; exon_id "isocirc3.1";
Pt  isocirc exon    39893   40107   .

AT.gtf

#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2010-09
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1   araport11   gene    3631    5899    .   +   .   gene_id 
"AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_
coding";
1   araport11   transcript  3631    5899    .   +   .   
gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_sourc
e "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; tra
nscript_biotype "protein_coding";
1   araport11   exon    3631    3913    .   +   .   gene_id 
"AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; g
ene_source "araport11"; gene_biotype "protein_coding"; transcript_source "arapor
t11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1   araport11   CDS 3760    3913    .   +   0   gene_id 
"AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; g
ene_source "araport11"; gene_biotype "protein_coding"; transcript_source "arapor
t11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";

nanopore_circ/isocirc.bed.ovlp.gene.out

isocirc10   ATCG01180   RRN23S.2    -
isocirc11   ATCG00950   RRN23S.1    +
isocirc12   ATCG00950   RRN23S.1    +
isocirc13   ATCG00950   RRN23S.1    +
isocirc14   ATCG01110   NDHH    -
isocirc15   ATCG01180   RRN23S.2    -
isocirc17   ATCG01180   RRN23S.2    -

yangao07 commented 3 years ago

They look normal to me. Can you also try to do this: awk '(NF != 4)' nanopore_circ/isocirc.bed.ovlp.gene.out | head and let me know what you get.

KateK commented 3 years ago

isocirc39   AT5G46730       +
isocirc40   AT5G46730       +
isocirc41   AT5G46730       +
isocirc42   AT5G46730       +
isocirc43   AT5G46730       +
isocirc44   AT5G46730       +
isocirc45   AT5G46730       +
isocirc46   AT5G46730       +
isocirc48   AT5G51530       -
isocirc49   AT5G51530       -

yangao07 commented 3 years ago

OK, I think I know what happens here. Some of the genes in your GTF file AT.gtf may only have gene_id but no gene_name tags. You can try grep AT5G46730 AT.gtf | grep gene_name to see if there are gene_name tags.

KateK commented 3 years ago

You've got the point. I see no output. How to handle that? Do I have to assign manually gene names?

Can I run the pipeline from this point without repeating previous steps?

yangao07 commented 3 years ago

For now, isoCirc require both gene name and gene id to be in the GTF file. So you have to manually add those names. I think you can simply copy the gene id as the gene name.

Actually, you can run isoCirc from this step if you have the source code downloaded from github. Then try this: python /path/to/isocirc_repo/isoCirc_pipeline/isocirc/hcBSJ_fullIso.py, replace /path/to/isocirc_repo as your path. The input of this script is what you have generated in the previous steps, and the last 3 positional arguments are the 3 output files.

KateK commented 3 years ago

Thanks! I menaged. I have three more questions.

In my isocirc_stats.out see that from almost 5mln reads i got 1.7mln reads with consensus and only 16 253 mapabble reads with consensus. At least i got 3500 candidate of BSJs and only 49 high confidence. Does it mean that my data is poor quality, or is it normal result ? What are the rest sequences that didn't produce consensus? Is there a possibility that we have longer circRNAs that didn't get into rolling circle?
I see in bed file that the score value everywere is 0. If they are high quality , why i got 0 points score.

I also tried to draw a plot, but I also got an error (i used modified gtf) :


== 15:36:44-Apr-15-2021 == [Gene structure from annotation file] grep NA ../../backup/ReferenceGenomes/reference_at/Arabidopsis_thaliana.TAIR10.35.isocirc.gtf > .//gene.gtf
Traceback (most recent call last):
File "/home/kasia/miniconda2/bin/isocircPlot", line 674, in <module>
align_details, all_struct, isoform_struct, gene_struct = align_to_circRNA_fa(ref_fa, anno_gtf, read_fa, read_fa_len, isocirc_bed, isocirc_read_list, out_dir, circRNA_ref_fa)
File "/home/kasia/miniconda2/bin/isocircPlot", line 631, in align_to_circRNA_fa
ref_seq = ref_seqs[ref_name]
KeyError: 'isocirc0'

yangao07 commented 3 years ago

If your reference genome and GTF file are matched, I am afraid that the reason is indeed the data quality is not very good.

For the plot, can you show me a few lines of the file you provided to the plotting script?

KateK commented 3 years ago

I used the same file like in original isocirc pipeline, but I converted it from fastq to fasta. Should I prepare another fasta file with reads mapping to the region of circRNA and use "isocircX" in header?

yangao07 commented 3 years ago

== 15:36:44-Apr-15-2021 == [Gene structure from annotation file] grep NA ../../backup/ReferenceGenomes/reference_at/Arabidopsis_thaliana.TAIR10.35.isocirc.gtf > .//gene.gtf

I think the problem is in your list file. The gene name should not be NA.

yangao07 commented 3 years ago

2. I see in bed file that the score value everywere is 0. If they are high quality , why i got 0 points score.

isoCirc does not assign scores to the bed records, so they are always 0.

KateK commented 3 years ago

When I change to another circ I got the same error. I copied info from isocirc.out file about gene name and reads.

== 14:17:33-Apr-16-2021 == [Gene structure from annotation file] grep HCF152 ../../backup/ReferenceGenomes/reference_at/Arabidopsis_thaliana.TAIR10.35.isocirc.gtf > .//gene.gtf
Traceback (most recent call last):
  File "/home/kasia/miniconda2/bin/isocircPlot", line 674, in <module>
    align_details, all_struct, isoform_struct, gene_struct = align_to_circRNA_fa(ref_fa, anno_gtf, read_fa, read_fa_len, isocirc_bed, isocirc_read_list, out_dir, circRNA_ref_fa)
  File "/home/kasia/miniconda2/bin/isocircPlot", line 631, in align_to_circRNA_fa
    ref_seq = ref_seqs[ref_name]
KeyError: 'isocirc27'

And I got generated file gene.gtf so the grep command worked.

3   araport11   gene    2958676 2961299 .   +   .   gene_id "AT3G09650"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding";
3   araport11   transcript  2958676 2961299 .   +   .   gene_id "AT3G09650"; transcript_id "AT3G09650.1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
3   araport11   exon    2958676 2961299 .   +   .   gene_id "AT3G09650"; transcript_id "AT3G09650.1"; exon_number "1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT3G09650.1.exon1";
3   araport11   CDS 2958704 2961037 .   +   0   gene_id "AT3G09650"; transcript_id "AT3G09650.1"; exon_number "1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT3G09650.1";
3   araport11   start_codon 2958704 2958706 .   +   0   gene_id "AT3G09650"; transcript_id "AT3G09650.1"; exon_number "1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
3   araport11   stop_codon  2961038 2961040 .   +   0   gene_id "AT3G09650"; transcript_id "AT3G09650.1"; exon_number "1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
3   araport11   five_prime_utr  2958676 2958703 .   +   .   gene_id "AT3G09650"; transcript_id "AT3G09650.1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
3   araport11   three_prime_utr 2961041 2961299 .   +   .   gene_id "AT3G09650"; transcript_id "AT3G09650.1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

yangao07 commented 3 years ago

Can you show me what the circRNA_ref.fa.fai look like by cat circRNA_ref.fa.fai?

KateK commented 3 years ago

isocirc27::3:2959612-2959797 185 30 185 186

yangao07 commented 3 years ago

This is actually a known bug that has been fixed in the current version. Are you using an older version of isoCirc?

KateK commented 3 years ago

I got 1.0.1

yangao07 commented 3 years ago

OK, then you can try to update it and re-plot.

Xinglab / isoCirc

annotation issue? #3