Closed KateK closed 1 year ago
Can you show me a few lines of these three files: nanopore_circ/isocirc.bed.exon.gtf, AT.gtf, and nanopore_circ/isocirc.bed.ovlp.gene.out?
nanopore_circ/isocirc.bed.exon.gtf
Pt isocirc exon 15483 15772 . + . gene_id "isocirc0"; transcript_id "isocirc0"; exon_number "1"; exon_id "isocirc0.1";
Pt isocirc exon 21057 21240 . - . gene_id "isocirc1"; transcript_id "isocirc1"; exon_number "1"; exon_id "isocirc1.1";
Pt isocirc exon 23983 24183 . - . gene_id "isocirc2"; transcript_id "isocirc2"; exon_number "1"; exon_id "isocirc2.1";
Pt isocirc exon 34822 34978 . - . gene_id "isocirc3"; transcript_id "isocirc3"; exon_number "1"; exon_id "isocirc3.1";
Pt isocirc exon 39893 40107 .
AT.gtf
#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2010-09
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1 araport11 gene 3631 5899 . + . gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1 araport11 transcript 3631 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1 araport11 exon 3631 3913 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1 araport11 CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
nanopore_circ/isocirc.bed.ovlp.gene.out
isocirc10 ATCG01180 RRN23S.2 -
isocirc11 ATCG00950 RRN23S.1 +
isocirc12 ATCG00950 RRN23S.1 +
isocirc13 ATCG00950 RRN23S.1 +
isocirc14 ATCG01110 NDHH -
isocirc15 ATCG01180 RRN23S.2 -
isocirc17 ATCG01180 RRN23S.2 -
They look normal to me.
Can you also run awk '(NF != 4)' nanopore_circ/isocirc.bed.ovlp.gene.out | head and let me know what you get?
isocirc39 AT5G46730 +
isocirc40 AT5G46730 +
isocirc41 AT5G46730 +
isocirc42 AT5G46730 +
isocirc43 AT5G46730 +
isocirc44 AT5G46730 +
isocirc45 AT5G46730 +
isocirc46 AT5G46730 +
isocirc48 AT5G51530 -
isocirc49 AT5G51530 -
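The awk filter above keeps rows whose whitespace-separated field count is not 4. A rough Python equivalent is sketched below for anyone who prefers to inspect the file programmatically. The function name is hypothetical, and the sketch assumes, as the output above suggests, that a missing gene name simply leaves the row with three fields.

```python
def rows_with_unexpected_field_count(lines, expected=4):
    """Yield (line_number, fields) for rows whose field count differs from `expected`.

    Like awk's default splitting, str.split() with no argument splits on runs
    of whitespace, so an empty column disappears instead of counting as a field.
    """
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # unlike awk (NF == 0), blank lines are skipped here
        if len(fields) != expected:
            yield line_number, fields


# Two rows copied from the outputs shown in this thread.
sample = [
    "isocirc14 ATCG01110 NDHH -\n",
    "isocirc39 AT5G46730 +\n",
]
for line_number, fields in rows_with_unexpected_field_count(sample):
    print(line_number, fields)
```

Only the second sample row is reported, because its gene-name column is missing.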
OK, I think I know what is happening here.
Some of the genes in your GTF file AT.gtf may have only a gene_id tag and no gene_name tag.
You can run grep AT5G46730 AT.gtf | grep gene_name to check whether that gene has a gene_name tag.
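To run the same check for every gene at once rather than one grep per ID, something like the following sketch could be used. It is not part of isoCirc; the function name is hypothetical, and it assumes a standard tab-separated GTF whose ninth column holds attributes written as key "value";.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def genes_missing_name(gtf_lines):
    """Return the sorted gene_ids that never occur together with a gene_name tag."""
    seen, named = set(), set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header lines such as "#!genome-build TAIR10"
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attributes = fields[8]
        match = GENE_ID_RE.search(attributes)
        if match is None:
            continue
        seen.add(match.group(1))
        if GENE_NAME_RE.search(attributes):
            named.add(match.group(1))
    return sorted(seen - named)
```

Called as genes_missing_name(open("AT.gtf")), it should list AT5G46730 if that gene indeed has no gene_name tag. The sketch reports a gene only when none of its records carry the tag; whether a partially tagged gene would also cause trouble is not something this thread establishes.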
That's it: I see no output. How should I handle this? Do I have to assign gene names manually?
Can I run the pipeline from this point without repeating the previous steps?
For now, isoCirc requires both a gene name and a gene ID in the GTF file, so you have to add those names yourself. I think you can simply copy the gene ID as the gene name.
Actually, you can run isoCirc from this step if you have the source code downloaded from GitHub.
Then try this: python /path/to/isocirc_repo/isoCirc_pipeline/isocirc/hcBSJ_fullIso.py, replacing /path/to/isocirc_repo with your own path.
The inputs of this script are the files you generated in the previous steps, and the last 3 positional arguments are the 3 output files.
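For the suggestion above of copying the gene ID as the gene name, a minimal sketch follows. It is not an isoCirc utility: the function names and the output path are hypothetical, and it assumes a tab-separated GTF with attributes written as key "value";. Lines that already have a gene_name tag, and header lines, are passed through unchanged.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)";')


def add_missing_gene_name(line):
    """Return `line` with gene_name set to the gene_id when no gene_name is present."""
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or 'gene_name "' in fields[8]:
        return line
    match = GENE_ID_RE.search(fields[8])
    if match is None:
        return line
    # Insert the new tag directly after the gene_id tag.
    fields[8] = (
        fields[8][: match.end()]
        + ' gene_name "%s";' % match.group(1)
        + fields[8][match.end():]
    )
    return "\t".join(fields) + newline


def fix_gtf(in_path, out_path):
    """Write a copy of the GTF at `in_path` with missing gene_name tags filled in."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(add_missing_gene_name(line))
```

For example, fix_gtf("AT.gtf", "AT.named.gtf") would write a patched copy and leave the original file untouched, so the result can be checked before it is passed to isoCirc.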
Thanks! I managed. I have three more questions.
== 15:36:44-Apr-15-2021 == [Gene structure from annotation file] grep NA ../../backup/ReferenceGenomes/reference_at/Arabidopsis_thaliana.TAIR10.35.isocirc.gtf > .//gene.gtf
Traceback (most recent call last):
File "/home/kasia/miniconda2/bin/isocircPlot", line 674, in <module>
align_details, all_struct, isoform_struct, gene_struct = align_to_circRNA_fa(ref_fa, anno_gtf, read_fa, read_fa_len, isocirc_bed, isocirc_read_list, out_dir, circRNA_ref_fa)
File "/home/kasia/miniconda2/bin/isocircPlot", line 631, in align_to_circRNA_fa
ref_seq = ref_seqs[ref_name]
KeyError: 'isocirc0'
If your reference genome and GTF file match, I am afraid the reason is indeed that the data quality is not very good.
For the plot, can you show me a few lines of the file you provided to the plotting script?
I used the same file as in the original isoCirc pipeline, but I converted it from FASTQ to FASTA. Should I prepare another FASTA file with the reads mapping to the circRNA region and use "isocircX" in the header?
== 15:36:44-Apr-15-2021 == [Gene structure from annotation file] grep NA ../../backup/ReferenceGenomes/reference_at/Arabidopsis_thaliana.TAIR10.35.isocirc.gtf > .//gene.gtf
I think the problem is in your list file. The gene name should not be NA.
2. I see in the BED file that the score value is 0 everywhere. If these are high quality, why do they have a score of 0?
isoCirc does not assign scores to the bed records, so they are always 0.
When I changed to another circRNA, I got the same error. I copied the gene name and read information from the isocirc.out file.
== 14:17:33-Apr-16-2021 == [Gene structure from annotation file] grep HCF152 ../../backup/ReferenceGenomes/reference_at/Arabidopsis_thaliana.TAIR10.35.isocirc.gtf > .//gene.gtf
Traceback (most recent call last):
File "/home/kasia/miniconda2/bin/isocircPlot", line 674, in <module>
align_details, all_struct, isoform_struct, gene_struct = align_to_circRNA_fa(ref_fa, anno_gtf, read_fa, read_fa_len, isocirc_bed, isocirc_read_list, out_dir, circRNA_ref_fa)
File "/home/kasia/miniconda2/bin/isocircPlot", line 631, in align_to_circRNA_fa
ref_seq = ref_seqs[ref_name]
KeyError: 'isocirc27'
And the gene.gtf file was generated, so the grep command worked.
3 araport11 gene 2958676 2961299 . + . gene_id "AT3G09650"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding";
3 araport11 transcript 2958676 2961299 . + . gene_id "AT3G09650"; transcript_id "AT3G09650.1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
3 araport11 exon 2958676 2961299 . + . gene_id "AT3G09650"; transcript_id "AT3G09650.1"; exon_number "1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT3G09650.1.exon1";
3 araport11 CDS 2958704 2961037 . + 0 gene_id "AT3G09650"; transcript_id "AT3G09650.1"; exon_number "1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT3G09650.1";
3 araport11 start_codon 2958704 2958706 . + 0 gene_id "AT3G09650"; transcript_id "AT3G09650.1"; exon_number "1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
3 araport11 stop_codon 2961038 2961040 . + 0 gene_id "AT3G09650"; transcript_id "AT3G09650.1"; exon_number "1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
3 araport11 five_prime_utr 2958676 2958703 . + . gene_id "AT3G09650"; transcript_id "AT3G09650.1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
3 araport11 three_prime_utr 2961041 2961299 . + . gene_id "AT3G09650"; transcript_id "AT3G09650.1"; gene_name "HCF152"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
Can you show me what circRNA_ref.fa.fai looks like by running cat circRNA_ref.fa.fai?
isocirc27::3:2959612-2959797 185 30 185 186
This is actually a known bug that has been fixed in the current version. Are you using an older version of isoCirc?
I have 1.0.1.
OK, then you can try to update it and re-plot.
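Before updating, the mismatch can be checked by hand. One plausible reading of the .fai line above is that the sequence is indexed under the full name isocirc27::3:2959612-2959797 while the plotting script looks up the bare ID isocirc27, which would produce the KeyError. That is an inference from the output shown, not something the thread confirms, and the sketch below is only a diagnostic with hypothetical function names, not the fix applied in isoCirc.

```python
def fai_names(fai_lines):
    """Return the sequence names (first column) listed in a .fai index."""
    return [line.split()[0] for line in fai_lines if line.strip()]


def names_matching_id(names, circ_id, separator="::"):
    """Return index names equal to `circ_id` or starting with `circ_id` + separator."""
    prefix = circ_id + separator
    return [name for name in names if name == circ_id or name.startswith(prefix)]


# The index line shown above in this thread.
fai = ["isocirc27::3:2959612-2959797 185 30 185 186\n"]
names = fai_names(fai)
print("isocirc27" in names)                   # is the bare ID an index key?
print(names_matching_id(names, "isocirc27"))  # names that carry the ID as a prefix
```

If the bare ID is absent but a prefixed name is present, the index and the lookup key disagree, which is consistent with the advice above to update isoCirc and re-plot.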
Hello,
I ran into this problem:
I assume this is a problem with the GTF file (maybe the chromosome format?). Could you please tell me how to handle it? Also, can I run the pipeline from this point without repeating the previous steps if they are fine?